parsing - How to parse a file in bash according to another file



I have a serious problem, which I'll try to illustrate with the example below. I would prefer to solve it in awk, but any working solution is welcome.

I have a template file that tells me the sequence of columns (in order):

template:

impact;distance;strand;flags;variant_class;symbol;symbol_source;hgnc_id;biotype;canonical;tsl;appris;ccds;ensp;swissprot;trembl;uniparc;refseq_match;gene_pheno;sift;polyphen;exon;intron;domains;hgvsc;hgvsp;hgvs_offset;gmaf;afr_maf;amr_maf;eas_maf;eur_maf;sas_maf;aa_maf;ea_maf;exac_maf;exac_adj_maf;exac_afr_maf;exac_amr_maf;exac_eas_maf;exac_fin_maf;exac_nfe_maf;exac_oth_maf;exac_sas_maf;clin_sig;somatic;pheno;pubmed;motif_name;motif_pos;high_inf_pos;motif_score_change 

All values are semicolon-separated.

I also have an input file, which I cannot parse because some of the values listed in the template are missing from it.

input:

impact=modifier;strand=1;variant_class=deletion;symbol=kif1b;biotype=protein_coding;ensp=np_055889.2;intron=24/46;hgvsc=nm_015074.3:c.2537+467delt;hgvs_offset=9
impact=modifier;strand=1;variant_class=deletion;symbol=kif1b;biotype=protein_coding;canonical=yes;ensp=xp_005263490.1;intron=26/48;hgvsc=xm_005263433.1:c.2675+467delt;hgvs_offset=9
impact=modifier;distance=4811;strand=-1;variant_class=deletion;symbol=c1orf127;biotype=protein_coding;canonical=yes;ensp=np_001164225.1;gmaf=-:0.1749;amr_maf=-:0.3011;eas_maf=-:0.1542;eur_maf=-:0.0794;sas_maf=-:0.2008;aa_maf=-:0.091
impact=modifier;strand=1;variant_class=insertion;biotype=misc_rna;canonical=yes;intron=1/1;hgvsc=xr_158744.2:n.96+764dupa;hgvs_offset=8;gmaf=a:0.4225;amr_maf=a:0.2723;eas_maf=a:0.5187;eur_maf=a:0.4643;sas_maf=a:0.3767;aa_maf=a:0.5613
impact=modifier;strand=1;variant_class=insertion;biotype=misc_rna;intron=1/1;hgvsc=xr_241119.1:n.41+204dupa;hgvs_offset=8;gmaf=a:0.4225;amr_maf=a:0.2723;eas_maf=a:0.5187;eur_maf=a:0.4643;sas_maf=a:0.3767;aa_maf=a:0.5613
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030588.1;intron=2/4;hgvsc=nm_001035511.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030589.1;intron=2/4;hgvsc=nm_001035512.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030590.1;intron=1/3;hgvsc=nm_001035513.1:c.20+9288dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001265101.1;intron=2/3;hgvsc=nm_001278172.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;canonical=yes;ensp=np_002992.1;intron=2/5;hgvsc=nm_003001.3:c.77+43dupt;hgvs_offset=1

I want tab-separated output. If a value listed in the template is missing from an input line, a - mark should be put in its place, so that the same kind of value always ends up in the same column.

output example:

impact  distance    strand  flags   variant_class   symbol  symbol_source   hgnc_id biotype
impact=modifier -   strand=1    -   variant_class=deletion  symbol=kif1b    -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=kif1b    -   -   biotype=protein_coding
impact=modifier distance=4811   strand=-1   -   variant_class=deletion  -   symbol=c1orf127 -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion -   -   -   biotype=misc_rna
impact=modifier -   strand=1    -   variant_class=insertion     -   -   biotype=misc_rna
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=insertion symbol=sdhc -   -   biotype=misc_rna
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=misc_rna
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_coding
impact=modifier -   strand=1    -   variant_class=deletion  symbol=sdhc -   -   biotype=protein_codin

Note: the example output does not show all of the columns.

My attempt at parsing the input file with awk:

awk -v OFS="\t" '{split($1,arr1,";"); print arr1[1],arr1[2]..}'

This parsing works, but it does not give me the template's column order and it does not handle missing values. Thank you for your help.

Note: this is a more fully explained version of my earlier question about how to find and print a specific character in bash.

This is easily done in native bash; no awk is required.

#!/usr/bin/env bash
#          ^^^ ^^^^
# Uses the bash found on the PATH, so on Mac OS X this can pick up MacPorts bash 4
# ...rather than the 3.x version installed by Apple in /bin.

# read the template values into an array
IFS=';' read -r -a template < template

# print the header
printf '%s\t' "${template[@]}"; printf '\n'

# declare an associative array (requires bash 4)
declare -A data

# iterate over the lines of the input file, reading each into an array
while IFS=';' read -r -a items; do

  # populate the data map with the key/value items from this line
  data=( )
  for item in "${items[@]}"; do
    key=${item%%=*}      # everything before the first "="
    value=${item#*=}     # everything after the first "="
    data[$key]=$value
  done

  # iterate over the template items, emitting one field for each
  for item in "${template[@]}"; do
    if [[ ${data[$item]} ]]; then
      printf -- '%s=%s\t' "$item" "${data[$item]}"
    else
      printf -- '-\t'
    fi
  done

  # ...and emit a newline after processing each input line
  printf '\n'

done <input
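Since the question asks for awk if possible, here is a minimal sketch of the same idea in awk: read the template first, then print each input line's items in template order, with - for anything missing. The function name reorder_by_template is hypothetical, and the sketch assumes the template is a single non-empty line and that every item has the form key=value.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer).
# Usage: reorder_by_template TEMPLATE_FILE INPUT_FILE
reorder_by_template() {
  awk -v OFS='\t' '
    # First file (the template): remember the column order, print the header.
    NR == FNR {
      sub(/[ \t\r]+$/, "")                 # drop trailing whitespace
      n = split($0, cols, ";")
      for (i = 1; i <= n; i++)
        printf "%s%s", cols[i], (i < n ? OFS : ORS)
      next
    }
    # Second file (the input): map key -> "key=value", then emit in order.
    {
      sub(/[ \t\r]+$/, "")
      if ($0 == "") next                   # skip blank lines
      split("", data)                      # portable way to clear the map
      m = split($0, items, ";")
      for (i = 1; i <= m; i++) {
        eq = index(items[i], "=")          # position of the first "="
        if (eq > 0) data[substr(items[i], 1, eq - 1)] = items[i]
      }
      for (i = 1; i <= n; i++)
        printf "%s%s", ((cols[i] in data) ? data[cols[i]] : "-"), (i < n ? OFS : ORS)
    }
  ' "$1" "$2"
}
```

With the files from the question it would be called as reorder_by_template template input. Unlike the bash loop, this version does not leave a trailing tab at the end of each row, and it does not need bash 4.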
