parsing - How to parse a file in bash according to another file
I have a tricky problem, which I'll try to illustrate with the example below. I'd prefer to solve it in awk, but any working solution is welcome.
I have a template file that tells me the sequence of columns (in order):
template:
impact;distance;strand;flags;variant_class;symbol;symbol_source;hgnc_id;biotype;canonical;tsl;appris;ccds;ensp;swissprot;trembl;uniparc;refseq_match;gene_pheno;sift;polyphen;exon;intron;domains;hgvsc;hgvsp;hgvs_offset;gmaf;afr_maf;amr_maf;eas_maf;eur_maf;sas_maf;aa_maf;ea_maf;exac_maf;exac_adj_maf;exac_afr_maf;exac_amr_maf;exac_eas_maf;exac_fin_maf;exac_nfe_maf;exac_oth_maf;exac_sas_maf;clin_sig;somatic;pheno;pubmed;motif_name;motif_pos;high_inf_pos;motif_score_change
All values are semicolon-separated.
I also have an input file, which I cannot parse by position because each record is missing some of the template's values.
input:
impact=modifier;strand=1;variant_class=deletion;symbol=kif1b;biotype=protein_coding;ensp=np_055889.2;intron=24/46;hgvsc=nm_015074.3:c.2537+467delt;hgvs_offset=9
impact=modifier;strand=1;variant_class=deletion;symbol=kif1b;biotype=protein_coding;canonical=yes;ensp=xp_005263490.1;intron=26/48;hgvsc=xm_005263433.1:c.2675+467delt;hgvs_offset=9
impact=modifier;distance=4811;strand=-1;variant_class=deletion;symbol=c1orf127;biotype=protein_coding;canonical=yes;ensp=np_001164225.1;gmaf=-:0.1749;amr_maf=-:0.3011;eas_maf=-:0.1542;eur_maf=-:0.0794;sas_maf=-:0.2008;aa_maf=-:0.091
impact=modifier;strand=1;variant_class=insertion;biotype=misc_rna;canonical=yes;intron=1/1;hgvsc=xr_158744.2:n.96+764dupa;hgvs_offset=8;gmaf=a:0.4225;amr_maf=a:0.2723;eas_maf=a:0.5187;eur_maf=a:0.4643;sas_maf=a:0.3767;aa_maf=a:0.5613
impact=modifier;strand=1;variant_class=insertion;biotype=misc_rna;intron=1/1;hgvsc=xr_241119.1:n.41+204dupa;hgvs_offset=8;gmaf=a:0.4225;amr_maf=a:0.2723;eas_maf=a:0.5187;eur_maf=a:0.4643;sas_maf=a:0.3767;aa_maf=a:0.5613
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030588.1;intron=2/4;hgvsc=nm_001035511.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030589.1;intron=2/4;hgvsc=nm_001035512.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001030590.1;intron=1/3;hgvsc=nm_001035513.1:c.20+9288dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;ensp=np_001265101.1;intron=2/3;hgvsc=nm_001278172.1:c.77+43dupt;hgvs_offset=11
impact=modifier;strand=1;variant_class=insertion;symbol=sdhc;biotype=protein_coding;canonical=yes;ensp=np_002992.1;intron=2/5;hgvsc=nm_003001.3:c.77+43dupt;hgvs_offset=1
I want tab-separated output. If a value listed in the template is missing from an input record, put a - mark in its place. The same keys must always end up in the same column.
output example:
impact distance strand flags variant_class symbol symbol_source hgnc_id biotype
impact=modifier - strand=1 - variant_class=deletion symbol=kif1b - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=kif1b - - biotype=protein_coding
impact=modifier distance=4811 strand=-1 - variant_class=deletion - symbol=c1orf127 - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion - - - biotype=misc_rna
impact=modifier - strand=1 - variant_class=insertion - - biotype=misc_rna
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=insertion symbol=sdhc - - biotype=misc_rna
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=misc_rna
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_coding
impact=modifier - strand=1 - variant_class=deletion symbol=sdhc - - biotype=protein_codin
Note: the example output does not show all of the columns.
My attempt to parse the input file with awk:
awk -v OFS="\t" '{split($1,arr1,";"); print arr1[1],arr1[2]..}'
This splitting works, but it does not give me the template's column order and it does not handle missing values. Thanks for your help.
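To make the problem with the positional split concrete, here is a minimal sketch using two shortened records taken from the input above. The variable names `records` and `second_fields` are made up for this illustration. Slot 2 of the split holds a different key in each record, which is why printing by position cannot line the columns up.

```shell
# Two shortened records from the question: the second has a distance field,
# the first does not, so every later field shifts by one position.
records='impact=modifier;strand=1;variant_class=deletion
impact=modifier;distance=4811;strand=-1'

# Positional split, as in the attempt above: print whatever lands in slot 2.
second_fields=$(printf '%s\n' "$records" |
  awk '{ split($1, arr, ";"); print arr[2] }')

# Slot 2 is "strand=..." for the first record but "distance=..." for the second.
printf '%s\n' "$second_fields"
```

Any fix therefore has to look fields up by key rather than by position.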
Note: this is a more detailed version of my earlier question, "How to find and print a specific character in bash".
This is easily done in native bash; no awk is required.
#!/usr/bin/env bash
#          ^^^ ^^^^
# Uses bash from the PATH, so on macOS you can use MacPorts bash 4
# ...vs. the 3.x version installed by Apple in /bin.

# read the template values into an array
IFS=';' read -r -a template < template

# print the header
printf '%s\t' "${template[@]}"; printf '\n'

# declare an associative array (requires bash 4)
declare -A data

# iterate over lines of the input file, reading each into an array
while IFS=';' read -r -a items; do

  # populate the data map with the key/value items from this line
  data=( )
  for item in "${items[@]}"; do
    key=${item%%=*}
    value=${item#*=}
    data[$key]=$value
  done

  # iterate over the template items, emitting the field for each
  for item in "${template[@]}"; do
    if [[ ${data[$item]} ]]; then
      printf -- '%s=%s\t' "$item" "${data[$item]}"
    else
      printf -- '-\t'
    fi
  done

  # ...and emit a newline after processing each input line
  printf '%s\n'
done <input
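Since the question asked for awk if possible, the same key-lookup idea can also be sketched in awk. This is not part of the answer above, only an illustration under a few assumptions: the template is a single line, the function name `align_fields` is made up, and the demo uses a shortened four-column template rather than the real one. Unlike the bash version, this sketch keeps a field whose value is empty (`key=`) instead of replacing it with `-`, and it does not emit a trailing tab.

```shell
#!/usr/bin/env bash
# align_fields TEMPLATE INPUT  (hypothetical helper name)
# Prints the template as a header, then one tab-separated row per INPUT line,
# with "-" wherever a template key is absent from that line.
align_fields() {
  awk -F';' -v OFS='\t' '
    # First file: the single-line template; remember the column order.
    NR == FNR {
      ncols = split($0, cols, ";")
      for (i = 1; i <= ncols; i++)
        printf "%s%s", cols[i], (i < ncols ? OFS : ORS)
      next
    }
    # Second file: map each key to its "key=value" text for this line.
    {
      split("", seen)                       # portable way to empty an array
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq > 0) seen[substr($i, 1, eq - 1)] = $i
      }
      # Emit the fields in template order, "-" for missing keys.
      for (i = 1; i <= ncols; i++)
        printf "%s%s", ((cols[i] in seen) ? seen[cols[i]] : "-"), (i < ncols ? OFS : ORS)
    }
  ' "$1" "$2"
}

# Demo with a shortened template and two shortened records from the question.
tmp=$(mktemp -d)
printf '%s\n' 'impact;distance;strand;symbol' > "$tmp/template"
printf '%s\n' 'impact=modifier;strand=1;symbol=kif1b' \
              'impact=modifier;distance=4811;strand=-1' > "$tmp/input"
align_fields "$tmp/template" "$tmp/input"
rm -rf "$tmp"
```

With the real files from the question, the call would be `align_fields template input`. Passing the template first lets awk read the column order before it sees any data line.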