Bioinformatics One-liners

Stephen Turner maintains a very useful list of one-liner commands on GitHub.

Basic awk & sed

Extract fields 2, 4, and 5 from file.txt:

awk '{print $2,$4,$5}' file.txt
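
Note that awk splits fields on runs of whitespace by default; if file.txt is tab-delimited and fields may contain spaces, set the separator explicitly:

awk -F'\t' '{print $2,$4,$5}' file.txt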

Print each line where the 5th field is equal to abc123:

awk '$5 == "abc123"' file.txt

Print each line where the 5th field is not equal to abc123:

awk '$5 != "abc123"' file.txt

Print each line whose 7th field matches the regular expression:

awk '$7 ~ /^[a-f]/' file.txt

Print each line whose 7th field does not match the regular expression:

awk '$7 !~ /^[a-f]/' file.txt

Get unique entries in file.txt based on column 2 (takes only the first instance):

awk '!arr[$2]++' file.txt
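
This works because arr[$2]++ evaluates to the value before incrementing (0 the first time a key is seen), so the negation is true only for a key's first occurrence. A quick way to convince yourself, using inline input:

printf 'x a\ny a\nz b\n' | awk '!arr[$2]++'    # prints "x a" and "z b"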

Print rows where column 3 is larger than column 5 in file.txt:

awk '$3>$5' file.txt

Sum column 1 of file.txt:

awk '{sum+=$1} END {print sum}' file.txt

Compute the mean of column 2:

awk '{x+=$2} END {print x/NR}' file.txt
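
NR counts every input line, so a header row or blank lines will skew the result. A small variant that skips one header line:

awk 'NR>1 {x+=$2; n++} END {if (n) print x/n}' file.txt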

Number each line in file.txt:

sed = file.txt | sed 'N;s/\n/ /'
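
The first sed prints each line number on its own line (that is all = does); the second joins each number/line pair with a space. A simpler equivalent in awk:

awk '{print NR, $0}' file.txt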

Replace all occurrences of foo with bar in file.txt:

sed 's/foo/bar/g' file.txt

Trim leading whitespace in file.txt:

sed 's/^[ \t]*//' file.txt

Trim trailing whitespace in file.txt:

sed 's/[ \t]*$//' file.txt

Trim leading and trailing whitespace in file.txt:

sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
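
Note that \t inside a bracket expression is a GNU sed extension; a portable alternative uses the POSIX character class:

sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt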

Delete blank lines in file.txt:

sed '/^$/d' file.txt

awk & sed for bioinformatics

Return all lines on chromosome 1 between 1 Mb and 2 Mb in file.txt (assumes chromosome in column 1 and position in column 3; the same concept can be used to return only variants above a specific allele frequency):

cat file.txt | awk '$1=="1"' | awk '$3>=1000000' | awk '$3<=2000000'
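
The three awk calls can be collapsed into one (same result, a single process):

awk '$1=="1" && $3>=1000000 && $3<=2000000' file.txt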

Basic sequence statistics. Print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, its frequency, and its percentage of the total in myfile.fq:

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
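
The same program spread over several lines with comments, which may be easier to adapt (identical logic):

awk '((NR-2)%4==0) {            # sequence lines are lines 2, 6, 10, ... of the FASTQ
    read=$1; total++; count[read]++
} END {
    for (read in count) {
        if (!max || count[read] > max) { max=count[read]; maxRead=read }
        if (count[read] == 1) unique++
    }
    print total, unique, unique*100/total, maxRead, count[maxRead], count[maxRead]*100/total
}' myfile.fq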

Convert .bam back to .fastq:

samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq
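
Note that reads mapped to the reverse strand are stored reverse-complemented in the BAM, so the one-liner above does not recover the original read sequences for those records. samtools ships a converter that handles this:

samtools fastq file.bam > file.fq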

Keep only top bit scores in BLAST hits (best bit score only; assumes hits are grouped by query, best hit first, with the bit score in column 14):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt

Keep only top bit scores in BLAST hits (within 5 of the top):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt

Split a multi-FASTA file into individual FASTA files:

awk '/^>/{s=++d".fa"} {print > s}' multi.fa
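
Each header line starts a new file named 1.fa, 2.fa, and so on. A small variation names the files by sequence ID instead (assumes IDs are unique and filesystem-safe):

awk '/^>/{s=substr($1,2)".fa"} {print > s}' multi.fa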

Convert a FASTQ file to FASTA:

sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
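
The 1~4 step addressing is a GNU sed extension. On systems without GNU sed, the same conversion in awk:

awk 'NR%4==1 {sub(/^@/,">")} NR%4==1 || NR%4==2' file.fq > file.fa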

Extract every 4th line starting at the second line (extract the sequence from FASTQ file):

sed -n '2~4p' file.fq
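
As above, 2~4 is GNU-sed-specific; a portable awk equivalent:

awk 'NR%4==2' file.fq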

sort, uniq, cut, etc.

You can read the rest of the list on GitHub. Stephen Turner also wrote a blog post providing a nice introduction to it.



Written by M.