In RNA-seq analysis, we often need to combine the expression estimates for the various isoforms of a gene into a single number. For example, tools such as Kallisto and Salmon report expression for each isoform separately. Those numbers need to be summed per gene before subsequent analysis steps, such as finding differentially expressed genes. This task is rather trivial for those using Perl or Python. If you want to do the entire analysis in R, the following code may help.
We first create a data frame with mock data for six isoforms of three genes and their corresponding expression levels.
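# dplyr provides %>%, mutate, group_by and select; stringr provides str_replace
library(dplyr)
library(stringr)

# isoform identifiers: gene name followed by an isoform suffix (_seqN)
genes=c("comp18_c0_seq1", "comp18_c0_seq2", "comp18_c0_seq3",
        "comp19_c0_seq1", "comp19_c0_seq2", "comp19_c1_seq1")
# mock expression level for each isoform
expr=c(1,2,3,4,5,6)
d=data.frame(genes,expr)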
The command mutate(g=str_replace(genes,"_seq.*","")) adds an extra column, g, that holds the gene name with the isoform identifier (the "_seq" suffix and everything after it) removed.
d %>% mutate(g=str_replace(genes,"_seq.*",""))
The following R code sums the expression levels of all isoforms of each gene and prints one row per gene.
d %>%
  mutate(g=str_replace(genes,"_seq.*","")) %>%
  group_by(g) %>%
  mutate(v=sum(expr)) %>%
  select(g,v) %>%
  unique
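As a possible alternative, the same per-gene totals can be obtained with summarise, which collapses each group to a single row and so makes the select and unique steps unnecessary. The sketch below is self-contained and reuses the mock data from above; the name gene_totals is only illustrative and does not appear in the original code.

```r
library(dplyr)
library(stringr)

# Mock data: six isoforms of three genes (same values as above)
d <- data.frame(
  genes = c("comp18_c0_seq1", "comp18_c0_seq2", "comp18_c0_seq3",
            "comp19_c0_seq1", "comp19_c0_seq2", "comp19_c1_seq1"),
  expr = c(1, 2, 3, 4, 5, 6)
)

# Strip the isoform suffix, group by gene, and sum within each group.
# gene_totals is a hypothetical name chosen for this example.
gene_totals <- d %>%
  mutate(g = str_replace(genes, "_seq.*", "")) %>%
  group_by(g) %>%
  summarise(v = sum(expr))

print(gene_totals)
```

With the mock data, this should give 6 for comp18_c0, 9 for comp19_c0 and 6 for comp19_c1, matching the result of the pipeline above.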