Skip to content

Instantly share code, notes, and snippets.

@hussius
Created November 17, 2016 09:39
Show Gist options
  • Select an option

  • Save hussius/5bd8f044eb4fda04bd83b5c3dbff5020 to your computer and use it in GitHub Desktop.

Select an option

Save hussius/5bd8f044eb4fda04bd83b5c3dbff5020 to your computer and use it in GitHub Desktop.
Sum transcript TPMs by gene using the FASTA file used for a Kallisto index
import sys
import gzip
if len(sys.argv)<3:
sys.exit("python sum_per_gene.py <cDNA FASTA file> <TPM table>")
ensg = {}
mapf = gzip.open(sys.argv[1])
ctr = 0
for line in mapf:
if not line.startswith('>'): continue
ctr += 1
if ctr % 1000 == 0: sys.stderr.write(ctr)
c = line.split(' ')
tx = c[0][1:]
gene = c[3].split(':')[1].split('.')[0]
ensg[tx] = gene
gene_tpm = {}
tpm = open(sys.argv[2])
print(tpm.readline().strip()) # Header
for line in tpm:
c = line.split()
tx = c[0]
vals = []
for val in c[1:len(c)]:
vals.append(float(val))
gene = ensg[tx]
if gene in gene_tpm:
prev = gene_tpm[gene]
curr = vals
new = [prev[i]+curr[i] for i in range(0,len(prev))]
gene_tpm[gene] = new
else:
gene_tpm[gene] = vals
for g in gene_tpm:
string_rep = [x for x in map(str, gene_tpm[g])]
print(g + '\t' + '\t'.join(string_rep))
@hussius
Copy link
Copy Markdown
Author

hussius commented Nov 17, 2016

Intended to collapse transcript TPMs by gene name based on the original FASTA file that Kallisto used for quantification

python sum_per_gene.py gzipped FASTA file TPM table

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment