Last active
April 28, 2016 13:55
-
-
Save martijnvermaat/6465d41b0cc534c3fecf7ed946477886 to your computer and use it in GitHub Desktop.
Revisions
-
martijnvermaat created this gist
Apr 28, 2016 .There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,35 @@ # Genomic GenBank files and transcript identifiers Using the attached script we check all genomic GenBank files in the server cache for the `transcript_id` field in their `mRNA` features. ## Using the cache from `mutalyzer.nl` Category | Files checked | Transcripts | With ID | Without ID | % -------- | ------------- | ----------- | ------- | ---------- | ------ `NG_` | 3076 | 5989 | 5978 | 11 | 0.2 % `NC_` | 8 | 13649 | 13649 | 0 | 0.0 % `UD_` | 52760 | 306619 | 306617 | 2 | 0.0 % *Other* | 280 | 2215 | 2116 | 99 | 4.5 % ## Using the cache from `test.mutalyzer.nl` Category | Files checked | Transcripts | With ID | Without ID | % -------- | ------------- | ----------- | ------- | ---------- | ------ `NG_` | 961 | 1876 | 1865 | 11 | 0.6 % `NC_` | 10 | 13676 | 13649 | 27 | 0.2 % `UD_` | 1803 | 5875 | 5869 | 6 | 0.1 % *Other* | 860 | 6385 | 5542 | 843 | 13.2 % ## Notes The *Other* category is defined as all GenBank files not starting with: - `NG_` - `NC_` - `UD_` - `NM_` - `XM_` - `NR_` - `XR_` - `NP_` - `XP_` This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,42 @@ #!/usr/bin/env python # # Check annotated transcripts in genomic GenBank files for the 'transcript_id # field. # # For every annotated transcript, this reports: # 1. Gene symbol (or '<unknown gene>'). # 2. 'pseudo' if this is annotated as a pseudo gene, 'normal' otherwise. # 3. 'good' if the 'transcript_id' field is present, 'missing' otherwise. # # Example usage: # # $ ./check.py NG_1245.gb # MAST3 normal good # MAST3 normal good # MAST3 normal good # BLAAT3 pseudo missing # ABC5 normal missing # PIK3R2 normal good import sys from Bio import SeqIO record = SeqIO.read(sys.argv[1], 'genbank') for feature in record.features: if feature.type == 'source': if feature.qualifiers.get('organelle') == 'mitochondrion': break if any(t in ('mRNA', 'transcribed RNA') for t in feature.qualifiers.get('mol_type', [])): break if feature.type == 'mRNA': print ','.join(feature.qualifiers.get('gene', ['<unknown gene>'])), if 'pseudo' in feature.qualifiers: print 'pseudo', else: print 'normal', if 'transcript_id' in feature.qualifiers: print 'good' else: print 'missing'