martijnvermaat · April 28, 2016 13:55 · Apr 28, 2016
diff --git a/README.md b/README.md
@@ -0,0 +1,35 @@
+# Genomic GenBank files and transcript identifiers
+
+Using the attached script we check all genomic GenBank files in the server
+cache for the `transcript_id` field in their `mRNA` features.
+
+## Using the cache from `mutalyzer.nl`
+
+Category | Files checked | Transcripts | With ID | Without ID | %
+-------- | ------------- | ----------- | ------- | ---------- | ------
+`NG_`    | 3076          | 5989        | 5978    | 11         | 0.2 %
+`NC_`    | 8             | 13649       | 13649   | 0          | 0.0 %
+`UD_`    | 52760         | 306619      | 306617  | 2          | 0.0 %
+*Other*  | 280           | 2215        | 2116    | 99         | 4.5 %
+
+## Using the cache from `test.mutalyzer.nl`
+
+Category | Files checked | Transcripts | With ID | Without ID | %
+-------- | ------------- | ----------- | ------- | ---------- | ------
+`NG_`    | 961           | 1876        | 1865    | 11         | 0.6 %
+`NC_`    | 10            | 13676       | 13649   | 27         | 0.2 %
+`UD_`    | 1803          | 5875        | 5869    | 6          | 0.1 %
+*Other*  | 860           | 6385        | 5542    | 843        | 13.2 %
+
+## Notes
+
+The *Other* category is defined as all GenBank files not starting with:
+- `NG_`
+- `NC_`
+- `UD_`
+- `NM_`
+- `XM_`
+- `NR_`
+- `XR_`
+- `NP_`
+- `XP_`
diff --git a/check.py b/check.py
@@ -0,0 +1,42 @@
+#!/usr/bin/env python
+#
+# Check annotated transcripts in genomic GenBank files for the 'transcript_id
+# field.
+#
+# For every annotated transcript, this reports:
+# 1. Gene symbol (or '<unknown gene>').
+# 2. 'pseudo' if this is annotated as a pseudo gene, 'normal' otherwise.
+# 3. 'good' if the 'transcript_id' field is present, 'missing' otherwise.
+#
+# Example usage:
+#
+#   $ ./check.py NG_1245.gb
+#   MAST3 normal good
+#   MAST3 normal good
+#   MAST3 normal good
+#   BLAAT3 pseudo missing
+#   ABC5 normal missing
+#   PIK3R2 normal good
+
+
+import sys
+from Bio import SeqIO
+
+
+record = SeqIO.read(sys.argv[1], 'genbank')
+for feature in record.features:
+    if feature.type == 'source':
+        if feature.qualifiers.get('organelle') == 'mitochondrion':
+            break
+        if any(t in ('mRNA', 'transcribed RNA') for t in feature.qualifiers.get('mol_type', [])):
+            break
+    if feature.type == 'mRNA':
+        print ','.join(feature.qualifiers.get('gene', ['<unknown gene>'])),
+        if 'pseudo' in feature.qualifiers:
+            print 'pseudo',
+        else:
+            print 'normal',
+        if 'transcript_id' in feature.qualifiers:
+            print 'good'
+        else:
+            print 'missing'
No results found