Skip to content

Instantly share code, notes, and snippets.

@martijnvermaat
Last active April 28, 2016 13:55
Show Gist options
  • Select an option

  • Save martijnvermaat/6465d41b0cc534c3fecf7ed946477886 to your computer and use it in GitHub Desktop.

Select an option

Save martijnvermaat/6465d41b0cc534c3fecf7ed946477886 to your computer and use it in GitHub Desktop.

Revisions

  1. martijnvermaat created this gist Apr 28, 2016.
    35 changes: 35 additions & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,35 @@
    # Genomic GenBank files and transcript identifiers

    Using the attached script we check all genomic GenBank files in the server
    cache for the `transcript_id` field in their `mRNA` features.

    ## Using the cache from `mutalyzer.nl`

    Category | Files checked | Transcripts | With ID | Without ID | %
    -------- | ------------- | ----------- | ------- | ---------- | ------
    `NG_` | 3076 | 5989 | 5978 | 11 | 0.2 %
    `NC_` | 8 | 13649 | 13649 | 0 | 0.0 %
    `UD_` | 52760 | 306619 | 306617 | 2 | 0.0 %
    *Other* | 280 | 2215 | 2116 | 99 | 4.5 %

    ## Using the cache from `test.mutalyzer.nl`

    Category | Files checked | Transcripts | With ID | Without ID | %
    -------- | ------------- | ----------- | ------- | ---------- | ------
    `NG_` | 961 | 1876 | 1865 | 11 | 0.6 %
    `NC_` | 10 | 13676 | 13649 | 27 | 0.2 %
    `UD_` | 1803 | 5875 | 5869 | 6 | 0.1 %
    *Other* | 860 | 6385 | 5542 | 843 | 13.2 %

    ## Notes

    The *Other* category is defined as all GenBank files not starting with:
    - `NG_`
    - `NC_`
    - `UD_`
    - `NM_`
    - `XM_`
    - `NR_`
    - `XR_`
    - `NP_`
    - `XP_`
    42 changes: 42 additions & 0 deletions check.py
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,42 @@
    #!/usr/bin/env python
    #
    # Check annotated transcripts in genomic GenBank files for the 'transcript_id
    # field.
    #
    # For every annotated transcript, this reports:
    # 1. Gene symbol (or '<unknown gene>').
    # 2. 'pseudo' if this is annotated as a pseudo gene, 'normal' otherwise.
    # 3. 'good' if the 'transcript_id' field is present, 'missing' otherwise.
    #
    # Example usage:
    #
    # $ ./check.py NG_1245.gb
    # MAST3 normal good
    # MAST3 normal good
    # MAST3 normal good
    # BLAAT3 pseudo missing
    # ABC5 normal missing
    # PIK3R2 normal good


    import sys
    from Bio import SeqIO


    record = SeqIO.read(sys.argv[1], 'genbank')
    for feature in record.features:
    if feature.type == 'source':
    if feature.qualifiers.get('organelle') == 'mitochondrion':
    break
    if any(t in ('mRNA', 'transcribed RNA') for t in feature.qualifiers.get('mol_type', [])):
    break
    if feature.type == 'mRNA':
    print ','.join(feature.qualifiers.get('gene', ['<unknown gene>'])),
    if 'pseudo' in feature.qualifiers:
    print 'pseudo',
    else:
    print 'normal',
    if 'transcript_id' in feature.qualifiers:
    print 'good'
    else:
    print 'missing'