maccman/dna_analysis_guide.md

Created April 5, 2026 14:33

The Hacker's Guide to DNA Sequencing and Offline Analysis with Cursor


Most consumer DNA tests (like 23andMe or Ancestry) use microarray genotyping, which looks at only a tiny fraction of your genome (less than 1%). If you want to examine nearly all of your genetic code, calculate genome-wide polygenic risk scores, and run analyses locally without handing your data to a third party, you need Whole Genome Sequencing (WGS) and an AI coding assistant like Cursor.

Here is a step-by-step guide to getting your whole-genome variant file (roughly 5GB uncompressed), downloading it to your local machine, and using Cursor to explore what it says about your health risks.

Step 1: Sequence Your DNA (Whole Genome Sequencing)

To do this, you need a service that provides 30x Whole Genome Sequencing (WGS) and allows you to download your raw VCF (Variant Call Format) file.

Recommended Provider: Nucleus Genomics

Why: They provide clinical-grade 30x WGS (reading nearly all of your roughly 3 billion base pairs, each about 30 times on average). Crucially, they allow you to download your raw .vcf.gz file directly from their web portal.
Alternative: Nebula Genomics or Dante Labs (though turnaround times can be very slow).

Action:

Order the kit, spit in the tube, and wait ~6-8 weeks.
Once your results are ready, log into the portal and navigate to the "Files" or "Download" section.
Download the VCF file. It will likely be a compressed .vcf.gz file of around 400MB to 1GB; uncompressed, it is roughly 5GB.
Move this file into a local folder on your computer (e.g., ~/dna-analysis/data/my_dna.vcf.gz).
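
Before going further, it can help to confirm the download is a readable VCF and to see which reference build it uses. The following is a minimal sketch, assuming a standard single-sample file; the function name `read_vcf_header` is made up for illustration, and not every provider writes a `##reference=` line.

```python
import gzip

def read_vcf_header(path):
    """Collect a few header fields from a gzipped VCF without reading the body."""
    info = {"fileformat": None, "reference": None, "samples": []}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##fileformat="):
                info["fileformat"] = line.split("=", 1)[1]
            elif line.startswith("##reference="):
                # Optional header line; may be absent in some files.
                info["reference"] = line.split("=", 1)[1]
            elif line.startswith("#CHROM"):
                # Columns after the ninth (FORMAT) are the sample names.
                info["samples"] = line.split("\t")[9:]
                break
            elif not line.startswith("#"):
                break  # reached data rows without seeing a #CHROM line
    return info
```

Calling `read_vcf_header("data/my_dna.vcf.gz")` should return quickly because it stops at the end of the header. If the reference is not GRCh38, the coordinates used in the later steps would need to match whatever build the file reports.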

Step 2: Open Cursor and Start Analyzing

Open the folder containing your VCF file in Cursor.

VCF files are massive text files that list every position where your DNA differs from the standard human reference genome (your variants). They are too big to open comfortably in a normal text editor, but they are easy to parse with Python scripts. This is where Cursor shines.
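
To make the format concrete, here is a rough sketch of how one tab-separated data row of a single-sample VCF can be decoded into alleles. The row below is made up for illustration, and `decode_genotype` is a hypothetical helper name, not part of any library.

```python
def decode_genotype(line):
    """Turn one single-sample VCF data row into (chrom, pos, alleles)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    # Allele index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    # FORMAT (column 9) names the per-sample subfields; find where GT sits.
    gt_index = fields[8].split(":").index("GT")
    gt = fields[9].split(":")[gt_index]
    # "/" separates unphased calls, "|" phased ones; "." means no call.
    calls = gt.replace("|", "/").split("/")
    bases = [alleles[int(c)] if c != "." else "." for c in calls]
    return chrom, pos, "/".join(bases)

# Made-up example row: heterozygous A/G at chr1:12345.
row = "chr1\t12345\t.\tA\tG\t812\tPASS\t.\tGT:DP\t0/1:31"
print(decode_genotype(row))  # ('chr1', 12345, 'A/G')
```

The genotype `0/1` means one copy of the reference allele and one copy of the first alternate allele, which is why the row decodes to `A/G`.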

Analysis Part 1: Monogenic (Single Gene) Traits

Some traits and disease risks are strongly influenced by just one or two variants in a single gene (like the APOE ε4 allele for Alzheimer's or pathogenic BRCA1 variants for breast cancer).

Prompt to paste into Cursor (Cmd+L / Ctrl+L):

"I have a highly compressed VCF file at data/my_dna.vcf.gz. I want to look up my specific genotype for famous health and longevity SNPs.

Write a Python script that:

Queries the Ensembl REST API to find the exact GRCh38 chromosomal coordinates for these specific rsIDs: rs429358 (APOE4), rs1801133 (MTHFR), rs1815739 (ACTN3 sprint vs endurance), and rs4680 (COMT warrior vs worrier).

Efficiently streams through my gzipped VCF file to find those exact coordinates.

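Parses my Genotype (GT) from the VCF row to tell me my exact alleles (e.g., A/G, T/T). If a coordinate is missing from the VCF, report it as likely homozygous reference, since a VCF usually lists only sites that differ from the reference.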

Outputs a highly readable terminal report with the results."

Cursor will write a Python script (usually using requests and gzip), run it, and report your alleles for those well-known variants. Expect the scan to take a minute or so rather than seconds, since the whole file has to be decompressed and read.
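
What Cursor generates will vary, but the core of such a script might look roughly like the sketch below. It uses `urllib` from the standard library instead of requests to stay dependency-free. The JSON field names (`mappings`, `assembly_name`, `seq_region_name`, `start`) reflect my understanding of the Ensembl variation endpoint and should be treated as assumptions to verify; the function names are hypothetical.

```python
import gzip
import json
import urllib.request

# Assumed endpoint shape for the Ensembl REST variation lookup.
ENSEMBL = "https://rest.ensembl.org/variation/human/{}?content-type=application/json"
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}

def fetch_record(rsid):
    """Download the Ensembl record for one rsID (network call)."""
    with urllib.request.urlopen(ENSEMBL.format(rsid), timeout=30) as resp:
        return json.load(resp)

def grch38_position(record):
    """Pick (chromosome, position) on a primary GRCh38 chromosome, or None."""
    for mapping in record.get("mappings", []):
        if (mapping.get("assembly_name") == "GRCh38"
                and mapping.get("seq_region_name") in PRIMARY):
            return mapping["seq_region_name"], int(mapping["start"])
    return None

def find_sites(vcf_lines, wanted):
    """wanted maps (chrom, pos) -> rsid. Returns (found, missing_rsids)."""
    remaining = dict(wanted)
    found = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        # Split only the first two columns until we know the row matters.
        chrom, pos, _rest = line.split("\t", 2)
        key = (chrom[3:] if chrom.startswith("chr") else chrom, int(pos))
        if key in remaining:
            fields = line.rstrip("\n").split("\t")
            # Keep REF, ALT and the first sample column for later decoding.
            found[remaining.pop(key)] = (fields[3], fields[4], fields[9])
            if not remaining:
                break  # everything located; stop reading early
    return found, sorted(remaining.values())

def lookup(rsids, vcf_path):
    """Tie the steps together: resolve coordinates, then scan the VCF once.

    Assumes every rsID resolves to a primary GRCh38 position.
    """
    wanted = {grch38_position(fetch_record(r)): r for r in rsids}
    with gzip.open(vcf_path, "rt") as vcf:
        return find_sites(vcf, wanted)
```

Any rsID that comes back in the missing list was not present as a row in the file, which in a variants-only VCF usually means homozygous reference (or no call) at that position.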

Analysis Part 2: Polygenic Risk Scores (PRS)

Most common diseases (like heart disease, diabetes, or depression) aren't caused by one gene; they are influenced by thousands to millions of common variants, each with a tiny effect. A Polygenic Risk Score (PRS) sums those small effects into a single number.
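
Written out, a raw score is a weighted sum over the M variants in a model, where w_i is the published weight for variant i and d_i is the number of copies of its effect allele you carry. The second line is a toy calculation with three made-up weights, not values from any real model.

```latex
\mathrm{PRS} = \sum_{i=1}^{M} w_i \, d_i, \qquad d_i \in \{0, 1, 2\}

\mathrm{PRS}_{\text{toy}} = (0.10)(2) + (-0.05)(1) + (0.02)(0) = 0.15
```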

The open-access PGS Catalog contains thousands of scoring files from peer-reviewed publications, developed by research groups at institutions like Harvard and the Broad Institute.

You can use Cursor to download these millions of weights and apply them to your DNA locally.

Prompt to paste into Cursor:

"I want to calculate my Polygenic Risk Score for Coronary Artery Disease entirely offline.

Write a Python script that does the following:

Fetches metadata for PGS ID PGS000013 (The Khera et al. 2018 CAD model) from the PGS Catalog API.

Downloads the GRCh38 harmonized scoring file (.txt.gz) for that model.

Loads all 6.6 million variant weights from that scoring file into memory (mapping chromosome and position to the effect allele and weight).

Streams my data/my_dna.vcf.gz file. For every variant in my DNA that matches a variant in the scoring file, check if my genotype contains the effect allele.

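Multiply my dosage (0, 1, or 2 copies of the effect allele) by the weight, and sum it all up to calculate my raw Polygenic Risk Score. Also report how many scoring-file variants were not found in my VCF, since those positions are normally homozygous reference and contribute a dosage of 2 whenever the effect allele is the reference allele.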

Make it highly optimized so it can scan my 5GB VCF in under 30 seconds."
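
The scoring steps in this prompt could be sketched roughly as follows. This is a simplified illustration, not the pipeline Cursor will necessarily produce. It assumes the harmonized file is tab-separated with `#` comment lines and columns named `hm_chr`, `hm_pos`, `effect_allele`, and `effect_weight`, which is worth confirming against the downloaded file. It also treats sites missing from the VCF as dosage 0, which undercounts whenever the effect allele is the reference allele.

```python
import csv

def load_weights(lines):
    """Map (chrom, pos) -> (effect_allele, weight) from scoring-file lines."""
    data = (l for l in lines if not l.startswith("#"))  # skip metadata header
    weights = {}
    for row in csv.DictReader(data, delimiter="\t"):
        if row["hm_chr"] and row["hm_pos"]:  # skip variants without coordinates
            key = (row["hm_chr"], int(row["hm_pos"]))
            weights[key] = (row["effect_allele"], float(row["effect_weight"]))
    return weights

def raw_score(vcf_lines, weights):
    """Return (score, matched_count) for a single-sample VCF."""
    total, matched = 0.0, 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        hit = weights.get((chrom, int(fields[1])))
        if hit is None:
            continue
        effect_allele, weight = hit
        alleles = [fields[3]] + fields[4].split(",")
        # The VCF spec places GT first in the sample column when present.
        calls = fields[9].split(":")[0].replace("|", "/").split("/")
        # Dosage = number of called alleles equal to the effect allele.
        dosage = sum(1 for c in calls if c != "." and alleles[int(c)] == effect_allele)
        total += dosage * weight
        matched += 1
    return total, matched
```

Both functions take any iterable of text lines, so they can be fed `gzip.open(path, "rt")` handles directly. Holding millions of weights in a plain dict uses a lot of memory, and the matched count is worth printing: a low match rate suggests a build or chromosome-naming mismatch.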

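Cursor will build the pipeline. Once you have the raw score, you can ask Cursor to help you interpret it. A raw score only means something relative to the distribution of scores in a reference population computed the same way.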

"My raw score for PGS000013 was 5.74. Can you look up the population average and standard deviation for this model and tell me what percentile of risk I fall into?"

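If you do obtain scores for a reference group computed with the same script and scoring file, converting your raw score to a percentile is a short calculation. The sketch below assumes the reference scores are roughly normally distributed; `percentile_from_reference` is a hypothetical helper, and no real reference values are implied.

```python
from statistics import NormalDist, mean, stdev

def percentile_from_reference(raw_score, reference_scores):
    """Return (z_score, percentile) of raw_score against a reference sample."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation
    z = (raw_score - mu) / sigma
    # Percentile under a normal approximation of the reference distribution.
    return z, 100 * NormalDist().cdf(z)
```

A score equal to the reference mean lands at the 50th percentile, and each standard deviation above it moves you further into the upper tail.
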
Why do this locally?

Absolute Privacy: Your DNA is the most sensitive data you own. The Python scripts download only public reference data (variant coordinates and scoring files) and read your gzipped VCF locally, so your actual genetic sequence never leaves your laptop.
Infinite Upgrades: When a new paper is published in 2028 with a better algorithm for predicting longevity, you don't have to wait for a company to update their dashboard. You just search the PGS Catalog, grab the new ID, and run your script again.
No Black Boxes: You see exactly how the math works, which alleles are contributing to your risk, and what the raw data actually says.
