@george-githinji
Forked from huddlej/LICENSE.txt
Created November 23, 2021 03:18
Command line tool to convert annotated phylogenetic trees from nextstrain.org's JSON format to a tidy data frame of tree attributes

Convert Auspice tree JSON to a data frame

An example script to convert an Auspice tree JSON to a data frame for processing by downstream analyses.

Setup

Install Nextstrain. The script imports json_to_tree from augur.utils and also requires pandas.

Usage

Convert a tree JSON from the ncov workflow into a table. By default, the script exports every attribute annotated on the root node of the tree and emits rows only for the tips.

python3 auspice_tree_to_table.py \
  auspice/ncov_gisaid_europe.json \
  ncov_gisaid_europe.tsv
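
To make the default behavior concrete, here is a minimal sketch of how the attribute list is derived when --attributes is not given. The root node below is a hand-written, hypothetical example in the Auspice v2 layout (node_attrs and branch_attrs), not real data, and default_attributes is a helper name introduced only for this illustration.

```python
# Hypothetical, hand-written root node in the Auspice v2 layout (not real data).
root = {
    "name": "NODE_0000000",
    "node_attrs": {"num_date": {"value": 2019.97}, "div": 0},
    "branch_attrs": {"mutations": {}},
}


def default_attributes(root_node):
    """Return the sorted union of node-level and branch-level attribute names."""
    node_keys = set(root_node.get("node_attrs", {}))
    branch_keys = set(root_node.get("branch_attrs", {}))
    return sorted(node_keys | branch_keys)


print(default_attributes(root))
```

One consequence of this approach: an attribute that is missing from the root node but present on other nodes is not exported unless it is requested by name.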

Alternatively, request values for internal nodes as well as tips, and export only specific attributes of each node.

python3 auspice_tree_to_table.py \
  auspice/ncov_gisaid_europe.json \
  ncov_gisaid_europe.tsv \
  --attributes num_date S1_mutations \
  --include-internal-nodes  

The output of the above command looks like this:

name	num_date	S1_mutations
NODE_0000000	2019.97	0.00
hCoV-19/Wuhan/Hu-1/2019	2019.98	0.00
NODE_0000023	2019.97	0.00
hCoV-19/Wuhan/WH01/2019	2019.98	0.00
hCoV-19/mink/Netherlands/NB02_06KS/2020	2020.33	2.00
NODE_0000002	2020.05	0.00
hCoV-19/Hangzhou/HZ-1/2020	2020.05	0.00
NODE_0000003	2020.17	0.00
hCoV-19/USA/CA-CZB-1092/2020	2020.33	0.00
hCoV-19/France/10060KV/2020	2020.17	0.00
NODE_0000005	2020.00	0.00
hCoV-19/Greece/226_35576/2020	2020.21	0.00
NODE_0000006	2020.16	0.00
hCoV-19/Spain/Valencia22/2020	2020.19	0.00
hCoV-19/Spain/Valencia002/2020	2020.17	0.00
hCoV-19/Spain/Valencia003/2020	2020.18	0.00
hCoV-19/Spain/Valencia8/2020	2020.17	0.00
NODE_0000010	2020.05	1.00
NODE_0000007	2020.08	1.00
hCoV-19/Germany/BavPat3-ChVir1020/2020	2020.08	1.00
hCoV-19/Germany/BavPat1-ChVir929/2020	2020.08	1.00
NODE_0000012	2020.10	1.00
hCoV-19/mink/Netherlands/NB01_01KS/2020	2020.33	1.00
NODE_0000019	2020.17	1.00
hCoV-19/France/10006HC/2020	2020.17	1.00
hCoV-19/France/10078MA/2020	2020.17	1.00
NODE_0000020	2020.15	1.00
hCoV-19/USA/FL_5125/2020	2020.16	1.00
NODE_0000011	2020.17	1.00
hCoV-19/Greece/227_35969/2020	2020.22	1.00
NODE_0000013	2020.18	1.00
hCoV-19/USA/NY-PV09116/2020	2020.21	1.00
hCoV-19/Greece/41_36861/2020	2020.24	1.00
NODE_0000014	2020.12	1.00
NODE_0000015	2020.16	1.00
hCoV-19/Greece/246_32206/2020	2020.19	1.00
hCoV-19/Greece/248_32261/2020	2020.19	1.00
NODE_0000016	2020.14	1.00
hCoV-19/France/10023FD/2020	2020.17	2.00
NODE_0000017	2020.17	1.00
hCoV-19/USA/WI-UW-268/2020	2020.26	1.00
hCoV-19/France/40003KA/2020	2020.17	1.00
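
As a sketch of downstream processing, the table can be read back with Python's standard csv module. The rows below are copied from the output above; in practice the TSV file would be opened instead. Treating names that start with "NODE_" as internal nodes is an assumption based on the names in this particular output.

```python
import csv
import io

# A few rows copied from the output above; in practice, open the TSV file instead.
tsv_text = (
    "name\tnum_date\tS1_mutations\n"
    "NODE_0000000\t2019.97\t0.00\n"
    "hCoV-19/Wuhan/Hu-1/2019\t2019.98\t0.00\n"
    "hCoV-19/mink/Netherlands/NB02_06KS/2020\t2020.33\t2.00\n"
    "NODE_0000010\t2020.05\t1.00\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Assumption: internal nodes are the rows whose name starts with "NODE_".
tips = [row for row in rows if not row["name"].startswith("NODE_")]

# Values are read as strings, so convert before comparing numerically.
mutated_tips = [row["name"] for row in tips if float(row["S1_mutations"]) > 0]
print(mutated_tips)
```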
import argparse
from augur.utils import json_to_tree
import json
import pandas as pd
import sys


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("tree", help="auspice tree JSON")
    parser.add_argument("output", help="tab-delimited file of attributes per node of the given tree")
    parser.add_argument("--include-internal-nodes", action="store_true", help="include data from internal nodes in output")
    parser.add_argument("--attributes", nargs="+", help="names of attributes to export from the given tree")

    args = parser.parse_args()

    # Load tree from JSON.
    with open(args.tree, "r", encoding="utf-8") as fh:
        tree_json = json.load(fh)

    tree = json_to_tree(tree_json)

    # Collect attributes per node from the tree to export.
    records = []

    if args.attributes:
        attributes = args.attributes
    else:
        attributes = sorted(
            set(tree.root.node_attrs.keys()) |
            set(tree.root.branch_attrs.keys())
        )

    for node in tree.find_clades():
        if node.is_terminal() or args.include_internal_nodes:
            record = {
                "name": node.name
            }

            for attribute in attributes:
                if attribute in node.node_attrs:
                    value = node.node_attrs[attribute]
                elif attribute in node.branch_attrs:
                    value = node.branch_attrs[attribute]
                else:
                    print(f"Could not find attribute '{attribute}' for node '{node.name}'.", file=sys.stderr)
                    value = None

                if value is not None:
                    if isinstance(value, dict) and "value" in value:
                        value = value["value"]

                record[attribute] = value

            records.append(record)

    # Convert records to a data frame and save as a tab-delimited file.
    df = pd.DataFrame(records)
    df.to_csv(args.output, sep="\t", header=True, index=False, columns=["name"] + list(attributes), float_format="%.2f")
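
The per-attribute lookup in the script can be illustrated without Augur or pandas. The following is a minimal sketch, not part of the original script: unwrap and node_to_record are hypothetical helper names, and the attribute values are made up for illustration. It shows the same three cases the script handles: a value wrapped as {"value": ...}, a bare value, and a missing attribute that is reported on stderr and recorded as None.

```python
import sys


def unwrap(value):
    """Return the inner "value" of a {"value": ...} wrapper, else the value itself."""
    if isinstance(value, dict) and "value" in value:
        return value["value"]
    return value


def node_to_record(name, node_attrs, branch_attrs, attributes):
    """Build one table row, checking node_attrs before branch_attrs."""
    record = {"name": name}
    for attribute in attributes:
        if attribute in node_attrs:
            record[attribute] = unwrap(node_attrs[attribute])
        elif attribute in branch_attrs:
            record[attribute] = unwrap(branch_attrs[attribute])
        else:
            # Missing attributes are reported and stored as None.
            print(f"Could not find attribute '{attribute}' for node '{name}'.", file=sys.stderr)
            record[attribute] = None
    return record


# Hypothetical attribute values, for illustration only.
record = node_to_record(
    "tip_A",
    node_attrs={"num_date": {"value": 2020.33, "confidence": [2020.3, 2020.4]}, "div": 3},
    branch_attrs={},
    attributes=["num_date", "div", "clade_membership"],
)
print(record)
```

Because node_attrs is checked first, a name that appears in both dictionaries resolves to the node-level value.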