Skip to content

Instantly share code, notes, and snippets.

@irusri
Last active October 26, 2020 16:11
Show Gist options
  • Select an option

  • Save irusri/1705eedd321838cf749b5ae3e5559bda to your computer and use it in GitHub Desktop.

Select an option

Save irusri/1705eedd321838cf749b5ae3e5559bda to your computer and use it in GitHub Desktop.

Creating Phylogenetic trees

Here are the steps that we used to create the Phylogenetic tree for ortholog genes descibed in the paper.

1.) We have to extract the protein sequences from Phytozome genome portal.

2.) We made BLAST indices using the following command for Eucalyptus grandis v2.0, Malus domestica v1.0, Salix purpurea v1.0, Theobroma cacao v2.1, Citrus sinensis v1.1, Prunus persica v1.0 and Betula pendula v1.0 sequences.

####
# CREATING BLAST INDICES
####
formatdb -p T -i Betula_pendula.pep.fix.fasta -n betula -o T
formatdb -p T -i Egrandis_297_v2.0.protein.fa -n eugra -o T
formatdb -p T -i Csinensis_154_v1.1.protein.fa -n citrus -o T
formatdb -p T -i Tcacao_233_v1.1.protein.fa -n cacao -o T
formatdb -p T -i Spurpurea_289_v1.0.protein.fa -n sarracenia -o T
formatdb -p T -i Mdomestica_196_v1.0.protein.fa -n malus -o T
formatdb -p T -i Ppersica_139_peptide.fa -n persica -o T

# Merege all protein sequences and generate BLAST indices
cat Betula_pendula.pep.fix.fasta Egrandis_297_v2.0.protein.fa Csinensis_154_v1.1.protein.fa Tcacao_233_v1.1.protein.fa Spurpurea_289_v1.0.protein.fa Mdomestica_196_v1.0.protein.fa Ppersica_139_peptide.fa > all_$
formatdb -p T -i all_protein.fa -n all -o T

3.) BLASTp 146 protein sequences against previously generated indices using the following command.

# (2) Betula pendula v1.0
blastp -query query.fasta \
-db betula \
-evalue 1e-5 -num_threads 8 -out potri_betula.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (2) Theobroma cacao v2.1
blastp -query query.fasta \
-db cacao \
-evalue 1e-5 -num_threads 8 -out potri_cacao.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (3) Citrus sinensis v1.1
blastp -query query.fasta \
-db citrus \
-evalue 1e-5 -num_threads 8 -out potri_citrus.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (4) Eucalyptus grandis v2.0
blastp -query query.fasta \
-db eugra \
-evalue 1e-5 -num_threads 8 -out potri_eugra.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (5) Salix purpurea v1.0
blastp -query query.fasta \
-db sarracenia \
-evalue 1e-5 -num_threads 8 -out potri_sarracenia.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (6) Malus domestica v1.0
blastp -query query.fasta \
-db malus \
-evalue 1e-5 -num_threads 8 -out potri_malus.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

# (7) Prunus persica v1.0
blastp -query query.fasta \
-db persica \
-evalue 1e-5 -num_threads 8 -out potri_persica.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"

4.) Filtering the Best BLAST hits by sorting bit score ascending order and e-value descending order using the following command.

# Filtering best BLAST
sort -k1,1 -k12,12nr -k11,11g  potri_betula.outfmt6  | sort -u -k1,1 > potri_betula.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_cacao.outfmt6   | sort -u -k1,1 > potri_cacao.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_citrus.outfmt6  | sort -u -k1,1 > potri_citrus.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_eugra.outfmt6   | sort -u -k1,1 > potri_eugra.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_sarracenia.outfmt6 | sort -u -k1,1 > potri_sarracenia.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_malus.outfmt6   | sort -u -k1,1 > potri_malus.besthit
sort -k1,1 -k12,12nr -k11,11g  potri_persica.outfmt6 | sort -u -k1,1 > potri_persica.besthit

5.) Extract the Best BLAST protein sequences using fastcmd.

# Merge all beshits into one file
cat *.besthit | awk '{print $2}' > all_besthit.genes

# Extract best blast sequences
fastacmd -d all -i all_besthit.genes -l 10000000000000 -o out.fasta

# Remove lcl| from sequence header
sed -i 's/lcl|//g' out.fasta
sed -i 's/*//g out.fasta

6.) Then we added 146 Populus trichocarpa sequences from query.fasta file and 87 Arabidopsis thaliana sequences from PlantGenIE Best BLAST results.

cat out.fasta query.fasta arabidopsis_best_hits.fasta > final_input.fasta

7.) Now we used final_input.fasta as input to MAFFT Multiple Sequence Alignment Software.

MAFFT flavour: auto 
Gap extension penalty: 0.123
Gap opening penalty: 1.53
Direction of nucleotide sequences : do not adjust direction
Matrix selection : No matrix
Reorder output: No
Output format: FASTA

8.) Then we used the above output as input to BMGE to find phylogenetic informative regions with the following parameters.

Sliding window size: 3
Maximum entropy threshold: 0.5
Gap rate cutoff: 0.5
Minimum block size: 5
If sequence type is DNA:
Matrix: PAM250
If sequence type is Protein or Codon:
Matrix: BLOSUM62

9.) Then we estimate the Approximately Maximum-Likelihood Phylogenies using FastTree.

Model: if nucleotide sequence: gtr, if amino-acid sequence: lg
	Use Gamma distribution: Yes
	Join options: neighbor-joining (default)

10.) Final results is available on our FTP. Tree can be visulized in different newick Tree viewers such as ITOL or TreeView.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment