Creating Phylogenetic trees
Here are the steps that we used to create the Phylogenetic tree for ortholog genes descibed in the paper.
1.) We have to extract the protein sequences from Phytozome genome portal.
2.) We made BLAST indices using the following command for Eucalyptus grandis v2.0, Malus domestica v1.0, Salix purpurea v1.0, Theobroma cacao v2.1, Citrus sinensis v1.1, Prunus persica v1.0 and Betula pendula v1.0 sequences.
####
# CREATING BLAST INDICES
####
formatdb -p T -i Betula_pendula.pep.fix.fasta -n betula -o T
formatdb -p T -i Egrandis_297_v2.0.protein.fa -n eugra -o T
formatdb -p T -i Csinensis_154_v1.1.protein.fa -n citrus -o T
formatdb -p T -i Tcacao_233_v1.1.protein.fa -n cacao -o T
formatdb -p T -i Spurpurea_289_v1.0.protein.fa -n sarracenia -o T
formatdb -p T -i Mdomestica_196_v1.0.protein.fa -n malus -o T
formatdb -p T -i Ppersica_139_peptide.fa -n persica -o T
# Merege all protein sequences and generate BLAST indices
cat Betula_pendula.pep.fix.fasta Egrandis_297_v2.0.protein.fa Csinensis_154_v1.1.protein.fa Tcacao_233_v1.1.protein.fa Spurpurea_289_v1.0.protein.fa Mdomestica_196_v1.0.protein.fa Ppersica_139_peptide.fa > all_$
formatdb -p T -i all_protein.fa -n all -o T3.) BLASTp 146 protein sequences against previously generated indices using the following command.
# (2) Betula pendula v1.0
blastp -query query.fasta \
-db betula \
-evalue 1e-5 -num_threads 8 -out potri_betula.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (2) Theobroma cacao v2.1
blastp -query query.fasta \
-db cacao \
-evalue 1e-5 -num_threads 8 -out potri_cacao.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (3) Citrus sinensis v1.1
blastp -query query.fasta \
-db citrus \
-evalue 1e-5 -num_threads 8 -out potri_citrus.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (4) Eucalyptus grandis v2.0
blastp -query query.fasta \
-db eugra \
-evalue 1e-5 -num_threads 8 -out potri_eugra.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (5) Salix purpurea v1.0
blastp -query query.fasta \
-db sarracenia \
-evalue 1e-5 -num_threads 8 -out potri_sarracenia.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (6) Malus domestica v1.0
blastp -query query.fasta \
-db malus \
-evalue 1e-5 -num_threads 8 -out potri_malus.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
# (7) Prunus persica v1.0
blastp -query query.fasta \
-db persica \
-evalue 1e-5 -num_threads 8 -out potri_persica.outfmt6 \
-outfmt "6 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore"
4.) Filtering the Best BLAST hits by sorting bit score ascending order and e-value descending order using the following command.
# Filtering best BLAST
sort -k1,1 -k12,12nr -k11,11g potri_betula.outfmt6 | sort -u -k1,1 > potri_betula.besthit
sort -k1,1 -k12,12nr -k11,11g potri_cacao.outfmt6 | sort -u -k1,1 > potri_cacao.besthit
sort -k1,1 -k12,12nr -k11,11g potri_citrus.outfmt6 | sort -u -k1,1 > potri_citrus.besthit
sort -k1,1 -k12,12nr -k11,11g potri_eugra.outfmt6 | sort -u -k1,1 > potri_eugra.besthit
sort -k1,1 -k12,12nr -k11,11g potri_sarracenia.outfmt6 | sort -u -k1,1 > potri_sarracenia.besthit
sort -k1,1 -k12,12nr -k11,11g potri_malus.outfmt6 | sort -u -k1,1 > potri_malus.besthit
sort -k1,1 -k12,12nr -k11,11g potri_persica.outfmt6 | sort -u -k1,1 > potri_persica.besthit
5.) Extract the Best BLAST protein sequences using fastcmd.
# Merge all beshits into one file
cat *.besthit | awk '{print $2}' > all_besthit.genes
# Extract best blast sequences
fastacmd -d all -i all_besthit.genes -l 10000000000000 -o out.fasta
# Remove lcl| from sequence header
sed -i 's/lcl|//g' out.fasta
sed -i 's/*//g out.fasta
6.) Then we added 146 Populus trichocarpa sequences from query.fasta file and 87 Arabidopsis thaliana sequences from PlantGenIE Best BLAST results.
cat out.fasta query.fasta arabidopsis_best_hits.fasta > final_input.fasta7.) Now we used final_input.fasta as input to
MAFFT Multiple Sequence Alignment Software.
MAFFT flavour: auto
Gap extension penalty: 0.123
Gap opening penalty: 1.53
Direction of nucleotide sequences : do not adjust direction
Matrix selection : No matrix
Reorder output: No
Output format: FASTA8.) Then we used the above output as input to BMGE to find phylogenetic informative regions with the following parameters.
Sliding window size: 3
Maximum entropy threshold: 0.5
Gap rate cutoff: 0.5
Minimum block size: 5
If sequence type is DNA:
Matrix: PAM250
If sequence type is Protein or Codon:
Matrix: BLOSUM629.) Then we estimate the Approximately Maximum-Likelihood Phylogenies using FastTree.
Model: if nucleotide sequence: gtr, if amino-acid sequence: lg
Use Gamma distribution: Yes
Join options: neighbor-joining (default)10.) Final results is available on our FTP. Tree can be visulized in different newick Tree viewers such as ITOL or TreeView.