Introduction
This page offers a detailed introduction to HybSuite. Feel free to explore!
Pipeline overview
HybSuite performs end-to-end hybrid capture (Hyb-Seq) phylogenomic analysis from raw reads (Hyb-Seq preferred; compatible with RNA-seq, WGS, and genome skimming data) to phylogenetic trees.
The full pipeline is composed of 4 stages:

Stage 1: NGS dataset construction
- (1) Optionally download public raw reads from NCBI (via SRA Toolkit);
- (2) Integrate user-provided raw reads (if provided);
- (3) Trim raw reads (via Trimmomatic);
Stage 2: Data assembly and paralog retrieval
- (1) Target loci assembly and putative paralog retrieval (via HybPiper);
- (2) Integrate pre-assembled sequences (if provided);
- (3) Filter putative paralogs;
- (4) Plot recovery and paralog heatmaps for original and filtered sequences;
Stage 3: Paralog handling
- Optionally execute seven paralog-handling methods (HRS, RLWP, LS, MO, MI, RT, 1to1; see our Tutorial) and generate filtered alignments for downstream analysis:
- HRS:
(1) Retrieve sequences via the command hybpiper retrieve_sequences in HybPiper;
(2) Integrate pre-assembled sequences (if provided);
(3) Filter sequences by length to remove potentially mis-assembled sequences;
(4) Multiple sequence alignment (via MAFFT) and trimming (via trimAl or HMMCleaner);
(5) Filter trimmed alignments to generate final alignments.
- RLWP:
(1) Retrieve sequences via hybpiper retrieve_sequences in HybPiper;
(2) Integrate pre-assembled sequences (if provided);
(3) Filter sequences by length to remove potentially mis-assembled sequences;
(4) Remove loci with putative paralogs in more than a user-defined number of samples;
(5) Multiple sequence alignment (via MAFFT) and trimming (via trimAl or HMMCleaner);
(6) Filter trimmed alignments to generate final alignments.
- PhyloPypruner pipeline (LS, MI, MO, RT, 1to1):
(1) Multiple sequence alignment (via MAFFT) and trimming (via trimAl or HMMCleaner) for all putative paralogs;
(2) Gene tree inference for all putative paralogs;
(3) Obtain orthogroup alignments using tree-based orthology inference algorithms (via PhyloPypruner);
(4) Realign (via MAFFT) and trim (via trimAl or HMMCleaner) the orthogroup alignments;
(5) Filter trimmed orthogroup alignments to generate final alignments.
- ParaGone pipeline (MI, MO, RT, 1to1):
(1) Use the directory containing all putative paralogs generated in Stage 2 as input;
(2) Obtain orthogroup alignments using tree-based orthology inference algorithms (via ParaGone);
(3) Filter trimmed orthogroup alignments to generate final alignments.
Stage 4: Species tree inference
- Multiple species tree inference methods are available.
Features
- Transparent: Full workflow visibility with real-time progress logging at each step
- Reproducible: Automatically archives exact software commands & parameters for every run
- Modular: Execute individual stages or the complete pipeline in one command
- Flexible: 7 paralog-handling methods & 5+ species tree inference options
- Scalable: Built-in parallelization for large-scale phylogenomic datasets
Advantages
1. End-to-end pipeline from reads to trees
- Processes data from raw reads to phylogenetic trees with single-command workflows
- Supports both full pipeline execution and modular stage-specific operations
- Minimizes manual intervention while maintaining flexibility
2. Unique functionality of integrating pre-assembled sequences
- Allows for integrating pre-assembled loci sequences into the working dataset. (click here to grasp skills)
3. Customizable sequence filtering strategies
- Dual filtering strategies for both loci and samples
- Configurable thresholds for read depth, missing data, and sequence quality
- Enables dataset optimization for different study goals
4. Advanced paralog-handling methods
- Implements 7 distinct methods for paralog detection and processing
- Includes both similarity-based and topology-based approaches
- Improves orthology assessment accuracy
5. Multi-method phylogenetic tree inference
6. Practical extension scripts
- plot_paralog_heatmap.py (click here to grasp skills)
- plot_recovery_heatmap.py (click here to grasp skills)
- modified_phypartspiecharts.py (click here to grasp skills)
7. Parallel processing
- Parallel processing across samples and loci (option -process), which can significantly improve computational efficiency
1 - Changelog
1.1.7 January, 2026
- New function: Steps control in stage 4
- Added support for controlling individual steps within Stage 4, allowing users to selectively run specific steps (e.g., gene tree inference, alignment trimming, species tree inference) rather than executing the entire stage in one go. See here for details.
1.1.6 September, 2025
New dependency: Plotly
Plotly has been integrated into the new script plot_recovery_heatmap_v2.py to generate an interactive HTML heatmap visualizing target locus recovery in Stage 2. This heatmap provides useful guidance for parameter selection in Stages 3β4.
TreeShrink integrated into Stage 3
TreeShrink has been incorporated into Stage 3 as an optional processing step. Users can enable it by setting the option -run_treeshrink to TRUE. TreeShrink removes genes with excessively long branches and is available for all seven paralog-handling pipelines except ParaGone, as TreeShrink is already implemented within the ParaGone pipeline.
1.1.5 September, 2025 - MAJOR UPDATE!
1.1.3-1.1.4 August, 2025
Fixed some bugs in stage control. These versions have been abandoned.
1.1.2 August, 2025
Integrated ASTRAL-IV into the pipeline stage 4.
Usage Update:
New dependency:
1.1.1 August, 2025
Fixed some common bugs.
2 - Example dataset
This page provides detailed instructions on how to run the example dataset included with HybSuite.
1. Download the example dataset
If you have downloaded the HybSuite source package, a directory named example_dataset is already included. In this case, no additional download is required.
Alternatively, you can download the repository on your server using:
git clone https://github.com/Yuxuanliu-HZAU/HybSuite
cd HybSuite/example_dataset
The directory example_dataset contains two folders, Angiosperms353 and Arabidopsis100, encompassing all inputs for running the HybSuite pipeline on the two corresponding example datasets in our analyses.
Example dataset 1: Angiosperms353
Angiosperms353/
├── Input_list.txt
├── Target_file_Angiosperms353.fasta
└── Input_sequences/
    ├── Elaeagnus_pungens.fasta
    └── Hippophae_rhamnoides.fasta
Input_list.txt
This file documents taxon names and their corresponding sequence sources (given in the second column, separated by a tab):
Elaeagnus_angustifolia SRR12569928
Elaeagnus_bambusetorum SRR27547630
Elaeagnus_henryi SRR15533155
Elaeagnus_macrophylla SRR23618743
Elaeagnus_mollis SRR30566771
Hippophae_neurocarpa SRR17549374
Hippophae_salicifolia ERR7621632
Hippophae_tibetana SRR17549370
Shepherdia_argentea ERR7621633
Barbeya_oleoides SRR16214280 Outgroup
Elaeagnus_oldhamii A
Elaeagnus_pungens B
Hippophae_rhamnoides B
- Identifiers prefixed with SRR or ERR: public raw NGS data of the corresponding samples (the first column), ready to be downloaded by the HybSuite pipeline.
- Identifier A: user-provided raw NGS data of the corresponding samples (the first column), ready to be input to the HybSuite pipeline.
- Identifier B: user-provided pre-assembled sequences of the corresponding samples (the first column), ready to be input to the HybSuite pipeline.
- Identifier Outgroup: specifies the outgroup taxon.
Note
- In this example dataset, the input data include public raw data, user-provided raw data, and pre-assembled sequences. Thus, the corresponding data are provided in the directory Input_sequences.
- The outgroup taxon is Barbeya_oleoides.
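As a rough illustration of the list format above, here is a small Python sketch (our own helper, not part of HybSuite) showing how such a tab-separated sample list could be parsed:

```python
def parse_input_list(path):
    """Parse a HybSuite-style Input_list.txt.

    Each line is: <taxon>TAB<identifier>[TAB Outgroup]. Identifiers
    prefixed with SRR/ERR are public accessions; 'A' marks
    user-provided raw reads and 'B' marks pre-assembled sequences.
    """
    samples = {}
    outgroups = []
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines
            samples[fields[0]] = fields[1]
            if "Outgroup" in fields[2:]:
                outgroups.append(fields[0])
    return samples, outgroups
```

For the Angiosperms353 list above, this would map Elaeagnus_pungens to "B" and report Barbeya_oleoides as the outgroup.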
Input_sequences/
This directory should contain user-provided raw reads, pre-assembled sequences, or both, according to the information provided in Input_list.txt.
- Type 1: user-provided raw reads
In our analysis, only the data of species Elaeagnus oldhamii are user-provided raw reads, which need to be downloaded here prior to running the HybSuite pipeline.
After downloading the raw data, convert them to FASTQ.GZ format and move them to this directory. The two paired-end files should be named:
Elaeagnus_oldhamii_1.fastq.gz
Elaeagnus_oldhamii_2.fastq.gz
- Type 2: pre-assembled sequences
Two taxa with pre-assembled sequences are provided: Elaeagnus_pungens and Hippophae_rhamnoides (corresponding to the taxon names marked with the identifier B in Input_list.txt). Their FASTA files are named Elaeagnus_pungens.fasta and Hippophae_rhamnoides.fasta, respectively (<taxon>.fasta).
Target_file_Angiosperms353.fasta
This file is the target sequence file for Angiosperms353.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the header line:
>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
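The naming rule above can be expressed as a tiny Python helper (ours, for illustration only; HybSuite does this internally):

```python
def locus_from_header(header: str) -> str:
    """Extract the locus name from a target-file FASTA header.

    HybSuite expects the locus name after the final hyphen,
    e.g. '>Elaeagnus-pungens-4471' yields '4471'.
    """
    # Strip the leading '>' and any description after whitespace,
    # then take everything after the last hyphen.
    name = header.lstrip(">").split()[0]
    return name.rsplit("-", 1)[-1]

print(locus_from_header(">Elaeagnus-pungens-4471"))    # 4471
print(locus_from_header(">Hippophae-rhamnoides-4527"))  # 4527
```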
Example dataset 2: Arabidopsis100
Arabidopsis100/
├── Input_list.txt
└── Target_file_Arabidopsis100.fasta
Input_list.txt
This file documents taxon names and their corresponding sequence sources (given in the second column, separated by a tab):
Elaeagnus angustifolia SRR26705271
Elaeagnus bambusetorum SRR26757993
Elaeagnus henryi SRR26705270
Elaeagnus macrophylla SRR26753865
Elaeagnus mollis SRR26758012
Elaeagnus oldhamii SRR26705501
Elaeagnus pungens SRR26705285
Hippophae neurocarpa SRR26705287
Hippophae rhamnoides SRR26756417
Hippophae salicifolia SRR26705274
Hippophae tibetana SRR26704952
Shepherdia argentea SRR26756705
Barbeya_oleoides SRR26756183 Outgroup
Note
- In this example dataset, all input data are public raw data from NCBI SRA (identifiers prefixed with SRR/ERR), which will be downloaded by the HybSuite pipeline. Thus, no data need to be provided locally by users.
- The outgroup taxon is Barbeya_oleoides.
Target_file_Arabidopsis_thaliana100.fasta
This file is the target sequence file for Arabidopsis100.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the header line:
>Locus-1
MAFRRVLTTVILFCYLLISSQSIEFKNSQKPHKIQGPIKTIVVVVMENRSFDHILGWLKSTRPEIDGLTGKESNPLNVSDPNSKKIFVSDDAVFVDMDPGHSFQAIREQIFGSNDTSGDPKMNGFAQQSESMEPGMAKNVMSGFKPEVLPVYTELANEFGVFDRWFASVPTSTQPNRFYVHSATSHGCSSNVKKDLVKGFPQKTIFDSLDENGLSFGIYYQNIPATFFFKSLRRLKHLVKFHSYALKFKLDAKLGKLPNYSVVEQRYFDIDLFPANDDHPSHDVAAGQRFVKEVYETLRSSPQWKEMALLITYDEHGGFYDHVPTPVKGVPNPDGIIGPDPFYFGFDRLGVRVPTFLISPWIEKGTVIHEPEGPTPHSQFEHSSIPATVKKLFNLKSHFLTKRDAWAGTFEKYFRIRDSPRQDCPEKLPEVKLSLRPWGAKEDSKLSEFQVELIQLASQLVGDHLLNSYPDIGKNMTVSEGNKYAEDAVQKFLEAGMAALEAGADENTIVTMRPSLTTRTSPSEGTNKYIGSY*
>Locus-2
MSDQQLETEINFWGETSEEDYFNLKGIIGSKSFFTSPRGLNLFTRSWLPSSSSPPRGLIFMVHGYGNDVSWTFQSTPIFLAQMGFACFALDIEGHGRSDGVRAYVPSVDLVVDDIISFFNSIKQNPKFQGLPRFLFGESMGGAICLLIQFADPLGFDGAVLVAPMCKISDKVRPKWPVDQFLIMISRFLPTWAIVPTEDLLEKSIKVEEKKPIAKRNPMRYNEKPRLGTVMELLRVTDYLGKKLKDVSIPFIIVHGSADAVTDPEVSRELYEHAKSKDKTLKIYDGMMHSMLFGEPDDNIEIVRKDIVSWLNDRCGGDKTKTQV*
>Locus-3
MSSRENPSGICKSIPKLISSFVDTFVDYSVSGIFLPQDPSSQNEILQTRFEKPERLVAIGDLHGDLEKSREAFKIAGLIDSSDRWTGGSTMVVQVGDVLDRGGEELKILYFLEKLKREAERAGGKILTMNGNHEIMNIEGDFRYVTKKGLEEFQIWADWYCLGNKMKTLCSGLDKPKDPYEGIPMSFPRMRADCFEGIRARIAALRPDGPIAKRFLTKNQTVAVVGDSVFVHGGLLAEHIEYGLERINEEVRGWINGFKGGRYAPAYCRGGNSVVWLRKFSEEMAHKCDCAALEHALSTIPGVKRMIMGHTIQDAGINGVCNDKAIRIDVGMSKGCADGLPEVLEIRRDSGVRIVTSNPLYKENLYSHVAPDSKTGLGLLVPVPKQVEVKA*
Note
- Unlike the target file for Angiosperms353, the target sequences in this file are all protein sequences.
- Even so, there is no need to specify the target file type, since HybSuite automatically recognizes the sequence type (nucleotide/protein).
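HybSuite's internal detection logic is not documented here, but a simple heuristic of the kind such tools commonly use looks like this (our assumption: a sequence made almost entirely of A/C/G/T/U/N characters is nucleotide):

```python
def guess_seq_type(seq: str) -> str:
    """Guess whether a sequence is nucleotide or protein.

    Heuristic sketch (ours, not HybSuite's actual code): if at least
    95% of the residues are A/C/G/T/U/N, treat it as nucleotide.
    """
    seq = seq.upper().replace("-", "")
    if not seq:
        return "unknown"
    nuc = sum(seq.count(c) for c in "ACGTUN")
    return "nucleotide" if nuc / len(seq) >= 0.95 else "protein"

print(guess_seq_type("AATGTCATCCAGGATAAATATCGG"))  # nucleotide
print(guess_seq_type("MAFRRVLTTVILFCYLLISSQS"))    # protein
```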
3. Run the pipeline
First of all, change your working directory to the downloaded example_dataset directory:
cd <the path to the directory of "example_dataset">
Next, create output directories (or specify an existing directory when running HybSuite):
mkdir -p ./Angiosperms353/Output ./Arabidopsis100/Output
After setting the right working directory, run the following commands for the two example datasets:
Angiosperms353
hybsuite full_pipeline \
-input_list ./Angiosperms353/Input_list.txt \
-input_data ./Angiosperms353/Input_sequences \
-output_dir ./Angiosperms353/Output \
-nt 5 \
-process 5 \
-t ./Angiosperms353/Target_file_Angiosperms353.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14
Arabidopsis100
hybsuite full_pipeline \
-input_list ./Arabidopsis100/Input_list.txt \
-output_dir ./Arabidopsis100/Output \
-nt 5 \
-process 5 \
-t ./Arabidopsis100/Target_file_Arabidopsis_thaliana100.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14
Tip
The command formats hybsuite full_pipeline [options] ... and hybsuite stage1 [options] ... apply to the Conda installation.
If HybSuite is installed locally from source, run:
bash <absolute_path_to_HybSuite/bin/HybSuite.sh> full_pipeline ...
instead of hybsuite full_pipeline.
3 - Extension tools
Apart from the main pipeline, we also offer some extension tools for results visualization and statistical analysis. This page tells you how to use them!
(1) Overview
Paralogs are homologous genes that arise from gene duplication within the same species. plot_paralog_heatmap.py is a Python script for analyzing and visualizing paralog distribution patterns across samples and loci. As part of the HybSuite toolkit, it processes an unaligned FASTA file for each locus to:
- Count paralogous sequences for each sample at each locus and generate a TSV format data table recording the counts.
- Generate heatmaps to visualize paralog distribution patterns, with auto-adjusted dimensions based on sample and locus counts.
- Support multi-threading for improved efficiency.
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in
<conda_env>, activate it to run this script:
conda activate <conda_env>
- Otherwise, manually install the dependencies first:
pip install pandas seaborn matplotlib numpy
(3) Input files
This script processes the following input files:
- (1) Input directory (required, specified by -i/--input_dir):
A directory containing multiple FASTA files; each FASTA file represents a locus and contains sequences from multiple samples. Files should be named <locus_name>.fasta, <locus_name>_paralogs.fasta, or <locus_name>_paralogs_all.fasta.
Tip
FASTA file format requirements:
- Sequence headers start with '>'
- Header format should be >sample_name [other information] or simply >sample_name
- Multiple sequences per sample at one locus are allowed.
For example:
>Sample1
ATGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sample2
ATGCTAGCTATCGATCGATCGATCGATCGA
>Sample3
ATGCTCGATCGATCGATCGATCGATCGATC
NOTE:
If a sample has only one sequence at a locus, that sequence is the ortholog. If a sample has multiple sequences at a locus, those sequences are putative paralogs.
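The counting rule in the note above can be sketched in Python (our illustration; plot_paralog_heatmap.py's actual implementation may differ):

```python
from collections import Counter

def paralog_counts(fasta_path):
    """Count sequences per sample in one locus FASTA file.

    A count of 1 marks a putative single-copy ortholog; a count of 2
    or more marks putative paralogs, mirroring the note above.
    """
    counts = Counter()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Sample name is the first whitespace-separated token;
                # HybPiper-style suffixes like '.main'/'.0' are stripped.
                name = line[1:].split()[0]
                counts[name.split(".")[0]] += 1
    return counts
```

Applied to a HybPiper paralog file such as the 5942_paralogs_all.fasta excerpt shown next, Elaeagnus_angustifolia (headers .main and .0) would count as 2, i.e. putative paralogs.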
The paralog sequence FASTA files retrieved by hybpiper paralog_retriever can be used directly as input files for this script. For example, 5942_paralogs_all.fasta in our dataset:
>Elaeagnus_angustifolia.main NODE_1_length_1285_cov_267.250828,Elaeagnus-pungens-Elaeagnus_pungens,0,201,98.51,(1),312,915
ACCTTCCTTGACCTCAAGACCGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATCATTACAGACCTGCAGAACAAACGCCGTGGCACCGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGCGGAAAGAAGCCCTACCCACAAAAGAAAACGGGTAGGGCTCGACAGGGGTCCAAGAGAACTCCACTCCGGCCAGGTGGAGGAGTCGTCTTTGGGCCTAAGCCCAGAGATTGGAGCATCAAGATCAATAGAAAGGAGAAAAGGTTGGCGATTTCGACAGCAATGTCTAGTGCAGCTGCGAATACGATCGTGGTGGAGGATTTTTGGGACAATATGGATAAACCCAGGACGAAGGATTTTATAGCTGCTATGAAGAGGTGGGGTTTAAATCCACCGGGAGAGAAAGCTATGTTTATGATGGACGAAATTTCGGATAACGTGAGGCTTTCAAGTAGAAATATTCCGAAAGTGAAGGTTTTGACCCCGAGGACTTTGAATTTGTTTGATATTTTAAATGCGGATAAGTTGGTGCTTACCCCTGCTGCTGTGGATTACTTGAATGGACGTTATGGTGTTAATTATGAGGGTGAGAGT
>Elaeagnus_angustifolia.0 NODE_2_length_1266_cov_276.023549,Elaeagnus-pungens-Elaeagnus_pungens,2,201,89.95,(-1),301,898
CTTGATCTCAAAACAGCACCACCCGAAACTGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCGTGGGACTGCCTCAACCCTAACCCGTGGTGAGGTTAGAGGTGGTGGGAAAAAACCTTACCCACAAAAGAAAACGGGTCGGGCCCGACAAGGGTCCAAGAGAACTCCACTCCGTCCCGGTGGAGGTGTCGTTTTTGGTCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTGGTGGAGGATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGAGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCGGAAGAGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCTAATTATGAGGGGGAGAAT
>Elaeagnus_macrophylla.main NODE_2_length_1269_cov_183.823826,Elaeagnus-pungens-Elaeagnus_pungens,0,201,88.06,(1),297,900
ACATACCTCGATCTCAAAACAGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCCTGGGACTGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGTGGAAAGAAACCTTACCCACAAAAGAAAACGGGTCGCGCTCGACAAGGGTCAAAAAGAACTCCACTCCGTCCAGGTGGAGGTGTCGTTTTTGGGCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTAGTGGAGCATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGGGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCTGAAGTGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCAAATTATGAGGGTGAGAAT
...
- (2) Species list file (optional, specified by --species_list):
If you want to analyze only specific species, you can provide a species list file:
Sample1
Sample2
Sample3
(4) Basic usage
python plot_paralog_heatmap.py \
-i <input_dir> \
-opr <counts.tsv> \
-oph <heatmap.png> \
[options] ...
- Required parameters in basic usage:
-i/--input_dir: Directory containing FASTA files with paralogous sequences (formatted as <locus_name>_paralogs.fasta)
- At least one output option must be specified:
-opr, --output_paralog_report: Generate a TSV file containing paralog counts
-oph, --output_paralog_heatmap: Generate a heatmap visualization (format determined by file extension)
(5) Full parameters:
General options:
-t THREADS, --threads THREADS
Number of threads to use for processing (default: 1)
--species_list SPECIES_LIST
File containing list of species to include in the analysis (one species per line)
--output_species_list OUTPUT_SPECIES_LIST
Output file to save the list of processed species
Heatmap customization options:
--dpi DPI DPI (dots per inch) for output image (default: 300)
--fig_length FIG_LENGTH
Figure length in inches (default: auto-calculated based on number of loci)
--fig_height FIG_HEIGHT
Figure height in inches (default: auto-calculated based on number of species)
--sample_font SAMPLE_FONT
Font size for sample labels in points (default: 10)
--gene_font GENE_FONT
Font size for gene labels in points (default: 10)
--hide_xlabels Hide x-axis labels (locus names)
--hide_ylabels Hide y-axis labels (sample names)
--no_grid Do not show grid lines in heatmap
--color {black,blue,red,green,purple,orange,yellow,brown,pink}
Color scheme for heatmap gradient (default: black)
--show_values Show numerical values in heatmap cells (only for values >= 2)
--grid_color GRID_COLOR
Color for grid lines in heatmap (default: grey)
--add_markers Add visual markers in cells (dots for 1s, diagonal lines for 0s)
(6) Output examples
- TSV Report (paralog_counts.tsv):
(use -opr, --output_paralog_report to specify the output filename and directory.)
Species gene1 gene2 gene3
Sample1 2 1 0
Sample2 1 1 1
Sample3 0 2 1
- Heatmap Visualization
The heatmap uses color intensity to represent the number of recovered sequences:
- White: no sequences (0)
- Light color: a single sequence (1), representing single-copy orthologs.
- Darker color: multiple sequences (≥2), representing putative paralogs.
- For example, the following figure is the default output for our test dataset Arabidopsis100.
You can find it in <output_dir>/02-All_paralogs/04-Filtered_paralog_reports_and_heatmap/Filtered_paralog_heatmap.png after running HybSuite by following our guide. In the HybSuite Stage 2 pipeline, this script is applied to generate the heatmaps for original and filtered paralogs; by default, --show_values is used to display the number of recovered sequences at each locus for each sample.

- When running this script manually, the recovered sequence counts won't be displayed unless you use the --show_values option:

- To clearly show the type of sequence at each locus for each sample, it is advisable to use --add_markers together with --show_values to add markers and numbers to the figure:
X: no sequences (0)
·: a single sequence (1), representing single-copy orthologs.
<number>: multiple sequences (≥2), representing putative paralogs.
python plot_paralog_heatmap.py ... --add_markers --show_values

- Besides, you can also use --color to switch to a different color theme:
python plot_paralog_heatmap.py ... --color red

python plot_paralog_heatmap.py ... --color blue

NOTE: This script provides nine color themes: black (default), red, blue, purple, green, orange, yellow, brown, and pink.
(7) Use cases
- Paralog Distribution Analysis: Identify which species and genes tend to have more paralogs
- Data Quality Assessment: Evaluate completeness of sequencing and assembly
- Evolutionary Analysis: Study gene duplication events across different species
- Data Visualization: Generate high-quality visualizations for papers or reports
(8) Tips and tricks
- For large datasets, increase the thread count (-t parameter) to speed up processing
- If sample names are long, use a smaller sample font size (--sample_font)
- This script can adjust the image dimensions automatically (which is often best for visualization). You can also use --fig_length and --fig_height to adjust the image manually.
- Use --show_values to display specific paralog counts directly on the heatmap (for counts ≥2)
(1) Overview
plot_recovery_heatmap_v2.py visualizes sequence recovery across samples and loci. It generates heatmaps showing the percentage of sequence length recovered for each gene in each sample, relative to reference sequences.
It highlights:
- Well-recovered loci across samples
- Samples with poor overall recovery
- Recovery patterns indicating systematic biases
Key features:
- Calculates sequence lengths from FASTA files
- Generates comprehensive sequence length tables in TSV format
- Creates customizable heatmaps with multiple color schemes
- Supports comparison against mean or maximum reference lengths
- Offers extensive visualization options including value display and grid customization
- Provides multi-threading support for processing large datasets
- Supports interactive Plotly HTML output
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in
<conda_env>, activate it to run this script:
conda activate <conda_env>
- Otherwise, manually install the dependencies first:
pip install biopython pandas seaborn matplotlib numpy plotly
(3) Requirements
The script requires:
- Python 3.6 or higher
- Biopython (for sequence parsing)
- NumPy and Pandas (for data manipulation)
- Matplotlib and Seaborn (for visualization)
The script automatically checks for required packages and will provide clear error messages if any are missing.
(4) Input files
- Directory of FASTA files
  - Each file should contain sequences for a single locus across multiple samples
  - Supported file extensions: .fna, .fasta, .fa
  - Each sequence header should start with the species/sample name (e.g., >species_name rest_of_header)
- Target sequence file
  - A single FASTA file containing reference sequences for all target loci
  - Each sequence ID should include the locus name at the end, separated by a hyphen (e.g., >ref-locusnameA)
  - The script automatically detects whether references are nucleotide or protein sequences
- Species list file
  - A simple text file with one species name per line (-s <FILE>)
  - If not provided, species names will be automatically extracted from the FASTA files
(5) Basic usage
The basic command requires only the input directory and reference file:
python plot_recovery_heatmap_v2.py -i /path/to/fasta_files -r /path/to/reference.fasta \
--output_heatmap /path/to/recovery_heatmap.html
This will:
- Calculate sequence lengths for each sample and locus.
- Generate a seq_lengths.tsv file in the current directory (use --output_seq_lengths to change the output path).
- Create a heatmap named recovery_heatmap.html in the directory /path/to/.
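For intuition, the recovery value plotted for each cell can be thought of as follows (a sketch under our own assumptions; the script's exact arithmetic may differ):

```python
from statistics import mean

def recovery_percent(sample_len, ref_lengths, use_max=False):
    """Percentage of reference length recovered for one sample/locus.

    Mirrors the script's two comparison modes: the mean reference
    length by default, or the maximum when --use_max is given.
    """
    denom = max(ref_lengths) if use_max else mean(ref_lengths)
    return 100.0 * sample_len / denom

print(recovery_percent(450, [500, 600]))                # vs mean length 550
print(recovery_percent(450, [500, 600], use_max=True))  # vs max length 600
```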
(6) Example
If you have finished running our pipeline, open the directory 02-All_paralogs/03-Filtered_paralogs, which is one of the output directories of our example dataset Angiosperms353. Then you will find there are many FASTA files in it:
4471_paralogs_all.fasta
4527_paralogs_all.fasta
4691_paralogs_all.fasta
4724_paralogs_all.fasta
...
- Locus names: 4471, 4527, 4691, 4724, ...
- Filename suffix: _paralogs_all
- File extension: .fasta
In that case:
- Use the option -i/--input_dir to specify the path to this directory;
- Use the option -r to specify the path to the Angiosperms353 target file (locus names in the target file must correspond to those in your input directory);
- Use the option --filename_suffix to specify the filename suffix _paralogs_all so the script can extract the locus name from each filename.
Run:
cd /path/to/02-paralogs/03-Filtered_paralogs/
python plot_recovery_heatmap_v2.py -i . -r /path/to/Target_file_Angiosperms353.fasta --filename_suffix "_paralogs_all" --output_heatmap ./recovery_heatmap.html -gw 0
Then you can obtain an interactive heatmap HTML file ./recovery_heatmap.html:
- The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
- The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.
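The longest-sequence rule described above can be sketched as follows (our illustration, not the script's actual code):

```python
def longest_per_sample(records):
    """Keep only the longest sequence per sample for heatmap plotting.

    `records` is an iterable of (sample_name, sequence) pairs; when a
    sample has several putative paralogs at a locus, only its longest
    sequence contributes to the plotted recovery value.
    """
    best = {}
    for name, seq in records:
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best

recs = [("s1", "ACGT"), ("s1", "ACGTACGT"), ("s2", "AC")]
print({k: len(v) for k, v in longest_per_sample(recs).items()})
# {'s1': 8, 's2': 2}
```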
Now, let's explore this interactive HTML file:
- Set the "Sort by" option to "Descending" to sort samples and loci on the heatmap from high to low recovery.
- Click on the “Plus” (+) and “Minus” (-) icons in the upper right corner to zoom in and out of the heatmap.
- Click on the “AutoScale” icon in the upper right corner to auto-scale the heatmap.
- Click the "Camera" icon in the upper right corner to download the current heatmap view as a PNG file.
Tip
- If some samples recover very few or no loci, we recommend replacing their data sources or increasing the value of
-seqs_min_loci_coverage to exclude these low-quality samples from downstream analyses.
(7) Full parameters
usage: plot_recovery_heatmap_v2.py [-h] -i INPUT_DIR -r REFERENCE [-s SPECIES_LIST] [--filename_suffix FILENAME_SUFFIX]
[--output_species_list OUTPUT_SPECIES_LIST] [--output_heatmap OUTPUT_HEATMAP]
[--output_seq_lengths OUTPUT_SEQ_LENGTHS] [-t THREADS]
[--color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}] [--title TITLE]
[--use_max] [--xlabel XLABEL] [--ylabel YLABEL] [-gw GRID_WIDTH]
plot_recovery_heatmap_v2.py - A visualization tool in HybSuite
This script is a component of the HybSuite toolkit, designed for visualizing sequence recovery
rates across different taxa and loci. It generates heatmaps that display the percentage of
sequence length recovered for each gene in each taxon, relative to either the average or
maximum length of reference sequences.
Key features:
1. Calculates sequence lengths and generates a seq_lengths.tsv file
2. Calculates the percentage length recovered relative to reference sequences
3. Generates customizable heatmaps showing recovery rates
4. Supports both average and maximum reference length comparisons
5. Offers flexible visualization options
Both the seq_lengths.tsv file and heatmap generation are optional outputs.
Part of HybSuite
optional arguments:
-h, --help show this help message and exit
-i INPUT_DIR, --input_dir INPUT_DIR
Directory containing FASTA files for each locus
-r REFERENCE, --ref REFERENCE
Target sequence file (FASTA format)
-s SPECIES_LIST, --species_list SPECIES_LIST
File containing list of species names (one per line). If not provided, species names will be extracted from FASTA files
--filename_suffix FILENAME_SUFFIX
Suffix(es) to remove from input FASTA filenames to get locus names. Multiple suffixes can be separated by commas. Example: "_paralogs_all". If not specified, the input filenames will be recognized as loci names.
--output_species_list OUTPUT_SPECIES_LIST, -osp OUTPUT_SPECIES_LIST
Output file for extracted species list (when species_list is not provided)
--output_heatmap OUTPUT_HEATMAP, -oh OUTPUT_HEATMAP
Output path and filename for the heatmap (default: recovery_heatmap.html). Should end with .html extension.
--output_seq_lengths OUTPUT_SEQ_LENGTHS, -osl OUTPUT_SEQ_LENGTHS
Output file for sequence lengths (TSV format). If not provided, sequence lengths will be written to seq_lengths.tsv in current directory
-t THREADS, --threads THREADS
Number of threads to use (default: 1)
--color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}
Color scheme for the HTML heatmap (default: blue). Available options: viridis, magma, inferno, plasma, cividis, turbo, purple, blue, green, black
--title TITLE Main title of the heatmap (default: "Percentage length recovery for each gene")
--use_max Use maximum length instead of average length from reference sequences
--xlabel XLABEL X-axis label (default: "Locus")
--ylabel YLABEL Y-axis label (default: "Sample")
-gw GRID_WIDTH, --grid_width GRID_WIDTH
The value of grid width of the heatmap, recommended to set as "0" when the locus number is huge (default: 0.5)
(1) Overview
RLWP.py (Remove Loci With Paralogs) is a Python script within the HybSuite toolkit designed to filter out genetic loci with excessive paralog occurrences. Paralogs are gene copies that arise from gene duplication events and can complicate phylogenetic analyses. This tool identifies and removes loci that exceed a user-defined threshold of paralog presence across samples, helping to improve the quality of downstream analyses by maintaining only single-copy orthologous markers.
Key features:
- Filters loci based on paralog occurrence statistics
- Supports multi-threading for improved performance
- Provides detailed logging and reporting
- Offers in-place filtering or non-destructive output to a separate directory
- Works with various sequence file suffixes (FASTA, FNA, fasta, fa)
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in
<conda1_env>, activate it to run this script:
conda activate <conda1_env>
- Otherwise, manually install the dependencies first:
pip install biopython pandas
(3) Input files
RLWP.py requires two main types of input:
- A directory containing sequence files: A directory containing nucleotide sequence files in FASTA format (.fa, .fasta, .fna, .fas or their uppercase variants). Each file should represent one locus.
- Paralog statistics file: A tab-separated values (TSV) file containing paralog counts per sample for each locus.
- 0: no sequence was recovered at this locus for the sample.
- 1: the only recovered sequence at this locus for the sample is a single-copy ortholog.
- more than 1: putative paralogs exist at this locus for the sample.
This file should have:
- Sample IDs in the first column
- Locus names as column headers
- Values representing the number of paralogs found for each sample-locus combination
NOTE: This file can be generated by running plot_paralog_heatmap.py (with the option -opr, see here)
Example paralog statistics file format:
Sample Locus1 Locus2 Locus3
sample1 1 2 1
sample2 1 1 3
sample3 2 1 1
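Conceptually, the filtering decision works on this table as follows (a sketch of the idea; RLWP.py's actual implementation and threshold semantics may differ slightly):

```python
def loci_to_remove(table, samples_threshold):
    """Identify loci where at least `samples_threshold` samples show
    putative paralogs (count > 1), following the RLWP idea above.

    `table` maps locus -> {sample: paralog_count}.
    """
    removed = []
    for locus, counts in table.items():
        n_paralog_samples = sum(1 for c in counts.values() if c > 1)
        if n_paralog_samples >= samples_threshold:
            removed.append(locus)
    return removed

# The example paralog statistics table from above:
table = {
    "Locus1": {"sample1": 1, "sample2": 1, "sample3": 2},
    "Locus2": {"sample1": 2, "sample2": 1, "sample3": 1},
    "Locus3": {"sample1": 1, "sample2": 3, "sample3": 1},
}
# With a threshold of 1, every locus has at least one paralogous sample:
print(loci_to_remove(table, 1))
```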
(4) Basic usage
- Remove loci where >2 samples show putative paralogs:
python RLWP.py -i input_directory -p paralog_statistics.tsv -s 2 -or deletion_report.tsv
Required parameters:
- -i, --input_dir: Directory containing sequence files
- -p, --paralog_heatmap: Path to paralog statistics file (TSV format)
- -s, --samples_threshold: Minimum number of samples with paralogs to trigger locus removal
- -or, --output_report: Path for saving the deletion report
Optional parameters:
- -o, --output_dir: Optional directory to output filtered files (preserves originals)
- -t, --threads: Number of threads to use for parallel processing (default: 1)
(5) Output examples
Tips and tricks
- Choosing the Right Threshold: Start with a conservative threshold (e.g., 5% of your total samples) and adjust based on your dataset characteristics.
- Non-destructive Workflow: Use the -o option to create a filtered copy of your data without modifying the original files:
python RLWP.py -i <input_directory> -p paralog_reports.tsv -s 3 -o filtered_data
- Performance Optimization: For large datasets, increase the thread count to speed up processing.
(1) Overview
filter_seqs_by_length.py is a Python script within the HybSuite package that filters DNA sequences based on length criteria. It can filter sequences by an absolute minimum length or by length relative to reference sequences.
- It is particularly useful for removing short, potentially truncated sequences before downstream analyses, helping to ensure high-quality datasets for phylogenomic analysis.
- It processes multiple FASTA files in parallel; the reference file can contain DNA or protein sequences.
- It provides detailed logging and reporting of filtered sequences, making it easy to track what was removed and why.
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:
conda activate <conda1_env>
- Otherwise, manually install the dependencies first:
pip install biopython pandas
(3) Input requirements
filter_seqs_by_length.py requires two types of input:
- A directory containing sequence files in FASTA format:
- Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive)
- Each file is assumed to contain sequences from a single locus
- Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName)
- Reference sequences:
- The reference sequences must follow the same format required by the HybSuite main program (here to check).
(4) Basic usage
- Filter sequences by absolute minimum length:
python filter_seqs_by_length.py -i input_directory --min_length 300
- Filter using reference sequences according to a length ratio (relative to the mean or maximum length of each locus in the reference file):
python filter_seqs_by_length.py -i input_directory -r reference.fasta --mean_length_ratio 0.7
- Combine multiple filtering criteria:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
--min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5
- Save output to a different directory rather than modifying the original files:
python filter_seqs_by_length.py -i input_directory --output_dir filtered_sequences
- Generate report of removed sequences:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
--mean_length_ratio 0.7 --output_report removed_seqs.tsv
(5) Output examples
Filtered FASTA Files
Filtered sequences are written either:
- To the original files (overwriting them)
- To a new directory if --output_dir is specified
Removed Sequences Report
When using --output_report, a TSV file is created with details of removed sequences:
| File | Sequence_ID | Length | Mean_Length_Ratio | Max_Length_Ratio |
|---|---|---|---|---|
| gene1.fasta | Sample1_gene1 | 125 | 0.435 | 0.391 |
| gene2.fasta | Sample3_gene2 | 78 | 0.213 | 0.185 |
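The two ratio columns can be derived as shown in this small sketch. It is illustrative only: `length_ratios` is a hypothetical helper, and the reference lengths below are made-up numbers, not the values behind the example report.

```python
# Illustrative sketch of the ratio columns in the removed-sequences report
# (hypothetical helper; the real script reads FASTA and reference files).
def length_ratios(seq_len, ref_lengths):
    """Ratio of a sequence's length to the mean and to the maximum
    reference length for its locus."""
    mean_len = sum(ref_lengths) / len(ref_lengths)
    return round(seq_len / mean_len, 3), round(seq_len / max(ref_lengths), 3)

# A 125 bp sequence against made-up reference lengths for one locus:
mean_ratio, max_ratio = length_ratios(125, [250, 300, 312])
print(mean_ratio, max_ratio)  # 0.435 0.401
# Under --mean_length_ratio 0.7 this sequence would be removed,
# since its mean-length ratio is well below 0.7.
```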
(6) Use cases
Cleaning Assembled Data
- Remove truncated sequences resulting from poor assembly or capture:
python filter_seqs_by_length.py -i captured_exons -r reference.fasta \
--mean_length_ratio 0.3 --output_report removed_sequences.tsv
Tips and Tricks
- Locus Identification: Ensure filenames match the locus IDs in the reference sequences (the part after the last hyphen).
- Preserving Originals: Always use --output_dir when testing filtering parameters to avoid overwriting original files.
- Speed Optimization: Set -t to match available CPU cores for maximum performance.
- Multiple Filters: Combining --min_length with ratio-based filters creates more stringent filtering.
- Protein Sequences: The script automatically detects the type of reference file (DNA/protein) and adjusts length calculations appropriately.
(1) Overview
filter_sequences_by_sample_and_locus_coverage.py is a Python script designed to remove samples with low locus coverage and loci with low sample coverage in a phylogenomic dataset, based on user-defined thresholds.
This tool can:
- Filter samples and loci based on minimum coverage thresholds
- Generate reports of removed samples and loci
- Process files in parallel to improve performance
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:
conda activate <conda1_env>
- Otherwise, manually install the dependencies first:
pip install biopython pandas
(3) Input requirements
The tool processes FASTA files in a directory with the following requirements:
- A directory containing sequence files in FASTA format:
- Supported extensions: .fa, .fasta, .fna, .fas, .FNA (case-insensitive).
- Each file is assumed to contain sequences from a single locus.
- Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName).
(4) Basic usage
python filter_seqs_by_sample_and_locus_coverage.py -i input_directory --min_sample_coverage 0.5 --min_locus_coverage 0.7
Required parameters
- -i, --input: Directory containing FASTA files.
- --min_sample_coverage: Minimum sample coverage ratio (0-1) for each locus (default: 0.0)
- --min_locus_coverage: Minimum locus coverage ratio (0-1) for each sample (default: 0.0)
Optional parameters
- -o, --output_dir: Directory for filtered sequences (if not specified, original files are modified)
- -t, --threads: Number of threads to use (default: 1)
- --removed_samples_info: Output TSV file for removed samples coverage information
- --removed_loci_info: Output TSV file for removed loci coverage information
(5) Output examples
- Example of removed_samples.tsv (specified by --removed_samples_info):
Sample Locus_Coverage
Species1 0.45
Species2 0.32
- Example of removed_loci.tsv (specified by --removed_loci_info):
Locus Sample_Coverage
Locus1 0.38
Locus2 0.42
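The two thresholds compare against two different ratios, which can be sketched as follows. This is an illustration under assumed names (`coverages`, a presence/absence dict); the real script derives presence/absence from the FASTA files themselves.

```python
# Sketch of the two coverage ratios used by the filter (illustrative names):
# sample coverage of a locus  = fraction of samples recovered for that locus;
# locus coverage of a sample  = fraction of loci in which that sample appears.
def coverages(presence):
    """presence: dict mapping locus -> set of sample IDs recovered for it.
    Returns (sample_coverage_per_locus, locus_coverage_per_sample)."""
    all_samples = set().union(*presence.values())
    n_samples, n_loci = len(all_samples), len(presence)
    sample_cov = {loc: len(s) / n_samples for loc, s in presence.items()}
    locus_cov = {sp: sum(sp in s for s in presence.values()) / n_loci
                 for sp in all_samples}
    return sample_cov, locus_cov

presence = {
    "Locus1": {"Species1", "Species2"},
    "Locus2": {"Species1", "Species2", "Species3"},
    "Locus3": {"Species3"},
}
sample_cov, locus_cov = coverages(presence)
print(sample_cov["Locus1"])   # 2 of 3 samples -> ~0.667
print(locus_cov["Species3"])  # 2 of 3 loci   -> ~0.667
```

With --min_sample_coverage 0.5, Locus3 (sample coverage 1/3) would be removed; with --min_locus_coverage 0.7, Species3 (locus coverage 2/3) would be removed.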
(6) Use cases
- The following command removes loci that appear in fewer than 60% of samples, and samples that recover fewer than 50% of loci:
python filter_seqs_by_sample_and_locus_coverage.py \
-i assembled_loci/ -o filtered_loci/ \
--min_sample_coverage 0.6 --min_locus_coverage 0.5 \
--removed_samples_info removed_samples.tsv --removed_loci_info removed_loci.tsv
(7) Tips and tricks
- Choosing Coverage Thresholds: Start with lower thresholds (e.g., 0.3-0.5) and gradually increase until you achieve the desired balance between data completeness and taxon/locus sampling.
- Preserving Original Files: Always use the -o option to output to a new directory when experimenting with different thresholds.
- Removing Problematic Samples Only: Set --min_locus_coverage without setting --min_sample_coverage to filter out only low-coverage samples while keeping all loci.
- Tracking Removed Data: Always use the --removed_samples_info and --removed_loci_info options to keep records of what was filtered out for documentation and troubleshooting.
- Performance Optimization: Use the -t option with a value close to your CPU core count for faster processing of large datasets.
(1) Overview
(2) Basic usage
The basic usage of modified_phypartspiecharts.py is nearly the same as that of phypartspiecharts.py. The only difference is that users must use --output to specify the path and filename of the output visualization results when running modified_phypartspiecharts.py, rather than --svg_name as in phypartspiecharts.py.
python modified_phypartspiecharts.py \
species_tree phyparts_root num_genes ...
- Required Parameters in basic usage
- species_tree: Path to species tree file (Newick format)
- phyparts_root: Prefix of PhyParts output files
- num_genes: Total number of gene trees
(3) Extended functionality
Compared to the original version, modified_phypartspiecharts.py offers the following extended functionality:
a. Running Efficiency Control
- Multithreading Support: Use -nt/--threads <NUM> for multithreaded processing, significantly improving speed for large datasets
b. Output Files Control
Support for SVG and PDF output: use --output <output_file> and give your output file the extension .pdf or .svg.
Additional Statistical Output: use the --stat parameter to export detailed node statistics to a TSV file.
The detailed node statistics table generated by the --stat parameter contains the following columns:
- Node: Node ID
- Support(blue): Number of genes supporting the species tree
- TopConflict(green): Number of genes with the main conflict
- OtherConflict(red): Number of genes with other conflicts
- NoSignal(gray): Number of genes with no signal
- Support/Total_Ratio: Ratio of supporting genes to all genes
The file also includes the average ratios for internal nodes including Support/Total Ratio, Conflict/Total Ratio, NoSignal/Total Ratio, Support/Signal_Ratio and Conflict/Signal_Ratio.
For example:
Node Support(blue) TopConflict(green) OtherConflict(red) NoSignal(gray) Support/Total_Ratio Conflict/Total_Ratio NoSignal/Total_Ratio Support/Signal_Ratio Conflict/Signal_Ratio
0 85 0 0 157 0.3512 0.0000 0.6488 1.0000 0.0000
1 202 7 31 2 0.8347 0.1570 0.0083 0.8417 0.1583
2 135 41 59 7 0.5579 0.4132 0.0289 0.5745 0.4255
3 205 6 20 11 0.8471 0.1074 0.0455 0.8874 0.1126
4 123 22 91 6 0.5083 0.4669 0.0248 0.5212 0.4788
5 164 33 36 9 0.6777 0.2851 0.0372 0.7039 0.2961
6 190 18 29 5 0.7851 0.1942 0.0207 0.8017 0.1983
7 91 35 112 4 0.3760 0.6074 0.0165 0.3824 0.6176
8 129 18 86 9 0.5331 0.4298 0.0372 0.5536 0.4464
Average ratios for internal nodes only:
Support/Total Ratio: 0.6079
Conflict/Total Ratio: 0.2957
NoSignal/Total Ratio: 0.0964
Support/Signal Ratio: 0.6963
Conflict/Signal Ratio: 0.3037
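The ratio columns are simple functions of the four per-node gene counts. The sketch below (a hypothetical `node_ratios` helper, not the script's code) reproduces node 1's row from the example table.

```python
# Sketch of the ratio columns in the --stat table, computed from the four
# per-node gene counts (column names follow the example table above).
def node_ratios(support, top_conflict, other_conflict, no_signal):
    total = support + top_conflict + other_conflict + no_signal
    signal = total - no_signal                 # genes carrying any signal
    conflict = top_conflict + other_conflict   # green + red
    return {
        "Support/Total_Ratio": round(support / total, 4),
        "Conflict/Total_Ratio": round(conflict / total, 4),
        "NoSignal/Total_Ratio": round(no_signal / total, 4),
        "Support/Signal_Ratio": round(support / signal, 4) if signal else 0.0,
        "Conflict/Signal_Ratio": round(conflict / signal, 4) if signal else 0.0,
    }

# Node 1 from the example table: 202 support, 7 + 31 conflict, 2 no-signal.
# Reproduces node 1's ratios: 0.8347, 0.1570, 0.0083, 0.8417, 0.1583.
print(node_ratios(202, 7, 31, 2))
```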
c. Extended Visualization Functionality
- Support for controlling whether taxonomic names use italic font: use --no_italic
- Support for flexible numbers displayed on branches: use --show_num_mode <NUM>
- Support for controlling tree branch width display: use --line_width <NUM>
- Support for controlling pie chart size: use --pie_size <NUM>
- Support for controlling leaf node label size: use --tip_size <NUM>
- Support for controlling node number label size: use --number_size <NUM>
- Support for circular, cladogram, and phylogram display types: use --tree_type <circle|cladogram|phylo>
(4) Full Options
options:
-h, --help show this help message and exit
--taxon_subst TAXON_SUBST
Comma-delimited file to translate tip names.
--output OUTPUT Output filename with extension (.svg or .pdf)
--output_node_tree Generate an additional tree file with '_nodes' suffix showing:
- All node identifiers in the tree
- No pie charts
- No numerical annotations
--no_ladderize Don't ladderize tree
--to_csv Export data to CSV
--tree_type {circle,cladogram,phylo}
Tree visualization type (cladogram or circle, default: cladogram)
--line_width VT_LINE_WIDTH
Width of tree branches (default: 0)
--no_italic Display species names in normal font style (default: italic)
--tip_size TIP_SIZE_FACTOR
Scale factor for tip label font size (default: 1.0)
--number_size NUMBER_SIZE_FACTOR
Scale factor for gene tree count font size (default: 1.0)
--show_num_mode SHOW_NUM_MODE
Control what numbers to show on branches (specify 0-2 digits):
0: Hide all numbers
1: Number of genes supporting species tree (blue)
2: Number of genes conflicting with species tree (red+green)
3: Number of genes with no signal (gray)
4: Proportion of supporting genes (blue/total)
5: Proportion of conflicting genes ((red+green)/total)
6: Proportion of no signal genes (gray/total)
7: Ratio of supporting to all signal genes (blue/(blue+red+green))
8: Ratio of conflicting to all signal genes (red+green/(blue+red+green))
9: Original node support values from the input tree
Example: --show_num_mode 0 (hide all numbers)
--show_num_mode 1 (show only support number)
--show_num_mode 12 (default, show support and conflict numbers)
--show_num_mode 47 (show support number and support/conflict ratio)
--show_num_mode 9 (show original node support values)
--pie_size PIE_SIZE_FACTOR
Scale factor for pie chart size (default: 1.0)
--stat STAT_OUTPUT Output file path for node statistics (TSV format)
-nt THREADS, --threads THREADS
Number of threads to use (default: 1)
Citation
If using this tool, please cite:
(1) Overview
Fasta_formatter.py is a Python script for reformatting FASTA sequences into either interleaved (60 characters per line) or single-line format. It supports multi-threading for faster processing of large files.
(2) Dependencies
- If you’ve already installed all HybSuite dependencies in
<conda_env>, activate it to run this script:
conda activate <conda_env>
- Otherwise, no extra installation is required: pathlib and concurrent.futures are part of the Python standard library.
(3) Basic usage
python Fasta_formatter.py \
-i <input_fasta> \
-o <output_fasta> \
--inter|--single \
[-nt <threads>]
Required parameters:
-i/--input: Input FASTA file-o/--output: Output file path--inter: Output in interleaved format (60 characters per line)--single: Output in single-line format
Optional parameters:
-nt/--threads: Number of threads (default: 1)
(4) Example
Convert a FASTA file to interleaved format with 4 threads:
python Fasta_formatter.py -i sequences.fasta -o sequences_formatted.fasta --inter -nt 4
Convert to single-line format:
python Fasta_formatter.py -i sequences.fasta -o sequences_singleline.fasta --single
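The difference between the two formats boils down to line wrapping, as this sketch shows (an illustrative `format_fasta` helper, not the script's implementation; the real script adds multi-threading on top):

```python
# Sketch of the two output styles: interleaved = 60 characters per sequence
# line, single-line = the whole sequence on one line (illustrative only).
def format_fasta(records, interleaved=True, width=60):
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        if interleaved:
            # Wrap the sequence into fixed-width chunks.
            lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
        else:
            lines.append(seq)
    return "\n".join(lines) + "\n"

records = [("seq1", "ACGT" * 40)]          # one 160 bp record
inter = format_fasta(records, interleaved=True)
single = format_fasta(records, interleaved=False)
print(len(inter.splitlines()))   # 4: header + 3 sequence lines (60+60+40)
print(len(single.splitlines()))  # 2: header + 1 sequence line
```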
(5) Use cases
- Data preprocessing: Prepare sequences for downstream analysis tools that require specific FASTA formatting
- File standardization: Convert between different FASTA formats for compatibility
- Large file processing: Use multi-threading to speed up formatting of big datasets
(1) Overview
rename_assembled_data.py is a Python script in the HybSuite package designed to handle batch renaming of assembled data directories produced in HybSuite stage 2, together with their contents. It provides a comprehensive solution for renaming directories, files, and file contents while maintaining data integrity and consistency.
Key features:
- Recursively renames directory structures, file names, and file contents
- Handles potential naming conflicts safely
- Supports both single directory and batch renaming operations
(2) Basic usage
Single directory renaming
To rename a single directory and all its contents:
python rename_assembled_data.py -i /path/to/directory -n new_name
Parameters:
- -i, --input: Path to the directory you want to rename
- -n, --new_name: The new name to replace the old name
Example:
python rename_assembled_data.py -i ./sample_001 -n sample_002
Batch renaming
For batch renaming multiple directories, create a tab-delimited file containing old and new name pairs:
python rename_assembled_data.py --rename_list path/to/rename_list.txt -p /path/to/parent_directory
Parameters:
- --rename_list: Path to a tab-delimited file containing old_name and new_name pairs
- -p, --parent_dir: Path to the parent directory containing all the folders to be processed
The rename list file should be formatted as follows (tab-delimited):
old_name1 new_name1
old_name2 new_name2
old_name3 new_name3
Example:
python rename_assembled_data.py --rename_list rename_pairs.txt -p ./assembled_data
The script will:
- Process each directory listed in the rename file
- Rename all matching files and directories within each target directory
- Replace matching content within files
- Provide a summary of successful and failed operations
Note: The script includes safety checks and will skip operations that might cause conflicts or data loss.
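The core bookkeeping, reading the tab-delimited pairs and applying the old-to-new substitution, can be sketched as follows. Names like `parse_rename_list` are illustrative; the real script applies the same substitution to directory names, file names, and file contents, with its own safety checks.

```python
# Sketch of the batch-rename bookkeeping (illustrative names only).
def parse_rename_list(text):
    """Parse tab-delimited old_name/new_name pairs, one pair per line."""
    pairs = []
    for line in text.strip().splitlines():
        old, new = line.split("\t")
        pairs.append((old, new))
    return pairs

def rename(name, pairs):
    """Apply every old -> new substitution to a name."""
    for old, new in pairs:
        name = name.replace(old, new)
    return name

pairs = parse_rename_list("sample_001\tPinus_001\nsample_002\tPinus_002")
print(rename("sample_001_paralogs.fasta", pairs))  # Pinus_001_paralogs.fasta
print(rename("unrelated.txt", pairs))              # unrelated.txt (unchanged)
```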
4 - Full parameters
This page provides the full options and parameters for each subcommand, along with additional explanations and links where necessary. The available subcommands can be viewed using the command:
hybsuite -h/--help
or:
bash <the path to HybSuite.sh> -h/--help
Parameters for running hybsuite stage1
Stage 1 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage1 ...
Mandatory arguments: -input_list -input_data (required when including user-provided data) -output_dir
Essential arguments: -sra_maxsize -NGS_dir -nt -process
Arguments for inputs:
-input_list <FILE> The file listing input sample names and corresponding data types. (Default: None)
-input_data <DIR> The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).
Arguments for outputs:
-output_dir <DIR> The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
-NGS_dir <DIR> The output directory containing raw and cleaned reads files (Default: <output_dir>/NGS_dataset).
Notes: Pre-existing cleaned reads will skip reads trimming steps.
General arguments:
=== Threads control ===
-nt <INT|AUTO> Global thread setting. (Default: 1)
-nt_fasterq_dump <INT>
fasterq-dump threads. (Default: 1)
-nt_pigz <INT> pigz compression threads. (Default: 1)
-nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)
=== Parallel control ===
-process <INT|all> Number of public data downloading and raw reads trimming to run concurrently. (Default: 1)
"all" means running all samples concurrently. (be cautious to set this option)
=== Public raw reads downloading control ===
-rm_sra <TRUE/FALSE> Whether to remove SRA files after conversion. (Default: TRUE)
-download_format <fastq|fastq_gz>
Downloaded data format. (Default: fastq_gz)
=== Logfile Control ===
-log_mode <simple|cmd|full>
The output mode of hybsuite logfile. (Default: cmd)
Arguments for integrated tools:
=== SRAToolkit ===
-sra_maxsize <NUM> The maximum size of sra files to download. (Default: 20GB)
=== Trimmomatic ===
-trimmomatic_leading_quality <3-40>
Leading base quality cutoff. (Default: 3)
-trimmomatic_trailing_quality <3-40>
Trailing base quality cutoff. (Default: 3)
-trimmomatic_min_length <36-100>
Minimum read length. (Default: 36)
-trimmomatic_sliding_window_s <4-10>
Sliding window size. (Default: 4)
-trimmomatic_sliding_window_q <15-30>
Window average quality. (Default: 15)
Command example:
# Run HybSuite stage1 with 1 thread and 1 parallel processing
$ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./
# Run HybSuite stage1 with 5 threads and 5 parallel processing
$ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./ -nt 5 -process 5
Parameters for running hybsuite stage2
Stage 2 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage2 ...
Mandatory arguments: -input_list -NGS_dir -t -output_dir
Essential arguments: -eas_dir -seqs_min_length -seqs_min_sample_coverage -nt -process
Arguments for inputs:
-input_list <FILE> The file listing input sample names and corresponding data types used in stage 1. (Default: None)
-input_data <DIR> The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
-NGS_dir <DIR> The directory containing NGS raw and cleaned reads files (generated in stage 1). (Default: ./NGS_dir)
-t <FILE> Target file for data assembly. (follows the format required in HybPiper)
Arguments for outputs:
-output_dir <DIR> The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
-eas_dir <DIR> The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
Note: Pre-existing data in this directory will skip redundant assembly steps.
General arguments:
=== Putative paralogs filtering control ===
-seqs_min_length <INT>
Minimum sequence length for filtered paralogs. (Default: 0)
Putative paralogs shorter than this value will be filtered.
-seqs_mean_length_ratio <0-1>
Minimum sequence length ratio relative to the mean length per locus for putative paralogs. (Default: 0)
Putative paralogs shorter than this fraction of the mean length will be filtered.
-seqs_max_length_ratio <0-1>
Minimum length ratio relative to the longest value per locus for putative paralogs. (Default: 0)
Putative paralogs shorter than this percentage of the maximum length will be filtered.
-seqs_min_sample_coverage <0-1>
Minimum sample coverage for putative paralogs. (Default: 0)
For all putative paralogs in stage 2, HRS and RLWP sequences in stage 3, loci lower than this sample coverage will be filtered.
-seqs_min_locus_coverage <0-1>
Minimum locus coverage for putative paralogs. (Default: 0)
For all putative paralogs in stage 2, taxa (samples) with lower than this locus coverage will be filtered.
=== Heatmap control ===
-heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
Color scheme for heatmap gradient. (Default: black)
=== Threads control ===
-nt <INT|AUTO> Global thread setting (Default: 1)
-nt_hybpiper <INT> HybPiper threads (Default: 1)
=== Parallel control ===
-process <INT|all> Number of data assembly ('hybpiper assemble') to run concurrently (Default: 1)
"all" means running all samples concurrently (be cautious to set this option)
=== Logfile control ===
-log_mode <simple|cmd|full>
The output mode of hybsuite logfile. (Default: cmd)
Arguments for integrated tools:
=== HybPiper ===
-hybpiper_mapping_tool <blast|diamond>
The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
-hybpiper_check_chimeric_contigs <FALSE|TRUE>
Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: TRUE)
-hybpiper_cov_cutoff <INT>
Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
Increasing this value may increase loci recovery efficiency but can also introduce errors.
Command example:
# Run HybSuite stage2 with filtering paralog sequences
$ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5 -seqs_min_length 100 -seqs_min_sample_coverage 0.1
# Run HybSuite stage2 without filtering paralog sequences
$ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5
Parameters for running hybsuite stage3
Stage 3 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage3 ...
Mandatory arguments: -input_list -eas_dir -paralogs_dir -t -output_dir
Essential arguments: -PH -prefix -run_phyparts -aln_min_sample -nt -process
Arguments for inputs:
-input_list <FILE> The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
-input_data <DIR> The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
-eas_dir <DIR> The output directory containing HybPiper assembly sequences (generated in stage 2). (Default: <output_dir>/01-Assembled_data)
-paralogs_dir <DIR> The directory containing all paralog sequences generated in stage 2 or by users themselves. (Default: None)
It's advisable to set this parameter as '<output_dir>/02-All_paralogs/03-Filtered_paralogs'.
-t <FILE> Target file for data assembly. (follows the format required in HybPiper)
Arguments for outputs:
-output_dir <DIR> Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
-prefix <STRING> Prefix for output files. (Default: HybSuite)
General arguments:
=== Paralog handling control ===
-PH <1-7|a|b|all> Paralog handling methods to execute: (one or more of them can be chosen)
1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
a: PhyloPyPruner, b: ParaGone (Default: 1a)
=== Sequences and alignments filtering control ===
-seqs_min_length <INT>
Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
HRS and RLWP sequences shorter than this value will be removed.
-aln_min_length <INT>
Minimum sequence bp length for filtering HRS and RLWP final alignments. (Default: 4)
-aln_min_sample <INT>
Minimum sample number for final alignments. (Default: 0)
Final alignments (aligned and trimmed) with sample number below this threshold will be removed.
=== Gene tree builder control ===
-gene_tree <1/2> Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1)
-gene_tree_bb <INT> Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)
=== Alignments trimming tool control ===
-trim_tool <1/2> Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1)
=== Nucleotide ambiguity character replacement ===
-replace_n <TRUE|FALSE>
Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).
=== Threads control ===
-nt <INT|AUTO> Global thread setting. (Default: 1)
-nt_paragone <INT> ParaGone threads. (Default: 1)
-nt_phylopypruner <INT>
PhyloPyPruner threads. (Default: 1)
-nt_mafft <INT> MAFFT threads. (Default: 1)
-nt_amas <INT> AMAS.py threads. (Default: 1)
-nt_modeltest_ng <INT>
ModelTest-NG threads. (Default: 1)
-nt_iqtree <INT> IQ-TREE threads. (Default: 1)
-nt_fasttree <INT> FastTree threads. (Default: 1)
=== Parallel control ===
-process <INT|all> Number of multiple sequences aligning, alignments trimming, and gene trees inference to run concurrently. (Default: 1)
"all" means running all samples concurrently. (be cautious to set this option)
=== Heatmap control ===
-heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
Color scheme for heatmap gradient. (Default: black)
Arguments for integrated tools :
=== PhyloPyPruner ===
-pp_min_taxa <INT> Minimum taxa per cluster. (Default: 4)
-pp_min_support <0-1> Minimum support value. (Default: 0=auto)
-pp_trim_lb <INT> Trim long branches. (Default: 5)
=== ParaGone ===
-paragone_pool <INT> Parallel alignment tasks. (Default: 1, same as the option '-process')
-treeshrink_q_value <0-1>
TreeShrink quantile threshold (Default: 0.05)
-paragone_cutoff_value <FLOAT>
Branch length cutoff (Default: 0.3)
-paragone_minimum_taxa <INT>
Minimum taxa per alignment (Default: 4)
-paragone_min_tips <INT>
Minimum tips per tree (Default: 4)
=== HybPiper ===
-hybpiper_skip_chimeric_genes <FALSE|TRUE>
Whether to skip recovering sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
-hybpiper_retrieved_seqs_type <dna|intron|supercontig>
The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3. (default:dna, which means extracting coding sequences)
=== MAFFT ===
-mafft_algorithm <str>
MAFFT algorithm [auto|linsi] (Default: auto)
-mafft_adjustdirection <TRUE/FALSE>
Whether to adjust sequence directions (Default: TRUE)
-mafft_maxiterate <INT>
Maximum number of iterations for MAFFT (Default: auto)
Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
-mafft_pair <str>
Pairing strategy for MAFFT (Default: auto)
Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
=== trimAl ===
-trimal_mode <str>
trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
-trimal_gapthreshold <0-1>
Gap threshold (Default: 0.12)
-trimal_simthreshold <0-1>
Similarity threshold (Default: auto)
-trimal_cons <0-100>
Consensus threshold (Default: auto)
-trimal_block <INT>
Minimum block size (Default: auto)
-trimal_w <INT>
Window size (Default: auto)
-trimal_gw <INT>
Gap window size (Default: auto)
-trimal_sw <INT>
Similarity window size (Default: auto)
-trimal_resoverlap <0-1>
Minimum overlap of a position with other positions in the column. (Default: auto)
-trimal_seqoverlap <0-100>
Minimum percentage of sequences without gaps in a column. (Default: auto)
=== HMMCleaner ===
-hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
Users can change each value, but the values must be in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)
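The ordering constraint on the four cost values can be checked with a short sketch (`valid_cost` is an illustrative helper, not part of HybSuite):

```python
# Sketch of validating a -hmmcleaner_cost string: four '_'-separated numbers
# that must satisfy NUM1 < NUM2 < 0 < NUM3 < NUM4 (illustrative helper).
def valid_cost(cost):
    try:
        n1, n2, n3, n4 = (float(x) for x in cost.split("_"))
    except ValueError:            # wrong count or non-numeric value
        return False
    return n1 < n2 < 0 < n3 < n4

print(valid_cost("-0.15_-0.08_0.15_0.45"))  # True (the default)
print(valid_cost("-0.08_-0.15_0.15_0.45"))  # False (not increasing)
```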
Command example :
# Run HybSuite stage3 without alignments filtering
$ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353.fasta -PH 1234567a -output_dir ./ -nt 5 -process 5
# Run HybSuite stage3 with alignments filtering
$ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353.fasta -PH 124567b -output_dir ./ -nt 5 -process 5 -aln_min_length 100 -aln_min_sample 0.1
Parameters for running hybsuite stage4
Stage 4 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage4 ...
Mandatory arguments: -input_list -aln_dir -output_dir
Essential arguments: -PH -sp_tree -prefix -run_phyparts -nt -process
Arguments for inputs:
-input_list <FILE> The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
-aln_dir The directory containing different orthogroups alignments generated in stage 3. (Default: <output_dir>/06-Final_alignments)
It's advisable to set this parameter as '<output_dir>/06-Final_alignments'.
-PH <1-7|a|b|all> Choose alignments generated via paralog handling methods as input:
1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1 (one or more of them can be chosen)
a: PhyloPyPruner, b: ParaGone (Default: 1a)
Arguments for outputs:
-output_dir <DIR> Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
-prefix <STRING> Prefix for output files. (Default: HybSuite)
General arguments:
=== Species tree builder control ===
-sp_tree <1-5|all> Species tree inference method:
1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL
=== Steps control ===
-run_coalescent_step <INT>
Control which coalescent analysis steps to run:
1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
(Default: 1234)
-run_concatenated_step <INT>
Control which concatenated analysis steps to run:
1: Construct concatenated alignment, 2: Infer species tree
(Default: 12)
=== Gene tree builder control ===
-gene_tree <1/2> Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1)
-gene_tree_bb <INT> Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)
=== Gene trees collapse threshold ===
-collapse_threshold <VALUE>
Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
Nodes with support values ≤ this threshold will be collapsed into polytomies.
=== Nucleotide ambiguity character replacement ===
-replace_n <TRUE|FALSE>
Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).
=== Threads control ===
-nt <INT|AUTO> Global thread setting. (Default: 1)
-nt_amas <INT> AMAS.py threads (Default: 1)
-nt_modeltest_ng <INT>
ModelTest-NG threads (Default: 1)
-nt_iqtree <INT> IQ-TREE threads (Default: 1)
-nt_fasttree <INT> FastTree threads (Default: 1)
-nt_raxml_ng <INT> RAxML-NG threads (Default: 1)
-nt_raxml <INT> RAxML threads (Default: 1)
-nt_astral4 <INT> ASTRAL-IV threads (Default: 1)
-nt_wastral <INT> wASTRAL threads (Default: 1)
-nt_astral_pro <INT> ASTRAL-Pro3 threads (Default: 1)
=== Parallel control ===
-process <INT|all> Number of gene tree inferences in coalescent analysis to run concurrently. (Default: 1)
"all" means running all samples concurrently. (use this option with caution)
Arguments for integrated tools:
=== IQ-TREE (concatenated analysis) ===
-iqtree_bb <INT> IQ-TREE bootstrap replicates (Default: 1000)
-iqtree_alrt <INT> SH-aLRT replicates (Default: 1000)
-iqtree_run_option <str>
IQ-TREE run mode [standard|undo] (Default: undo)
-iqtree_partition <TRUE/FALSE>
Whether to use partition models in IQ-TREE (Default: TRUE)
-iqtree_constraint_tree <Treefile>
The path to the constraint tree for running IQ-TREE (Default: none)
=== ModelTest-NG ===
-run_modeltest_ng <TRUE/FALSE>
Whether to run ModelTest-NG (Default: TRUE)
=== RAxML ===
-raxml_m <str> RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
-raxml_bb <INT> RAxML bootstrap replicates (Default: 1000)
-raxml_constraint_tree <Treefile>
The path to the constraint tree for running RAxML (Default: no constraint tree)
=== RAxML-NG ===
-rng_bs_trees <INT> RAxML-NG bootstrap replicates (Default: 1000)
-rng_force <TRUE/FALSE>
Ignore thread warnings (Default: FALSE)
-rng_constraint_tree <Treefile>
The path to the constraint tree for running RAxML-NG (Default: no constraint tree)
=== ASTRAL-IV ===
-astral4_root Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
(Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)
-astral_r <INT> ASTRAL-IV rounds of search. (Default: 4)
-astral_s <INT> ASTRAL-IV rounds of subsampling. (Default: 4)
=== wASTRAL ===
-wastral_mode <1-4> wASTRAL mode [1|2|3|4] (Default: 1)
1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
-wastral_r <INT> wASTRAL rounds of search. (Default: 4)
-wastral_s <INT> wASTRAL rounds of subsampling. (Default: 4)
=== ASTRAL-Pro ===
-astral_pro_r <INT> ASTRAL-Pro rounds of search. (Default: 4)
-astral_pro_s <INT> ASTRAL-Pro rounds of subsampling. (Default: 4)
=== MAFFT (only for paralogs inclusion method -> ASTRAL-Pro) ===
-mafft_algorithm <str>
MAFFT algorithm [auto|linsi] (Default: auto)
-mafft_adjustdirection <TRUE/FALSE>
Whether to adjust sequence directions (Default: TRUE)
-mafft_maxiterate <INT>
Maximum number of iterations for MAFFT (Default: auto)
Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
-mafft_pair <str>
Pairing strategy for MAFFT (Default: auto)
Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
=== trimAl (only for paralogs inclusion method -> ASTRAL-Pro) ===
-trimal_mode <str>
trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
-trimal_gapthreshold <0-1>
Gap threshold (Default: 0.12)
-trimal_simthreshold <0-1>
Similarity threshold (Default: auto)
-trimal_cons <0-100>
Consensus threshold (Default: auto)
-trimal_block <INT>
Minimum block size (Default: auto)
-trimal_w <INT>
Window size (Default: auto)
-trimal_gw <INT>
Gap window size (Default: auto)
-trimal_sw <INT>
Similarity window size (Default: auto)
-trimal_resoverlap <0-1>
Minimum overlap of a position with other positions in the column. (Default: auto)
-trimal_seqoverlap <0-100>
Minimum percentage of sequences without gaps in a column. (Default: auto)
=== HMMCleaner (only for paralogs inclusion method -> ASTRAL-Pro) ===
-hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
Cost parameters that define the low-similarity segments detected by HMMCleaner. (Default: -0.15_-0.08_0.15_0.45)
Users can change each value, but the values must be in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)
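The ordering constraint can be checked mechanically before running; a minimal sketch (not part of HybSuite) that validates a cost string with awk:

```shell
# Validate that the four underscore-separated cost values satisfy
# NUM1 < NUM2 < 0 < NUM3 < NUM4 (prints "valid" or "invalid").
costs="-0.15_-0.08_0.15_0.45"   # the HybSuite default
echo "$costs" | awk -F'_' '{
    if ($1 < $2 && $2 < 0 && 0 < $3 && $3 < $4) print "valid"
    else print "invalid"
}'
```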
=== PhyPartsPieCharts & modified_phypartspiecharts ===
-run_phyparts <TRUE|FALSE>
Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
Note: Requires successful completion of previous coalescent analysis.
-phypartspiecharts_tree_type <cladogram/circle>
Tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
-phypartspiecharts_num_mode <num>
Control which numbers to show on branches (specify a string of digits from the list below) (Default: 12)
0: Hide all numbers
1: Number of genes supporting species tree (blue)
2: Number of genes conflicting with species tree (red+green)
3: Number of genes with no signal (gray)
4: Proportion of supporting genes (blue/total)
5: Proportion of conflicting genes ((red+green)/total)
6: Proportion of no signal genes (gray/total)
7: Ratio of supporting to all signal genes (blue/(blue+red+green))
8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
9: Original node support values from the input tree
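Modes 4-8 are simple ratios over the per-node gene counts. A sketch with made-up counts (60 supporting, 25 conflicting, 15 no-signal) showing how modes 4 and 7 differ:

```shell
# Hypothetical per-node counts: blue = supporting, conflict = red+green, gray = no signal.
blue=60; conflict=25; gray=15
total=$((blue + conflict + gray))
# Mode 4: supporting / all genes.
awk -v b="$blue" -v t="$total" 'BEGIN { printf "mode4=%.2f\n", b / t }'
# Mode 7: supporting / genes with signal (excludes gray).
awk -v b="$blue" -v c="$conflict" 'BEGIN { printf "mode7=%.2f\n", b / (b + c) }'
```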
Command examples:
# Run HybSuite stage4 with IQ-TREE
$ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt AUTO -process 5 -sp_tree 1
# Run HybSuite stage4 with ASTRAL-IV
$ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt AUTO -process 5 -sp_tree 4
# Run HybSuite stage4 with ASTRAL-IV and PhyParts
$ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt AUTO -process 5 -sp_tree 4 -run_phyparts TRUE
Parameters for running hybsuite full_pipeline
HybSuite full pipeline Manual
--------------------------------------------------------------------------------
Usage: hybsuite full_pipeline ...
Mandatory arguments: -input_list -input_data (required when including user-provided data) -t -output_dir
Essential arguments: -PH -sp_tree -seqs_min_length -aln_min_sample -prefix -nt -process
Arguments for inputs:
-input_list <FILE> The file listing input sample names and corresponding data types. (Default: None)
-input_data <DIR> The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).
-t <FILE> Target file for data assembly. (follows the format required in HybPiper)
Arguments for outputs:
-output_dir <DIR> The output directory for all pipeline results. (Default: None)
-NGS_dir <DIR> The output directory containing raw and cleaned reads files (see GitHub documentation).
Note: Pre-existing cleaned reads in this directory will skip the read-trimming steps.
-eas_dir <DIR> The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
Note: Pre-existing data in this directory will skip redundant assembly steps.
-prefix <STRING> Prefix for output files. (Default: HybSuite)
General arguments:
=== Stages running control ===
-skip_stage <1|2|3|12|123>
Specify pipeline stages to skip during execution. (Default: None, running all stages)
Note: Particularly useful for re-running specific HybSuite pipeline stages.
(e.g., '-skip_stage 1' for skipping stage 1)
-run_to_stage <1|2|3> Specify pipeline stages to run up to (Default: None, running all stages)
(e.g., '-run_to_stage 3' for stopping before stage 4)
=== Public raw reads downloading control (Stage 1) ===
-rm_sra <TRUE/FALSE> Whether to remove SRA files after conversion. (Default: TRUE)
-download_format <fastq|fastq_gz>
Downloaded data format. (Default: fastq_gz)
=== Putative paralogs filtering control (Stage 2) ===
-seqs_min_length <INT>
Minimum sequence length for filtered paralogs. (Default: 0)
Putative paralogs shorter than this value will be filtered.
-seqs_mean_length_ratio <0-1>
Minimum sequence length ratio relative to the mean value per locus for putative paralogs. (Default: 0)
Putative paralogs shorter than this fraction of the mean length will be filtered.
-seqs_max_length_ratio <0-1>
Minimum length ratio relative to the longest sequence per locus for putative paralogs. (Default: 0)
Putative paralogs shorter than this percentage of the maximum length will be filtered.
-seqs_min_sample_coverage <0-1>
Minimum sample coverage for putative paralogs. (Default: 0)
For all putative paralogs in stage 2, and HRS and RLWP sequences in stage 3, loci with sample coverage below this threshold will be filtered.
-seqs_min_locus_coverage <0-1>
Minimum locus coverage for putative paralogs. (Default: 0)
For all putative paralogs in stage 2, taxa (samples) with locus coverage below this threshold will be filtered.
=== Heatmap control (Stage 2&3) ===
-heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
Color scheme for heatmap gradient. (Default: black)
=== Paralog handling control (Stage 3) ===
-PH <1-7|a|b|all> Paralog handling methods to execute: (one or more of them can be chosen)
1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
a: PhyloPyPruner, b: ParaGone (Default: 1a)
=== Sequences and alignments filtering control (Stage 3) ===
-seqs_min_length <INT>
Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
HRS and RLWP sequences shorter than this value will be removed.
-aln_min_length <INT>
Minimum sequence bp length for filtering HRS and RLWP final alignments. (Default: 4)
-aln_min_sample <INT>
Minimum sample number for final alignments. (Default: 5)
Final alignments (aligned and trimmed) with sample number below this threshold will be removed.
=== Alignments trimming tool control (Stage 3) ===
-trim_tool <1/2> Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1)
=== Gene trees builder control (Stage 3&4) ===
-gene_tree <1/2> Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1)
-gene_tree_bb <INT> Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)
=== Species tree builder control (Stage 4) ===
-sp_tree <1-6|all> Species tree inference method: (Default: 1)
1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL, 6: ASTRAL-Pro
=== Steps control in stage 4 ===
-run_coalescent_step <INT>
Control which coalescent analysis steps to run:
1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
(Default: 1234)
-run_concatenated_step <INT>
Control which concatenated analysis steps to run:
1: Construct concatenated alignment, 2: Infer species tree
(Default: 12)
=== Nucleotide ambiguity character replacement (Stage 3&4) ===
-replace_n <TRUE|FALSE>
Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).
=== Gene trees collapse threshold ===
-collapse_threshold <VALUE>
Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
Nodes with support values ≤ this threshold will be collapsed into polytomies.
=== Threads Control ===
-nt <INT|AUTO> Global thread setting. (Default: 1)
-nt_fasterq_dump <INT>
fasterq-dump threads. (Default: 1)
-nt_pigz <INT> pigz compression threads. (Default: 1)
-nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)
-nt_hybpiper <INT> HybPiper threads (Default: 1)
-nt_paragone <INT> ParaGone threads. (Default: 1)
-nt_phylopypruner <INT>
PhyloPyPruner threads. (Default: 1)
-nt_mafft <INT> MAFFT threads. (Default: 1)
-nt_amas <INT> AMAS.py threads. (Default: 1)
-nt_modeltest_ng <INT>
ModelTest-NG threads. (Default: 1)
-nt_iqtree <INT> IQ-TREE threads. (Default: 1)
-nt_fasttree <INT> FastTree threads. (Default: 1)
-nt_raxml_ng <INT> RAxML-NG threads (Default: 1)
-nt_raxml <INT> RAxML threads (Default: 1)
-nt_astral4 <INT> ASTRAL-IV threads (Default: 1)
-nt_wastral <INT> wASTRAL threads (Default: 1)
-nt_astral_pro <INT> ASTRAL-Pro3 threads (Default: 1)
=== Parallel Control ===
-process <INT|all> Number of subprocesses to run concurrently. (Default: 1)
"all" means running all subprocesses concurrently. (use this option with caution)
The related steps are:
Stage 1: public data downloading and raw reads trimming;
Stage 2: data assembly ('hybpiper assemble');
Stage 3: multiple sequences aligning, alignments trimming, and gene trees inference;
Stage 4: gene trees inference in coalescent analysis.
=== Logfile Control ===
-log_mode <simple|cmd|full>
The output mode of hybsuite logfile. (Default: cmd)
Arguments for integrated tools:
=== SRAToolkit (Stage 1) ===
-sra_maxsize <NUM> The maximum size of SRA files to download. (Default: 20GB)
=== Trimmomatic (Stage 1) ===
-trimmomatic_leading_quality <3-40>
Leading base quality cutoff. (Default: 3)
-trimmomatic_trailing_quality <3-40>
Trailing base quality cutoff. (Default: 3)
-trimmomatic_min_length <36-100>
Minimum read length. (Default: 36)
-trimmomatic_sliding_window_s <4-10>
Sliding window size. (Default: 4)
-trimmomatic_sliding_window_q <15-30>
Window average quality. (Default: 15)
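These defaults map onto Trimmomatic's step syntax (LEADING, TRAILING, SLIDINGWINDOW, MINLEN); a sketch assembling the equivalent step string from the HybSuite defaults (the shell variable names are illustrative, not HybSuite internals):

```shell
# HybSuite defaults mapped onto Trimmomatic trimming steps.
leading=3; trailing=3; window_size=4; window_qual=15; min_len=36
steps="LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${window_qual} MINLEN:${min_len}"
echo "$steps"
```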
=== HybPiper (Stage 2 & 3) ===
-hybpiper_mapping_tool <blast|diamond>
The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
-hybpiper_check_chimeric_contigs <FALSE|TRUE>
Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: FALSE)
-hybpiper_cov_cutoff <INT>
Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
Increasing this value may improve locus recovery but can introduce errors.
-hybpiper_skip_chimeric_genes <FALSE|TRUE>
Whether to recover sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
-hybpiper_retrieved_seqs_type <dna|intron|supercontig>
The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3.
=== PhyloPyPruner (Stage 3) ===
-pp_min_taxa <INT> Minimum taxa per cluster. (Default: 4)
-pp_min_support <0-1> Minimum support value. (Default: 0=auto)
-pp_trim_lb <INT> Trim long branches. (Default: 5)
=== ParaGone (Stage 3) ===
-paragone_pool <INT> Parallel alignment tasks. (Default: 1, same as the option '-process')
-treeshrink_q_value <0-1>
TreeShrink quantile threshold (Default: 0.05)
-paragone_cutoff_value <FLOAT>
Branch length cutoff (Default: 0.3)
-paragone_minimum_taxa <INT>
Minimum taxa per alignment (Default: 4)
-paragone_min_tips <INT>
Minimum tips per tree (Default: 4)
=== TreeShrink (Stage 3) ===
-treeshrink_q_value <0-1>
TreeShrink quantile threshold (Default: 0.05)
=== MAFFT (Stage 3) ===
-mafft_algorithm <str>
MAFFT algorithm [auto|linsi] (Default: auto)
-mafft_adjustdirection <TRUE/FALSE>
Whether to adjust sequence directions (Default: TRUE)
-mafft_maxiterate <INT>
Maximum number of iterations for MAFFT (Default: auto)
Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
-mafft_pair <str>
Pairing strategy for MAFFT (Default: auto)
Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
=== trimAl (Stage 3) ===
-trimal_mode <str>
trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
-trimal_gapthreshold <0-1>
Gap threshold (Default: 0.12)
-trimal_simthreshold <0-1>
Similarity threshold (Default: auto)
-trimal_cons <0-100>
Consensus threshold (Default: auto)
-trimal_block <INT>
Minimum block size (Default: auto)
-trimal_w <INT>
Window size (Default: auto)
-trimal_gw <INT>
Gap window size (Default: auto)
-trimal_sw <INT>
Similarity window size (Default: auto)
-trimal_resoverlap <0-1>
Minimum overlap of a position with other positions in the column. (Default: auto)
-trimal_seqoverlap <0-100>
Minimum percentage of sequences without gaps in a column. (Default: auto)
=== HMMCleaner (Stage 3) ===
-hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
Cost parameters that define the low-similarity segments detected by HMMCleaner. (Default: -0.15_-0.08_0.15_0.45)
Users can change each value, but the values must be in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)
=== IQ-TREE (Stage 4) ===
-iqtree_bb <INT> IQ-TREE bootstrap replicates (Default: 1000)
-iqtree_alrt <INT> SH-aLRT replicates (Default: 1000)
-iqtree_run_option <str>
IQ-TREE run mode [standard|undo] (Default: undo)
-iqtree_partition <TRUE/FALSE>
Whether to use partition models in IQ-TREE (Default: TRUE)
-iqtree_constraint_tree <Treefile>
The path to the constraint tree for running IQ-TREE (Default: none)
=== ModelTest-NG (Stage 4) ===
-run_modeltest_ng <TRUE/FALSE>
Whether to run ModelTest-NG (Default: TRUE)
=== RAxML (Stage 4) ===
-raxml_m <str> RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
-raxml_bb <INT> RAxML bootstrap replicates (Default: 1000)
-raxml_constraint_tree <Treefile>
The path to the constraint tree for running RAxML (Default: no constraint tree)
=== RAxML-NG (Stage 4) ===
-rng_bs_trees <INT> RAxML-NG bootstrap replicates (Default: 1000)
-rng_force <TRUE/FALSE>
Ignore thread warnings (Default: FALSE)
-rng_constraint_tree <Treefile>
The path to the constraint tree for running RAxML-NG (Default: no constraint tree)
=== ASTRAL-IV (Stage 4) ===
-astral4_root Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
(Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)
-astral_r <INT> ASTRAL-IV rounds of search. (Default: 4)
-astral_s <INT> ASTRAL-IV rounds of subsampling. (Default: 4)
=== wASTRAL (Stage 4) ===
-wastral_mode <1-4> wASTRAL mode [1|2|3|4] (Default: 1)
1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
-wastral_r <INT> wASTRAL rounds of search. (Default: 4)
-wastral_s <INT> wASTRAL rounds of subsampling. (Default: 4)
=== ASTRAL-Pro (Stage 4) ===
-astral_pro_r <INT> ASTRAL-Pro rounds of search. (Default: 4)
-astral_pro_s <INT> ASTRAL-Pro rounds of subsampling. (Default: 4)
=== PhyPartsPieCharts & modified_phypartspiecharts (Stage 4) ===
-run_phyparts <TRUE|FALSE>
Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
Note: Requires successful completion of previous coalescent analysis.
-phypartspiecharts_tree_type <cladogram/circle>
Tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
-phypartspiecharts_num_mode <num>
Control which numbers to show on branches (specify a string of digits from the list below) (Default: 12)
0: Hide all numbers
1: Number of genes supporting species tree (blue)
2: Number of genes conflicting with species tree (red+green)
3: Number of genes with no signal (gray)
4: Proportion of supporting genes (blue/total)
5: Proportion of conflicting genes ((red+green)/total)
6: Proportion of no signal genes (gray/total)
7: Ratio of supporting to all signal genes (blue/(blue+red+green))
8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
9: Original node support values from the input tree
Command examples:
=== Run the full pipeline with all paralog-handling methods and all species trees inference approaches ===
hybsuite full_pipeline \
-input_list ./Input_list.txt \
-input_data ./Input_data \
-t Angiosperms353.fasta \
-PH 1234567 \
-sp_tree 12345 \
-output_dir ./ \
-nt 5 -process 5
=== Run the full pipeline with only tree-based orthology inference methods (MO/MI/RT/1to1) in ParaGone and ASTRAL-IV ===
hybsuite full_pipeline \
-input_list ./Input_list.txt \
-input_data ./Input_data \
-t Angiosperms353.fasta \
-PH 4567b \
-sp_tree 4 \
-output_dir ./ \
-nt 5 -process 5
5 - Tutorial
This page helps you prepare input files, configure parameters, and run HybSuite.
Tip
The HybSuite pipeline requires three main inputs:
- (1) a sample list file
- (2) a directory containing sequence files (raw NGS data and pre-assembled sequences)
- (3) a target sequence file
(1) The sample list file
This file should list sample names along with their corresponding sequence type identifiers (separated by \t). The HybSuite pipeline supports three input sequence types:
Type1: public raw reads from NCBI SRA
To download NGS raw reads from NCBI for phylogenetic analysis, you should:
Format the sample list file as:
- Column 1: Sample names
- Column 2: Corresponding accession numbers
(Tab-delimited, one sample per line)
Taxon1 SRR...
Taxon2 SRR...
...
Type2: user-provided raw reads
To prepare existing samples’ NGS raw reads as input, you should:
1. Format the sample list file as:
- Column 1: Sample names
- Column 2: The character A
(Tab-delimited, one sample per line)
Taxon3 A
Taxon4 A
...
2. Place the raw data files (paired-end/single-end, FASTQ or FASTQ.GZ format) with corresponding names in the -input_data directory (see naming rules here).
Type3: pre-assembled sequences
To prepare pre-assembled sequences as input, you should:
1. Format the sample list file as:
- Column 1: Sample names
- Column 2: The character B
(Tab-delimited, one sample per line)
Taxon5 B
Taxon6 B
...
2. Place the corresponding pre-assembled sequence files in the -input_data directory (see naming rules here).
Combine sequence types together
To include multiple sequence types shown above as pipeline inputs, simply combine all entries in the sample list file:
Taxon1 SRR...
Taxon2 SRR...
Taxon3 A
Taxon4 A
Taxon5 B
Taxon6 B
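A mixed sample list like the one above can be written with literal tab separators, e.g. with printf (the taxon names are placeholders and "SRR..." stands for a real accession):

```shell
# Write a tab-delimited sample list, one sample per line.
printf '%s\t%s\n' \
    'Taxon1' 'SRR...' \
    'Taxon3' 'A' \
    'Taxon5' 'B' > Input_list.txt
cat Input_list.txt
```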
Outgroup Specification
This step is unnecessary if you only run stage1 and stage2, but it is required for stages 3-4 (orthology inference & species tree inference):
To specify outgroups in the sample list file, mark them with the characters Outgroup in column 3 (tab-separated from column 2).
Example (Outgroup = Taxon5):
Taxon1 SRR...
Taxon2 SRR...
Taxon3 A
Taxon4 A
Taxon5 B Outgroup
Taxon6 B
Note
- The three sequence types may be recorded in any order.
- The sample list file may be named with a different suffix, such as .txt, .tsv …
Warning
- At least one sequence type must be recorded in the sample list file.
- Make sure no sample names in the first column are duplicated.
- Any mistake in the second column's characters will cause running errors.
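A quick pre-flight check for duplicated sample names, using standard Unix tools (the demo file name is hypothetical and includes a deliberate duplicate to show the output):

```shell
# Any name printed here appears more than once; empty output means no duplicates.
printf 'Taxon1\tSRR...\nTaxon2\tA\nTaxon1\tB\n' > Input_list_demo.txt
cut -f1 Input_list_demo.txt | sort | uniq -d
```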
(2) The directory containing sequence files
Required if your sample list includes user-provided raw reads or pre-assembled sequences.
If <Taxon> is listed in the sample list file, its raw data file should be named as follows:
- Paired-end data: <Taxon>_1.<suffix> + <Taxon>_2.<suffix>
- Single-end data: <Taxon>.<suffix>
Tip
- Both paired-end and single-end data are supported.
- Supported <suffix>: .fq.gz, .fq, .fastq.gz, or .fastq.
Note
Filenames must exactly match the sample names in your list. Mismatches will cause errors.
If <Taxon> is listed in the sample list file, its pre-assembled sequence file should be named as follows:
(3) The target sequence file
- The required format for the target sequence file is nearly identical to that used in HybPiper.
- The only difference is that in HybSuite, the gene name for a target sequence must be placed immediately after the final hyphen (-) in the header line (see the example shown below).
- For example, see the Reference.fasta file in the example dataset.
- For more details, refer to HybPiper's documentation to edit your target file.
>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
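Because the gene name must follow the final hyphen in each header, you can list the gene names a target file implies with standard tools; a sketch that builds a tiny demo file (the file name is hypothetical) and extracts the identifiers:

```shell
# Build a tiny target file and print the gene identifier after the last
# hyphen of every FASTA header.
printf '>Elaeagnus-pungens-4471\nAATGTCATCC\n>Hippophae-rhamnoides-4527\nGAAGAGAGGG\n' > target_demo.fasta
grep '^>' target_demo.fasta | sed 's/.*-//'
```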
2. Construct command
(1) Basic command pattern
Conda version:
hybsuite <subcommand> [options] ...
Local version:
bash <path to HybSuite.sh> <subcommand> [options] ...
(2) Available subcommands
HybSuite provides modular subcommands for flexible workflow execution:
Run individual stages:
hybsuite stage1 [options]... # run Stage 1: NGS dataset construction
hybsuite stage2 [options]... # run Stage 2: Data assembly and paralog retrieval
hybsuite stage3 [options]... # run Stage 3: Paralog handling
hybsuite stage4 [options]... # run Stage 4: Species tree inference
Run the full pipeline:
hybsuite full_pipeline [options]... # Execute stages 1-4 in one go
Retrieve results:
After completing the full pipeline from stage 1 to 4, retrieve key output files:
hybsuite retrieve_results -i <hybsuite_comprehensive_output_dir> -o <results_dir>
This subcommand collects all final trees and summary statistics from the HybSuite comprehensive output directory.
Tip
Use hybsuite <subcommand> --help or refer to the Full pipeline parameters to view detailed parameters for each subcommand.
This section guides you through running each stage sequentially and configuring the related parameters.
Once you have prepared all required input files (the sample list file, the input sequence directory, and the target sequence file) in the formats described above, you can run each stage sequentially or the full pipeline in one go.
HybSuite checking
Before running each stage or the full pipeline, HybSuite automatically checks all dependencies, configured parameters, and sample information. If any invalid parameters or incorrectly formatted input files are detected, the program notifies you and exits.
After the checks, the program prompts you to confirm whether to proceed with running HybSuite (y) or not (n).
To skip checking, specify -check as FALSE when running the pipeline.
Run hybsuite stage1
Purpose: run HybSuite Stage 1: "NGS dataset construction": download public raw reads, integrate user-provided data, and perform adapter trimming.
The following parameters must be specified when running hybsuite stage1 (failing to do so will cause HybSuite to exit during execution):
(1) Configure input file parameters:
-input_list <FILE>
Specify the sample list file (as described in the section (1) The sample list file).
-input_data <DIR>
Specify the directory containing all user-provided raw reads and pre-assembled sequences (unnecessary if the sample list includes no user-provided raw reads or pre-assembled sequences).
(2) Configure output file parameters:
-output_dir <DIR>
Specify the directory for storing HybSuite's comprehensive output files. Using the same directory across stages is recommended so that all outputs land in one folder.
The following parameters are essential for running this stage; check whether to configure them depending on your analysis.
-nt <NUM|AUTO>
Number of threads to use for HybSuite. All integrated tools will use this value. (Default: 1)
-process <NUM>
Specify the number of samples processed in parallel (Default: 1). This applies to the "raw read downloading" and "adapter trimming" steps in Stage 1.
-sra_maxsize <?GB>
Specify the maximum SRA file size to download (Default: 20GB). Samples whose raw read files on NCBI exceed this size will be skipped during downloading.
-NGS_dir <DIR>
Specify the output directory for raw and clean read files. (Default: <output_dir>/NGS_dataset, where <output_dir> is specified by -output_dir)
See the full parameters for running hybsuite stage1 here for more customizable settings.
Example command:
# basic common pattern:
hybsuite stage1 -input_list <FILE> -input_data <DIR> -output_dir <DIR> -nt <NUM|AUTO> -process <NUM>
# command for our example dataset (Angiosperms353) (8 threads; 5 samples in parallel)
cd <Path_to_"HybSuite-master/example_datasets/Angiosperms353/">
hybsuite stage1 -input_list ./Input_list.txt -input_data ./Input_data -output_dir ./Output -nt 8 -process 5
Step 4: Check output files
After running stage 1, you can check the output files for your analysis.
See the output files for hybsuite stage1 here for more details.
Run hybsuite stage2
Purpose: run HybSuite Stage 2: "Data assembly and paralog retrieval": assemble reads using HybPiper, retrieve paralog sequences, and filter them by length, sample coverage, or locus coverage.
(1) Input file parameters
-input_list <FILE>
Sample list file (same as stage 1; see here).
-NGS_dir <DIR>
NGS dataset output in stage 1. (Default: <output_dir>/NGS_dataset)
-input_data <DIR>
The same parameter as specified in Stage 1. This option is required only when pre-assembled sequences are provided as input.
-t <FILE>
Target sequence file in HybPiper format.
Tip
If you have existing cleaned read files, you can place them in the directory <NGS_dir>/03-My_clean_data (<NGS_dir> is the directory specified by -NGS_dir); HybSuite will skip processing these samples to save running time.
For example, if you have existing clean data (paired-end) for <taxon1> and clean data (single-end) for <taxon2>, place them with the corresponding file names shown below so HybSuite skips these two samples:
<NGS_dir>/
├── 01-Downloaded_raw_data
├── 02-Downloaded_clean_data
└── 03-My_clean_data
    ├── <taxon1>_1_clean.paired.fq.gz
    ├── <taxon1>_2_clean.paired.fq.gz
    └── <taxon2>_clean.paired.fq.gz
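Staging the two pre-cleaned samples above can be sketched as follows (paths and taxon names are placeholders; `touch` stands in for copying your real read files):

```shell
NGS_dir=./NGS_dataset                     # whatever you pass to -NGS_dir
mkdir -p "${NGS_dir}/03-My_clean_data"
# Paired-end sample taxon1: two files with the _1/_2 clean-read name pattern.
touch "${NGS_dir}/03-My_clean_data/taxon1_1_clean.paired.fq.gz" \
      "${NGS_dir}/03-My_clean_data/taxon1_2_clean.paired.fq.gz"
# Single-end sample taxon2: one file.
touch "${NGS_dir}/03-My_clean_data/taxon2_clean.paired.fq.gz"
ls "${NGS_dir}/03-My_clean_data"
```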
(2) Output file parameters
-output_dir <DIR>
Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
-eas_dir <DIR>
Specify the output directory for storing assembled results (one sample per subdirectory) generated by hybpiper assemble (see HybPiper manual).
(Default: <output_dir>/01-Assembled_data)
Tip
- If you already have assembled results generated by hybpiper assemble (one sample per subdirectory), you can place them in the directory specified with -eas_dir. HybSuite will automatically skip these samples during the data assembly stage.
For example, if you have directories of assembled sequences output by hybpiper assemble for <taxon1> and <taxon2>:
- Create a directory, place the <taxon1> and <taxon2> directories in it, and specify -eas_dir as the path to this new directory (shown below; <eas_dir> is the directory specified by -eas_dir).
<eas_dir>/
├── <taxon1>
└── <taxon2>
- Then, include the names <taxon1> and <taxon2> in the sample list file (specified by -input_list).
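The staging steps above can be sketched as (directory and taxon names are placeholders; in practice you would `mv` real hybpiper assemble output directories):

```shell
eas_dir=./Assembled_data            # the path you will pass to -eas_dir
mkdir -p "${eas_dir}/taxon1" "${eas_dir}/taxon2"   # one subdirectory per sample
# In practice: mv <path_to_hybpiper_output>/taxon1 "${eas_dir}/"
ls "${eas_dir}"
```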
The following parameters are optional but important for this stage; check whether to configure them depending on your analysis.
(1) Thread and parallel:
-nt <NUM|AUTO>
Number of threads to use for HybSuite. All integrated tools will use this value. (Default: 1)
-process <NUM>
Specify the number of samples processed in parallel (Default: 1). This applies to the "data assembly" step in stage 2.
-eas_dir <DIR>
Specify existing assembled data to skip redundant assembly.
(2) Paralog sequence filtering:
-seqs_min_sample_coverage <NUM:0-1>
Specify the minimum sample coverage of recovered loci. Loci with sample coverage below this threshold are removed. (Default: 0; recommended: 0.1)
-seqs_min_locus_coverage <NUM:0-1>
Specify the minimum locus coverage of samples. Samples with locus coverage below this threshold are removed. (Default: 0)
-seqs_min_length <NUM>
Specify the minimum sequence length for filtering paralog sequences. Sequences shorter than this threshold are removed. (Default: 0; recommended: 100)
-seqs_mean_length_ratio <NUM:0-1>
Specify the minimum length ratio relative to the mean length of target sequences. Sequences with a length ratio below this threshold are removed. (Default: 0)
-seqs_max_length_ratio <NUM:0-1>
Specify the minimum length ratio relative to the maximum length of target sequences. Sequences with a length ratio below this threshold are removed. (Default: 0)
-seqs_min_length_ratio <NUM:0-1>
Specify the minimum length ratio relative to the minimum length of target sequences. Sequences with a length ratio below this threshold are removed. (Default: 0)
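The ratio filters all work the same way; a sketch of the mean-length-ratio filter with made-up numbers (mean locus length 800 bp and a ratio of 0.5, so the cutoff is 400 bp):

```shell
# Keep only sequences at least ratio * mean_length bp long (illustrative values,
# not HybSuite internals).
awk 'BEGIN {
    mean = 800; ratio = 0.5; cutoff = mean * ratio
    n = split("350 420 790 810", len, " ")
    kept = 0
    for (i = 1; i <= n; i++) if (len[i] >= cutoff) kept++
    printf "cutoff=%d kept=%d of %d\n", cutoff, kept, n
}'
```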
(3) Parameters related to hybpiper assemble:
-hybpiper_mapping_tool <blast|diamond>
Specify the read mapping tool used in data assembly via hybpiper assemble. (Default: blast)
-hybpiper_check_chimeric_contigs <TRUE|FALSE>
Specify whether to check for chimeric contigs. (Default: FALSE)
-hybpiper_cov_cutoff <INT>
Coverage cutoff for SPAdes when running "hybpiper assemble" in Stage 2.
Increasing this value may improve locus recovery but can introduce errors.
(Default: 8)
See the full parameters for running hybsuite stage2 here for more customizable settings.
Example command
# Recommended command mode
hybsuite stage2 \
-input_list <FILE> \
-NGS_dir <DIR> \
-t <FILE> \
-output_dir <DIR> \
-seqs_min_length <NUM> \
-seqs_min_sample_coverage <NUM> \
-nt <NUM> -process <NUM>
# Command for the example dataset (Angiosperms353)
hybsuite stage2 \
-input_list ./Input_list.txt \
-NGS_dir ./NGS_dataset \
-t ./Target_file_Angiosperm353.fasta \
-output_dir ./Output \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-nt 8 -process 5
Step 4: Check output files
After running stage 2, you can check the output files for your analysis.
See the output files for hybsuite stage2 here for more details.
Run hybsuite stage3
Purpose: run HybSuite Stage 3, "Paralog handling": optionally apply seven paralog-handling methods to infer orthology groups and generate final alignments for Stage 4 (species tree inference).
(1) Input file parameters
-input_list <FILE>
Sample list file (same as in Stage 1; see here for details).
-eas_dir <DIR>
Directory containing assembled sequences (one sample per subdirectory) generated by hybpiper assemble (see the HybPiper manual) in Stage 2 or provided by the user. (default: <output_dir>/NGS_dataset)
-paralogs_dir <DIR>
Directory containing all paralog sequences generated in Stage 2 or provided by the user.
If Stage 2 has been executed, set this parameter to <output_dir>/02-All_paralogs/03-Filtered_paralogs, where <output_dir> refers to the comprehensive output directory specified by -output_dir. (default: none)
-input_data <DIR>
The same parameter as specified in Stage 1.
This option is required only when pre-assembled sequences are provided as input and the "HRS" or "RLWP" method is selected.
-t <FILE>
Target sequence file in HybPiper format.
(2) Output file parameters
-output_dir <DIR>
Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
-prefix <STRING>
Output file prefix. (default: HybSuite)
The following parameters are optional but essential for running this stage; check whether to configure them based on your analysis.
(1) Thread and parallel:
-nt <NUM|AUTO>
Number of threads to use for HybSuite. All integrated tools will use this value. (default: 1)
-process <NUM>
Specify the number of samples processed in parallel. (default: 1)
(2) Paralog-handling methods applied in this stage:
-PH <1-7|a|b|all>
Paralog handling methods (1=HRS, 2=RLWP, 3=LS, 4=MI, 5=MO, 6=RT, 7=1to1, a=PhyloPyPruner, b=ParaGone)
(One or several methods can be specified; default: 1a)
(3) Sequence filtering:
-seqs_min_length <INT>
Minimum HRS/RLWP sequence length. (default: 0)
Only the HRS and RLWP methods filter sequences in this stage; other paralog-handling methods filter sequences in Stage 2.
-aln_min_length <INT>
Minimum alignment length. (default: 4)
-aln_min_sample <INT>
Minimum number of samples per alignment. (default: 0)
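As an illustration of what -aln_min_sample does (a sketch with toy data, not HybSuite's internal code), alignments can be screened by counting their FASTA headers:

```shell
mkdir -p alignments kept
# Two toy alignments: locus1 has 3 samples, locus2 only 1 (hypothetical data).
printf '>s1\nAC-T\n>s2\nACGT\n>s3\nAC-T\n' > alignments/locus1.aln.fasta
printf '>s1\nACGT\n' > alignments/locus2.aln.fasta

# Keep only alignments with at least MIN_SAMPLE sequences.
MIN_SAMPLE=2
for aln in alignments/*.fasta; do
  if [ "$(grep -c '^>' "$aln")" -ge "$MIN_SAMPLE" ]; then
    cp "$aln" kept/
  fi
done
```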
(4) Gene tree construction:
-gene_tree <1|2>
Gene tree builder: 1=IQ-TREE, 2=FastTree. (default: 1)
-gene_tree_bb <INT>
Bootstrap value. (default: 1000)
-trim_tool <1|2>
Trimming tool: 1=trimAl, 2=HMMCleaner. (default: 1)
See the full parameters for running hybsuite stage3 here for more customizable settings.
Example command:
# Recommended command
# The choice of -PH can depend on your data size and time schedule, but trying
# all methods is still suggested, including "4567b" with ParaGone.
hybsuite stage3 \
-input_list <FILE> \
-eas_dir ./01-Assembled_data \
-paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
-t <FILE> \
-output_dir <DIR> \
-PH 1234567a \
-nt 8 -process 5
# Command for the example dataset (Angiosperms353)
hybsuite stage3 \
-input_list ./Input_list.txt \
-eas_dir ./01-Assembled_data \
-paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
-t Target_file_Angiosperm353.fasta \
-output_dir ./Output \
-PH 1234567a \
-mafft_algorithm linsi \
-nt 8 -process 5
Step 4: Check output files
After running stage 3, you can check the output files for your analysis.
See the output files for hybsuite stage3 here for more details.
Run hybsuite stage4
Purpose: run HybSuite Stage 4, "Species tree inference": infer species trees using concatenation-based and/or coalescent-based approaches.
(1) Configure input file parameters
-input_list <FILE>
Sample list file (same as in Stages 1-3; see here for details).
-aln_dir <DIR>
The path to the directory 06-Final_alignments containing the final alignments generated in Stage 3.
(2) Configure output file parameters
-output_dir <DIR>
Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
The following parameters are optional but essential for running this stage; check whether to configure them based on your analysis.
-PH <1-7|a|b|all>
Select alignments from one or more specific paralog-handling methods. Keep this consistent with the option used in Stage 3. (default: 1)
-sp_tree <1-6|all>
Species tree inference methods: 1=IQ-TREE, 2=RAxML, 3=RAxML-NG, 4=ASTRAL-IV, 5=wASTRAL, 6=ASTRAL-Pro. (default: 1)
-nt <INT|AUTO>
Thread setting.
Gene tree construction (for coalescent methods):
-gene_tree <1|2>
Gene tree builder. (default: 1)
-gene_tree_bb <INT>
Bootstrap value. (default: 1000)
-collapse_threshold <VALUE>
Collapse weakly supported branches. (default: 0)
Concatenation-based methods:
-run_modeltest_ng <TRUE|FALSE>
Run ModelTest-NG. (default: TRUE)
-iqtree_bb <INT>
IQ-TREE bootstrap replicates. (default: 1000)
-iqtree_partition <TRUE|FALSE>
Use partition models. (default: TRUE)
Coalescent-based methods:
-astral_r <INT>
ASTRAL-IV search rounds. (default: 4)
-wastral_mode <1-4>
wASTRAL weighting mode. (default: 1)
-run_phyparts <TRUE|FALSE>
Run PhyParts analysis. (default: TRUE)
Example command:
# Concatenation-based method (IQ-TREE)
hybsuite stage4 \
-input_list <FILE> \
-aln_dir <output_dir>/06-Final_alignments \
-output_dir <DIR> \
-PH 1234567a \
-sp_tree 1 \
-nt 8
# Coalescent-based method (ASTRAL-IV)
hybsuite stage4 \
-input_list sample_list.txt \
-aln_dir <output_dir>/06-Final_alignments \
-output_dir <DIR> \
-PH 1a \
-sp_tree 4 \
-run_phyparts TRUE \
-nt 8 -process 5
Run hybsuite full_pipeline
Purpose: run stages 1-4 sequentially in a single command.
Note
Running the full pipeline with a single command is convenient, but it can also be risky if intermediate outputs aren't carefully checked.
We therefore recommend running each stage separately first to review the results and understand the workflow before using this end-to-end execution.
-input_list <FILE>
Sample list file.
-t <FILE>
Target file.
-output_dir <DIR>
Output directory.
-input_data is required when including user-provided data.
Methods control:
-PH <1-7|a|b|all>
Paralog handling methods. (default: 1a)
-sp_tree <1-6|all>
Species tree inference methods. (default: 1)
Threading:
-nt <INT|AUTO>
Global thread setting. (default: 1)
-process <INT|all>
Parallel processing. (default: 1)
Workflow control:
-skip_stage <1|12|123>
Skip completed stages. (default: none)
-run_to_stage <1|2|3|4>
Stop at a specific stage. (default: 4)
Logging control:
-log_mode <simple|cmd|full>
Logging verbosity. (default: cmd)
- simple: log key information only
- cmd: log key information + command history
- full: log detailed information + command history
Example command:
# Complete pipeline with all paralog-handling methods
hybsuite full_pipeline \
-input_list sample_list.txt \
-input_data Input_data \
-t Angiosperms353.fasta \
-output_dir ./ \
-PH 1234567ab \
-sp_tree 12345 \
-seqs_min_length 100 \
-aln_min_sample 4 \
-nt AUTO -process 10
# Pipeline with existing NGS data, skip stage 1
hybsuite full_pipeline \
-input_list sample_list.txt \
-NGS_dir ./NGS_dataset \
-t Angiosperms353.fasta \
-output_dir ./ \
-skip_stage 1 \
-PH 1a \
-sp_tree 14 \
-nt 8 -process 5
See the full parameters for running hybsuite full_pipeline here for more customizable settings.
4. Tips for rerunning the pipeline
If the initial results of HybSuite are not satisfactory, you can rerun the pipeline with modified parameters. The following strategies can help improve results or reduce runtime. These methods can be used individually or combined.
(1) Remove or add samples
You can remove or add samples by editing the sample_list.txt file and rerunning HybSuite.
If the same output_dir is used, HybSuite will automatically detect completed samples and skip the following steps:
- Public data downloading (stage 1)
- Raw reads trimming (stage 1)
- Data assembly (stage 2)
This allows you to update the dataset without repeating previously completed computations.
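Conceptually, this skip logic is an existence check per sample, along these lines (an illustrative sketch; HybSuite's actual checks are more thorough):

```shell
# Toy setup: pretend taxon1 was assembled in a previous run (hypothetical names).
mkdir -p Output/01-Assembled_data/taxon1
printf 'taxon1\ntaxon2\n' > sample_list.txt

# Assemble only samples whose output directory does not exist yet.
while read -r taxon; do
  if [ -d "Output/01-Assembled_data/$taxon" ]; then
    echo "skip $taxon (already assembled)"
  else
    echo "assemble $taxon"
  fi
done < sample_list.txt > plan.txt
```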
(2) Reuse intermediate results
HybSuite allows reuse of intermediate results from previous runs.
Use the following options:
-NGS_dir
Use the NGS_dataset directory generated by a previous run to skip data downloading and adapter trimming.
-eas_dir
Use the 01-Assembled_data directory generated by a previous run to skip the data assembly step.
Reusing intermediate data can significantly reduce runtime when rerunning analyses.
(3) Adjust sequence filtering thresholds
You can improve dataset quality by adjusting filtering thresholds when rerunning the pipeline.
Common parameters include:
-seqs_min_length
Minimum sequence length retained in Stage 2.
-seqs_min_sample_coverage
Minimum proportion of samples containing a sequence.
-aln_min_sample
Minimum number of samples required for an alignment in Stage 3.
Increasing these thresholds can improve alignment quality and downstream analyses.
(4) Step control in concatenation-based and coalescent-based analyses
HybSuite allows selective execution of steps in both concatenation-based and coalescent-based analyses.
Concatenation-based analysis
Use -run_concatenated_step to specify which steps to run.
For example, if a concatenated alignment has already been generated in a previous run, you can skip the concatenation step and directly infer the species tree by setting:
-run_concatenated_step 2
Coalescent-based analysis
Use -run_coalescent_step to control the execution of coalescent-based analysis.
For example, if gene trees are already available from a previous run, you can skip step 1 and directly infer the species tree by setting:
-run_coalescent_step 234
6 - Installation
This page tells you how to install HybSuite step by step.
Important
HybSuite is a shell-based tool that invokes some Python and R scripts; it is only available for Linux/Unix/WSL/macOS users.
Tip
Installing HybSuite via conda is strongly recommended, though manual installation is also available.
1. Install HybSuite via conda
(1) Prerequisites
- Conda installation is required.
If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.
(2) Step-by-step installation
To avoid dependency conflicts, creating a new conda environment for HybSuite is recommended:
conda create -n hybsuite
Then activate your newly created conda environment and install hybsuite directly from the specified channel:
conda activate hybsuite
conda install yuxuanliu::hybsuite
Before installing hybsuite, you can edit your ~/.condarc file as follows to avoid conda channel issues during installation:
channels:
- conda-forge
- bioconda
- yuxuanliu
- defaults
Note
Our official Bioconda channel package is currently under review. Until it is approved, please use the command shown above.
(3) Verification
After installation, you can check the help menu of HybSuite to confirm successful installation by running:
hybsuite -h
2. Install HybSuite manually
(1) Prerequisites
- Conda installation is required.
If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.
(2) Package installation
Directly clone the GitHub repository:
git clone https://github.com/Yuxuanliu-HZAU/HybSuite.git
(3) Verification
After installation, you can check the help menu of HybSuite to confirm successful installation by running:
bash <absolute or relative path to HybSuite.sh> -h
(4) Dependencies installation
Method 1: Run Install_all_dependencies.sh to install all dependencies in one go (recommended)
The most convenient way to install all dependencies for HybSuite is to run our script HybSuite-master/Install_all_dependencies.sh directly.
Before running this script, activate your target conda environment.
conda activate <conda_environment_name>
bash HybSuite-master/Install_all_dependencies.sh
Method 2: Install dependencies manually
If some dependencies fail to install when running Install_all_dependencies.sh, it is advisable to install them manually. Follow these steps:
conda create -n <conda_environment_name>
conda activate <conda_environment_name>
conda install conda-forge::mamba -y
mamba install python=3.9.15 -y
mamba install bioconda::hybpiper -y
mamba install bioconda::paragone -y
mamba install bioconda::amas -y
mamba install bioconda::sra-tools -y
mamba install conda-forge::pigz -y
conda install conda-forge::plotly -y
mamba install bioconda::newick_utils -y
mamba install bioconda::mafft -y
mamba install bioconda::trimal -y
mamba install bioconda::iqtree -y
mamba install bioconda::raxml -y
mamba install bioconda::raxml-ng -y
mamba install bioconda::aster -y
mamba install r
pip install ete3
pip install PyQt5
pip install phylopypruner
pip install phykit
R
install.packages("phytools")
install.packages("ape")
3. Dependencies
7 - Output files
Output File Naming Conventions
| Placeholder | Represents |
|---|---|
| <PH> | Any of the 7 orthology inference methods: HRS, RLWP, LS, MI, MO, RT, 1to1 |
| <taxon> | Taxon name (e.g., <taxon1>, <taxon2>, etc.) from your sample list file |
| <prefix> | User-specified output prefix (via the -prefix option) |
| <locus_name> | Target sequence locus (e.g., <locus_name1>, <locus_name2>, etc.) |
Each output file is explained in detail below.
Stage1 output
<NGS_dataset> (specified by -d)
A directory containing next-generation raw sequencing data downloaded from public databases, existing raw reads provided by the user, and clean data produced by Trimmomatic-0.39.
<NGS_dataset>/
├── 01-Downloaded_raw_data/
├── 02-Downloaded_clean_data/
└── 03-My_clean_data/
<NGS_dataset> -> 01-Downloaded_raw_data
A directory containing next-generation raw sequencing data downloaded from public databases.
<NGS_dataset>/
└── 01-Downloaded_raw_data/
    ├── 01-Raw-reads_sra/
    └── 02-Raw-reads_fastq_gz/
<NGS_dataset> -> 01-Downloaded_raw_data -> 01-Raw-reads_sra
A directory containing raw sequencing data downloaded from NCBI in .sra format.
<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 01-Raw-reads_sra/
        ├── <taxon>.sra
        ...
<taxon>.sra: File with raw sequencing data in SRA format.
By default, all *.sra files in this directory will be removed after converting them into fastq format to save space, unless you specify the option -rm_sra as FALSE to keep them.
<NGS_dataset> -> 01-Downloaded_raw_data -> 02-Raw-reads_fastq_gz
A directory containing raw sequencing data in .fastq or .fastq.gz format.
<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 02-Raw-reads_fastq_gz/
        ├── <taxon>.fastq.gz or <taxon>.fastq
        ...
<taxon>.fastq.gz or <taxon>.fastq:
If the user specifies the option -download_format as fastq, pigz will not be used to compress the original .fastq files to .fastq.gz files, which will produce <taxon>.fastq in this folder.
If the user specifies the option -download_format as fastq.gz, pigz will be used to compress the original .fastq files to .fastq.gz files, which will produce <taxon>.fastq.gz in this folder.
Default: -download_format is specified as fastq.gz.
<NGS_dataset> -> 02-Downloaded_clean_data
A directory containing sequencing data cleaned from downloaded public raw reads.
<NGS_dataset>/
└── 02-Downloaded_clean_data/
    ├── <taxon>_1_clean.paired.fq.gz
    ├── <taxon>_2_clean.paired.fq.gz
    ├── <taxon>_1_clean.unpaired.fq.gz
    ├── <taxon>_2_clean.unpaired.fq.gz
    ├── <taxon>_clean.single.fq.gz
    ...
<taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
Files with compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
<taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
Files with compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
<taxon>_clean.single.fq.gz
File with compressed cleaned sequencing data (single-end type) in fq.gz format (this file will be used for downstream analysis).
<NGS_dataset> -> 03-My_clean_data
A directory containing user-provided cleaned sequencing data or sequencing data cleaned from user-provided raw data.
<NGS_dataset>/
└── 03-My_clean_data/
    ├── <taxon>_1_clean.paired.fq.gz
    ├── <taxon>_2_clean.paired.fq.gz
    ├── <taxon>_1_clean.unpaired.fq.gz
    ├── <taxon>_2_clean.unpaired.fq.gz
    ├── <taxon>_clean.single.fq.gz
    ...
<taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
Files with user-provided compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
<taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
Files with user-provided compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
<taxon>_clean.single.fq.gz
File with user-provided compressed cleaned sequencing data (single-end type) in fq.gz format (this file will be used for downstream analysis).
Stage2 output
01-Assembled_data
A directory containing assembled sequence data produced by the hybpiper assemble command in HybPiper.
01-Assembled_data/
├── Assembled_data_namelist.txt
├── Old_assembled_data_namelist_<current_time>.log
├── <taxon>/
...
Assembled_data_namelist.txt
A file containing the sample names used as input to run the hybpiper assemble command.
Old_assembled_data_namelist_<current_time>.log
A file containing the previous sample names used as input to run the hybpiper assemble command.
<taxon>
More details can be found here.
02-All_paralogs
A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper, the filtered paralogs, and their paralog heatmaps and related statistical results.
02-All_paralogs/
├── 01-Original_paralogs
├── 02-Original_paralog_reports_and_heatmap
├── 03-Filtered_paralogs
└── 04-Filtered_paralog_reports_and_heatmap
02-All_paralogs -> 01-Original_paralogs
A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper.
02-All_paralogs/
└── 01-Original_paralogs/
    └── <locus_name>_paralogs_all.fasta
<locus_name>_paralogs_all.fasta: FASTA files for each locus, containing all putative paralogs recovered by the hybpiper paralog_retriever command.
02-All_paralogs -> 02-Original_paralog_reports_and_heatmap
A directory containing all original reports and heatmaps.
02-All_paralogs/
└── 02-Original_paralog_reports_and_heatmap/
    ├── Original_paralog_heatmap.png
    ├── Original_paralog_report.tsv
    ├── Original_recovered_seqs_length.tsv
    └── Original_recovery_heatmap.html
Original_paralog_heatmap.png
A heatmap image file in PNG format, depicting the number of original putative paralog sequences for each locus/sample.
Original_paralog_report.tsv
A TSV file recording the number of original putative paralog sequences for each locus/sample.
Original_recovered_seqs_length.tsv
A TSV file recording the length of original recovered sequences for each locus/sample.
Original_recovery_heatmap.html
An interactive HTML file for visualizing target locus recovery across all original paralogs (including both single-copy and multi-copy genes).
Here is a recovery heatmap example file you can play with: it shows a recovery result of the Angiosperms353 (Johnson et al., 2019) loci from 10 Elaeagnaceae species in our example dataset.
- The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
- The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.
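The "longest sequence is retained" rule can be sketched for a single sample like this (toy data; the actual heatmap is generated by HybSuite's plotting script):

```shell
# Toy paralog FASTA: two copies recovered for one sample at one locus.
cat > 4757_paralogs_all.fasta <<'EOF'
>taxonA.0
ACGTACGT
>taxonA.1
ACGTACGTACGTACGT
EOF

# Keep only the longest sequence (single sample, single-line sequences assumed).
awk '/^>/ {h=$0; next}
     length($0) > max {max=length($0); bh=h; bs=$0}
     END {print bh; print bs}' 4757_paralogs_all.fasta > longest.fasta
```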
Now, let's explore this interactive HTML file to get a better feel for the results!
- Choose the button “Sort by” as “Descending” to sort samples and loci on the heatmap from high to low recovery.
- Click on the “Plus” (+) and “Minus” (-) icons in the upper right corner to zoom in and out of the heatmap.
- Click on the “AutoScale” icon in the upper right corner to auto-scale the heatmap.
- Click the "Camera" icon in the upper right corner to download the current heatmap view as a PNG file.
Tip
- If some samples recover very few or no loci, we recommend replacing their data sources or increasing the value of -seqs_min_locus_coverage to exclude these low-quality samples from downstream analyses.
02-All_paralogs -> 03-Filtered_paralogs
A directory containing all filtered putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper.
02-All_paralogs/
└── 03-Filtered_paralogs/
    └── <locus_name>_paralogs_all.fasta
02-All_paralogs -> 04-Filtered_paralog_reports_and_heatmap
A directory containing all filtered reports and heatmaps.
02-All_paralogs/
└── 04-Filtered_paralog_reports_and_heatmap/
    ├── Filtered_paralog_heatmap.png
    ├── Filtered_paralog_report.tsv
    ├── Filtered_recovered_seqs_length.tsv
    └── Filtered_recovery_heatmap.html
Filtered_paralog_heatmap.png
A heatmap image file in PNG format, depicting the number of filtered putative paralog sequences for each locus/sample.
Filtered_paralog_report.tsv
A TSV file recording the number of filtered putative paralog sequences for each locus/sample.
Filtered_recovered_seqs_length.tsv
A TSV file recording the length of filtered recovered sequences for each locus/sample.
Filtered_recovery_heatmap.html
An interactive HTML file for visualizing target locus recovery across all filtered paralogs (including both single-copy and multi-copy genes).
The layout is identical to that of Original_recovery_heatmap.html, but it reflects the occupancy of filtered sequences rather than the original ones.
Stage3 output
03-Paralog_handling
03-Paralog_handling/
├── HRS/ (optional)
├── RLWP/ (optional)
├── ParaGone/ (optional)
└── PhyloPyPruner/ (optional)
03-Paralog_handling -> HRS
A directory containing original and filtered HRS sequences, including the recovery heatmap and filtering reports.
03-Paralog_handling/
└── HRS/
    ├── 01-Original_HRS_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_HRS_sequences_reports_and_heatmap
    │   ├── Original_HRS_heatmap.png
    │   └── Original_HRS_seq_lengths.tsv
    ├── 03-Filtered_HRS_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_HRS_sequences_reports_and_heatmap
        ├── Filtered_HRS_heatmap.png
        ├── Filtered_HRS_seq_lengths.tsv
        ├── Removed_HRS_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_HRS_sequences
<locus_name>.FNA
Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as HRS sequences in the following).
Notes:
- In the HybSuite pipeline, supercontigs are automatically retrieved, including introns and exons, for downstream analysis. HybSuite doesn’t support retrieving only introns or exons.
- Since the downstream analysis requires DNA sequences, only DNA sequences can be retrieved; protein sequences are not supported for the next stage.
02-Original_HRS_sequences_reports_and_heatmap
Original_HRS_heatmap.png
A heatmap image file in PNG format, depicting the length of the original HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
Original_HRS_seq_lengths.tsv
A TSV file recording each original HRS sequence's length in bp, along with its length ratios relative to the maximum and mean lengths of each locus's sequences in the target file.
03-Filtered_HRS_sequences
04-Filtered_HRS_sequences_reports_and_heatmap
Filtered_HRS_heatmap.png
A heatmap image file in PNG format, depicting the length of the filtered HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
Filtered_HRS_seq_lengths.tsv
A TSV file recording each filtered HRS sequence's length in bp, along with its length ratios relative to the maximum and mean lengths of each locus's sequences in the target file.
Removed_HRS_seqs_with_low_length_info.tsv
A TSV file recording the HRS sequences with low bp length/length ratio that have been filtered out from the dataset.
Removed_samples_with_low_locus_coverage_info.tsv
A TSV file recording the samples with low locus coverage that have been filtered out from the dataset.
Removed_loci_with_low_sample_coverage_info.tsv
A TSV file recording the loci with low sample coverage that have been filtered out from the dataset.
03-Paralog_handling -> RLWP
A directory containing original and filtered RLWP sequences, including the recovery heatmap and filtering reports.
03-Paralog_handling/
└── RLWP/
    ├── 01-Original_RLWP_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_RLWP_sequences_reports_and_heatmap
    │   ├── Original_RLWP_heatmap.png
    │   └── Original_RLWP_seq_lengths.tsv
    ├── 03-Filtered_RLWP_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_RLWP_sequences_reports_and_heatmap
        ├── Filtered_RLWP_heatmap.png
        ├── Filtered_RLWP_seq_lengths.tsv
        ├── Removed_RLWP_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_RLWP_sequences
<locus_name>.FNA
Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as RLWP sequences in the following).
Notes:
- In the HybSuite pipeline, supercontigs are automatically retrieved, including introns and exons, for downstream analysis. HybSuite doesn’t support retrieving only introns or exons.
- Since the downstream analysis requires DNA sequences, only DNA sequences can be retrieved; protein sequences are not supported for the next stage.
02-Original_RLWP_sequences_reports_and_heatmap
Original_RLWP_heatmap.png
A heatmap image file in PNG format, depicting the length of the original RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
Original_RLWP_seq_lengths.tsv
A TSV file recording each original RLWP sequence's length in bp, along with its length ratios relative to the maximum and mean lengths of each locus's sequences in the target file.
03-Filtered_RLWP_sequences
04-Filtered_RLWP_sequences_reports_and_heatmap
Filtered_RLWP_heatmap.png
A heatmap image file in PNG format, depicting the length of the filtered RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
Filtered_RLWP_seq_lengths.tsv
A TSV file recording each filtered RLWP sequence's length in bp, along with its length ratios relative to the maximum and mean lengths of each locus's sequences in the target file.
Removed_RLWP_seqs_with_low_length_info.tsv
A TSV file recording the RLWP sequences with low bp length/length ratio that have been filtered out from the dataset.
Removed_samples_with_low_locus_coverage_info.tsv
A TSV file recording the samples with low locus coverage that have been filtered out from the dataset.
Removed_loci_with_low_sample_coverage_info.tsv
A TSV file recording the loci with low sample coverage that have been filtered out from the dataset.
03-Paralog_handling -> ParaGone
03-Paralog_handling/
└── ParaGone/
    ├── 00_logs_and_reports
    ...
    ├── 28_RT_final_alignments_trimmed
    └── HybSuite_1to1_final_alignments
- From 00_logs_and_reports to 28_RT_final_alignments_trimmed: more details about these output folders can be found on this wiki page of ParaGone.
- If the user specifies the -paragone_keep_files option in HybSuite as TRUE, the intermediate folders from 01_input_paralog_fasta to 22_RT_stripped_names will be kept; if FALSE, they will be removed.
- HybSuite_1to1_final_alignments: a directory containing orthology group alignments produced via the 1to1 algorithm, retrieved from the results produced by ParaGone.
03-Paralog_handling -> PhyloPyPruner
03-Paralog_handling/
└── PhyloPyPruner/
    ├── Input
    ├── Output_LS
    ├── Output_MI
    ├── Output_MO
    ├── Output_RT
    └── Output_1to1
Input
A directory containing the trimmed alignments of each locus and their gene trees (input files for running PhyloPyPruner).
<locus_name>_paralogs_all.aln.trimmed.fasta
The trimmed alignment of locus <locus_name> from 02-All_paralogs/03-Filtered_paralogs/<locus_name>_paralogs_all.fasta, generated by MAFFT and trimAl.
<locus_name>_paralogs_all.aln.trimmed.fasta.tre
The gene tree of locus <locus_name>, constructed by FastTree.
Output_<PH>
A directory containing PhyloPyPruner output files for the <PH> algorithm (<PH> includes LS, MI, MO, RT, 1to1; more details can be found here).
04-Alignments
A directory containing alignments produced by different paralog-handling methods specified by the user. These alignments are then trimmed and filtered in stage 3.
04-Alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.fasta
<PH>/<ortholog_group_name>.*.aln.fasta
The alignments of ortholog groups inferred via the <PH> paralog-handling method and aligned with MAFFT.
NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.
05-Trimmed_alignments
A directory containing trimmed alignments inferred via different <PH> paralog-handling methods.
05-Trimmed_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
<PH>/<ortholog_group_name>.*.aln.trimmed.fasta
The alignments which are inferred via the <PH> paralog-handling method, aligned using MAFFT, and trimmed via TrimAl or cleaned via HMMCleaner.
NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.
06-Final_alignments
A directory containing final <PH> orthogroup alignments ready for downstream species tree inference.
06-Final_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
<PH>/<ortholog_group_name>.*.aln.trimmed.fasta
Final <PH> orthogroup alignments for downstream species tree inference (stage4).
Stage4 output
07-Concatenated_analysis
A directory containing concatenated analysis results.
07-Concatenated_analysis/
└── <PH>
    ├── 01-Supermatrix
    │   ├── partition.txt
    │   └── <prefix>_<PH>.fasta
    ├── 02-Species_tree
    │   ├── IQTREE
    │   │   └── IQ-TREE*
    │   ├── RAxML
    │   │   └── RAxML*
    │   └── RAxML-NG
    │       └── RAxML-NG*
    ├── <prefix>_<PH>_ModelTest_NG.txt.tree
    ├── <prefix>_<PH>_ModelTest_NG.txt.log
    ├── <prefix>_<PH>_ModelTest_NG.txt.out
    └── <prefix>_<PH>_ModelTest_NG.txt.ckp
07-Concatenated_analysis -> <PH> -> 01-Supermatrix
A directory containing the supermatrix concatenated from <PH> orthogroup alignments and the partition file.
<prefix>_<PH>.fasta
The concatenated supermatrix file for the orthology groups inferred by the <PH> method.
partition.txt
The partition file for concatenation.
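For orientation, a partition file of this kind is typically RAxML-style, mapping each locus to its coordinate range in the supermatrix. A hypothetical two-locus example (names and coordinates are illustrative, not taken from actual HybSuite output):

```
DNA, locus_4757 = 1-1230
DNA, locus_4954 = 1231-2105
```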
07-Concatenated_analysis -> <PH> -> 02-Species_tree
IQ-TREE/
A directory containing IQ-TREE results and final rooted trees (created only when IQ-TREE is applied by setting -sp_tree 1).
IQ-TREE_<prefix>_<PH>.*
IQ-TREE intermediate output files.
IQ-TREE_<prefix>_<PH>.treefile
The tree file with branch lengths and bootstrap values, generated by IQ-TREE.
IQ-TREE_<prefix>_<PH>.rr.tre
The rerooted tree file with branch lengths and bootstrap values from the IQ-TREE results.
RAxML/
A directory containing RAxML results and final rooted trees (created only when RAxML is applied by setting -sp_tree 2).
RAxML_*.<prefix>_<PH>.*
RAxML intermediate output files.
RAxML_<prefix>_<PH>.rr.tre
The rerooted tree file with branch lengths and bootstrap values from the RAxML results.
RAxML-NG/
A directory containing RAxML-NG results and final rooted trees (created only when RAxML-NG is applied by setting -sp_tree 3).
RAxML-NG_<prefix>_<PH>.raxml.*
RAxML-NG intermediate output files.
RAxML-NG_<prefix>_<PH>.rr.tre
The rerooted tree file with branch lengths and bootstrap values from the RAxML-NG results.
07-Concatenated_analysis -> <PH> -> <prefix>_<PH>_ModelTest_NG.txt.*
The output files generated by ModelTest-NG.
08-Coalescent_analysis
A directory containing coalescent analysis results.
08-Coalescent_analysis/
├── <PH>/
│   ├── 01-Gene_trees/
│   ├── 02-Combined_gene_trees/
│   ├── 03-Species_tree/
│   ├── 04-Rerooted_gene_trees/
│   └── 05-PhyParts_PieCharts/
└── ASTRAL-Pro/
    ├── 01-Gene_trees/
    ├── 02-Combined_gene_trees/
    ├── 03-Species_tree/
    └── 04-Rerooted_gene_trees/
08-Coalescent_analysis -> <PH>
A directory containing coalescent-based phylogenetic tree results for a specific dataset generated by the <PH> paralog-handling method.
08-Coalescent_analysis -> <PH> -> 01-Gene_trees
A directory containing gene trees inferred from final <PH> alignments.
<ortholog_group_name>.tre
The gene tree for locus/orthogroup <ortholog_group_name>.
NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, ortholog group names 4757_1 and 4757_2 are inferred from locus 4757.
08-Coalescent_analysis -> <PH> -> 02-Combined_gene_trees
A directory containing combined gene trees generated from <PH> alignments.
Combined_gene_trees.tre
File containing all gene trees combined into a single file.
Combined_gene_trees.tre.collapsed
File containing all gene trees with low-support branches collapsed.
08-Coalescent_analysis -> <PH> -> 03-Species_tree
A directory containing species trees inferred from <PH> alignments.
ASTRAL-IV/
A directory containing the final species tree for the <PH> dataset, generated by ASTRAL-IV.
ASTRAL-IV_<prefix>_<PH>.log
The log file generated by ASTRAL-IV.
ASTRAL-IV_<prefix>_<PH>.tre
The species tree inferred by ASTRAL-IV from the combined gene trees.
ASTRAL-IV_<prefix>_<PH>.bootstrap.tre
The species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III, following the ASTER protocol.
ASTRAL-IV_<prefix>_<PH>.bootstrap.rr.tre
The rerooted species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III.
ASTRAL-III_LPP.log
The ASTRAL-III log file documenting the bootstrapping process performed by ASTRAL-III.
wASTRAL/
A directory containing the final species tree for the <PH> dataset, generated by wASTRAL.
wASTRAL_<prefix>_<PH>.tre
The species tree inferred by wASTRAL from the combined gene trees.
wASTRAL_<prefix>_<PH>.log
The log file generated by wASTRAL.
wASTRAL_<prefix>_<PH>.rr.tre
The rerooted species tree generated by wASTRAL.
08-Coalescent_analysis -> <PH> -> 04-Rerooted_gene_trees
A directory containing rerooted gene trees from the <PH> dataset.
<ortholog_group_name>.rr.tre
The rerooted gene tree for ortholog group <ortholog_group_name> in the <PH> dataset, rerooted using Phyx or the MAD method.
08-Coalescent_analysis -> <PH> -> 05-PhyParts_PieCharts
A directory containing phylogenetic concordance analysis results using rerooted gene trees and species trees.
ASTRAL-IV/
A directory containing ASTRAL-IV species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run ASTRAL-IV by setting -sp_tree 4).
ASTRAL_PhyParts.*
Files containing the PhyParts output (see the PhyParts documentation for more details).
ASTRAL_PhyPartsPieCharts_<prefix>_<PH>.svg
Visualization of concordance and conflict between gene trees and the species tree, generated by our newly developed modified_phypartspiecharts.py script.
wASTRAL/
A directory containing wASTRAL species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run wASTRAL by setting -sp_tree 5).
wASTRAL_<prefix>_<PH>.tre
The species tree inferred by wASTRAL from rerooted <PH> gene trees.
wASTRAL_<prefix>_<PH>_sorted_rr.tre
The final species tree, rerooted with Phyx and sorted with Newick Utilities.
Comprehensive output
hybsuite_logs
A directory containing the comprehensive log file generated by HybSuite.
hybsuite_logs/
└── hybsuite_<current_time>.log
hybsuite_<current_time>.log
The log file produced when running the HybSuite pipeline (running the extension tools will not produce this log file).
hybsuite_checklists
A directory containing checklist files, including species checklists and locus checklists.
hybsuite_checklists/
├── All_Spname_list.txt
├── My_Spname.txt
├── Outgroup.txt
├── Pre-assembled_Spname.txt
├── Public_Spname.txt
├── Public_Spname_SRR.txt
├── Recovered_locus_num_for_samples.tsv
├── Recovered_sample_num_for_loci.tsv
└── Ref_gene_name_list.txt
All_Spname_list.txt
A file containing all sample names in your study.
My_Spname.txt
A file containing the sample names of all user-provided raw data.
Outgroup.txt
A file containing all outgroup taxa specified by the user.
Pre-assembled_Spname.txt
A file containing the names of all pre-assembled samples specified by the user.
Public_Spname.txt
A file containing all sample names whose next-generation sequencing (NGS) raw data were downloaded from NCBI.
Public_Spname_SRR.txt
A file containing all Sequence Read Archive (SRA) IDs used to download NGS raw data from NCBI. These SRA IDs correspond to the sample names listed in Public_Spname.txt.
Recovered_locus_num_for_samples.tsv
A file containing the number of loci recovered by HybPiper for each sample.
Recovered_sample_num_for_loci.tsv
A file containing the number of samples recovered by HybPiper for each locus.
Ref_gene_name_list.txt
A file containing the names of all genes in the target sequences (specified by the -t option).
hybsuite_reports
A directory containing comprehensive statistical summaries of the results generated by the pipeline.
hybsuite_reports/
├── Alignments_stats/
│   ├── <PH>-01_Alignments_stats_AMAS.tsv
│   ├── <PH>-02_Trimmed_alignments_stats_AMAS.tsv
│   ├── <PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
│   ├── <PH>-04_Removed_alignments_with_length_less_than_4.txt
│   ├── <PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
│   ├── <PH>-06_Final_alignments_list.txt
│   └── <PH>-07_Final_alignments_stats_AMAS.tsv
└── Supermatrix_stats/
    └── <PH>-Supermatrix_stats_AMAS.tsv
hybsuite_reports -> Alignments_stats
<PH>-01_Alignments_stats_AMAS.tsv
Summary table of orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).
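As an illustration of the kind of per-alignment statistics such a summary table holds, a minimal sketch in plain Python is shown below. AMAS.py itself reports many more columns (e.g. variable sites, GC content); the function and example data here are invented for demonstration.

```python
# Sketch of basic per-alignment statistics of the kind an AMAS summary
# table reports. Gaps ('-') and N characters are counted as missing data.

def alignment_stats(seqs):
    """seqs: dict of {sample: aligned sequence}; sequences have equal length."""
    n_taxa = len(seqs)
    length = len(next(iter(seqs.values())))
    cells = n_taxa * length
    missing = sum(s.count("-") + s.upper().count("N") for s in seqs.values())
    return {
        "taxa": n_taxa,
        "length": length,
        "missing_percent": round(100 * missing / cells, 2),
    }

aln = {"sampleA": "ATG-", "sampleB": "ATGC", "sampleC": "A-GC"}
print(alignment_stats(aln))  # {'taxa': 3, 'length': 4, 'missing_percent': 16.67}
```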
<PH>-02_Trimmed_alignments_stats_AMAS.tsv
Summary table of trimmed orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).
<PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
List of alignments with no parsimony-informative sites; these alignments are excluded from downstream species tree inference.
<PH>-04_Removed_alignments_with_length_less_than_4.txt
List of alignments shorter than 4 bp; these alignments are excluded from downstream species tree inference.
<PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
List of alignments with fewer than 5 samples; these alignments are excluded from downstream species tree inference.
<PH>-06_Final_alignments_list.txt
List of final <PH> alignments selected for downstream species tree inference.
<PH>-07_Final_alignments_stats_AMAS.tsv
Summary table of final <PH> alignments for downstream species tree inference (generated by AMAS.py).
Filtering process: alignments with no parsimony-informative sites, shorter than 4 bp, or with fewer than 5 samples are removed before species tree inference.
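The three filtering criteria above can be sketched as follows. This is illustrative only, not HybSuite's code; the thresholds follow the report file names (no parsimony-informative sites, length below 4 bp, fewer than 5 samples), and gaps, N, and ? are treated as missing for simplicity.

```python
# Sketch of the three alignment filters. A site is counted as
# parsimony-informative when at least two character states each occur
# in at least two sequences.

def parsimony_informative_sites(seqs):
    """seqs: dict of {sample: aligned sequence}; returns the PI-site count."""
    count = 0
    for column in zip(*seqs.values()):
        states = {}
        for base in column:
            if base not in "-nN?":  # ignore gaps and missing data
                states[base] = states.get(base, 0) + 1
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count

def passes_filters(seqs):
    """True if the alignment survives all three removal criteria."""
    length = len(next(iter(seqs.values())))
    return (
        parsimony_informative_sites(seqs) > 0  # criterion 03
        and length >= 4                        # criterion 04
        and len(seqs) >= 5                     # criterion 05
    )

aln = {"s1": "AAAA", "s2": "AAAA", "s3": "AATA", "s4": "AATA", "s5": "AAAA"}
print(passes_filters(aln))  # True: 4 bp, 5 samples, 1 parsimony-informative site
```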
hybsuite_reports -> Supermatrix_stats
<PH>-Supermatrix_stats_AMAS.tsv
Summary table of final <PH> supermatrix for downstream concatenation-based species tree inference (generated by AMAS.py).