
Introduction

This page offers a detailed introduction to HybSuite. Feel free to explore!


🧬 Pipeline overview

HybSuite performs end-to-end hybrid capture (Hyb-Seq) phylogenomic analysis from raw reads (Hyb-Seq preferred; compatible with RNA-seq, WGS, and genome skimming data) to phylogenetic trees.

The full pipeline is composed of 4 stages:

HybSuite workflow

  • Stage 1: NGS dataset construction

    • (1) Optionally download public raw reads from NCBI (via SRA Toolkit);
    • (2) Integrate user-provided raw reads (if provided);
    • (3) Trim raw reads (via Trimmomatic);

  • Stage 2: Data assembly and paralog retrieval

    • (1) Assemble target loci and retrieve putative paralogs (via HybPiper);
    • (2) Integrate pre-assembled sequences (if provided);
    • (3) Filter putative paralogs;
    • (4) Plot recovery and paralog heatmaps for the original and filtered sequences;

  • Stage 3: Paralog handling

    • Optionally execute seven paralog-handling methods (HRS, RLWP, LS, MO, MI, RT, 1to1; see our Tutorial) and generate filtered alignments for downstream analysis:
      • HRS:
        (1) Retrieve sequences via the command hybpiper retrieve_sequences in HybPiper;
        (2) Integrate pre-assembled sequences (if provided);
        (3) Filter sequences by length to remove potentially mis-assembled sequences;
        (4) Align sequences (via MAFFT) and trim alignments (via trimAl or HMMCleaner);
        (5) Filter trimmed alignments to generate final alignments.
      • RLWP:
        (1) Retrieve sequences via hybpiper retrieve_sequences in HybPiper;
        (2) Integrate pre-assembled sequences (if provided);
        (3) Filter sequences by length to remove potentially mis-assembled sequences;
        (4) Remove loci with putative paralogs in more than a specified number of samples;
        (5) Align sequences (via MAFFT) and trim alignments (via trimAl or HMMCleaner);
        (6) Filter trimmed alignments to generate final alignments.
      • PhyloPypruner pipeline (LS, MI, MO, RT, 1to1):
        (1) Align (via MAFFT) and trim (via trimAl or HMMCleaner) all putative paralogs;
        (2) Infer gene trees for all putative paralogs;
        (3) Obtain orthogroup alignments using tree-based orthology inference algorithms (via PhyloPypruner);
        (4) Realign (via MAFFT) and trim (via trimAl or HMMCleaner) the orthogroup alignments;
        (5) Filter trimmed orthogroup alignments to generate final alignments.
      • ParaGone pipeline (MI, MO, RT, 1to1):
        (1) Use the directory containing all putative paralogs generated in Stage 2 as input;
        (2) Obtain orthogroup alignments using tree-based orthology inference algorithms via ParaGone;
        (3) Filter trimmed orthogroup alignments to generate final alignments.

  • Stage 4: Species tree inference
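For orientation, the stages above chain standard tools. A rough manual equivalent for one sample via the HRS route might look like the following (an illustrative sketch only: the accession, file names, and trimming thresholds are placeholders, and the exact commands HybSuite runs are archived automatically per run):

```shell
# --- Stage 1: fetch and trim reads (placeholder accession) ---
prefetch SRR12569928
fasterq-dump SRR12569928 --split-files -O raw_reads/
gzip raw_reads/SRR12569928_1.fastq raw_reads/SRR12569928_2.fastq
trimmomatic PE \
  raw_reads/SRR12569928_1.fastq.gz raw_reads/SRR12569928_2.fastq.gz \
  trimmed/sample_1P.fastq.gz trimmed/sample_1U.fastq.gz \
  trimmed/sample_2P.fastq.gz trimmed/sample_2U.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:36

# --- Stage 2: assemble target loci with HybPiper ---
hybpiper assemble -t_dna target.fasta \
  -r trimmed/sample_1P.fastq.gz trimmed/sample_2P.fastq.gz \
  --prefix Sample1

# --- Stage 3 (HRS): retrieve, align, and trim each locus ---
hybpiper retrieve_sequences dna -t_dna target.fasta --sample_names namelist.txt
for fna in *.FNA; do
  locus=$(basename "$fna" .FNA)          # e.g. 4471.FNA -> 4471
  mafft --auto "$fna" > "${locus}.aln.fasta"
  trimal -in "${locus}.aln.fasta" -out "${locus}.trimmed.fasta" -automated1
done
```

HybSuite wraps all of this (plus filtering, paralog handling, and tree inference) behind a single command, as shown in the example dataset section.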


✨ Features

🔄 Transparent: Full workflow visibility with real-time progress logging at each step
📝 Reproducible: Automatically archives exact software commands & parameters for every run
🧩 Modular: Execute individual stages or the complete pipeline in one command
⚡ Flexible: 7 paralog handling methods & 5+ species tree inference options
🚀 Scalable: Built-in parallelization for large-scale phylogenomic datasets


πŸ† Advantages

1. End-to-end pipeline from reads to trees

  • Processes data from raw reads to phylogenetic trees with single-command workflows
  • Supports both full pipeline execution and modular stage-specific operations
  • Minimizes manual intervention while maintaining flexibility

2. Unique functionality of integrating pre-assembled sequences

  • Allows integrating pre-assembled locus sequences into the working dataset (see the tutorial for details).

3. Customizable sequences filtering strategies

  • Dual filtering strategies for both loci and samples
  • Configurable thresholds for read depth, missing data, and sequence quality
  • Enables dataset optimization for different study goals

4. Advanced paralog-handling methods

  • Implements 7 distinct methods for paralog detection and processing
  • Includes both similarity-based and topology-based approaches
  • Improves orthology assessment accuracy

5. Multi-method phylogenetic tree inference

6. Integrated visualization tools

  • plot_paralog_heatmap.py (see usage guide below);
  • plot_recovery_heatmap.py (see usage guide below);
  • modified_phypartspiecharts.py (see its usage guide).

7. High-Performance Computing

  • Parallel processing across samples and loci (option -process), which can significantly improve computational efficiency.

1 - Changelog

1.1.7 January, 2026

  • New function: Steps control in stage 4
    • Added support for controlling individual steps within Stage 4, allowing users to selectively run specific steps (e.g., gene tree inference, alignment trimming, species tree inference) rather than executing the entire stage in one go. See the documentation for details.

1.1.6 September, 2025

  • New dependency: Plotly
    Plotly has been integrated into the new script plot_recovery_heatmap_v2.py to generate an interactive HTML heatmap visualizing target locus recovery in Stage 2. This heatmap provides useful guidance for parameter selection in Stages 3-4.

  • TreeShrink integrated into Stage 3
    TreeShrink has been incorporated into Stage 3 as an optional processing step. Users can enable it by setting the option -run_treeshrink to TRUE. TreeShrink removes genes with excessively long branches and is available for all seven paralog-handling pipelines except ParaGone, as TreeShrink is already implemented within the ParaGone pipeline.

1.1.5 September, 2025 - MAJOR UPDATE!

  • Pipeline restructuring

    • Stage consolidation: Combined previous stages 3 and 4, simplifying the pipeline from 5 to 4 stages.
    • Stagewise execution: Added flexible stage-by-stage execution capability.
  • Enhanced functionality
    Gene tree inference:

    Alignment trimming:

    • New alternative: Integrated HMMCleaner
    • Maintained trimAl as the default setting.

    Species tree inference:

    • Added ASTRAL-pro3 for multi-copy gene aware coalescent analysis.

1.1.3-1.1.4 August, 2025

Fixed some bugs in stage control. These versions have been abandoned.

1.1.2 August, 2025

Integrated ASTRAL-IV into the pipeline stage 4.

Usage Update:

New dependency:

1.1.1 August, 2025

Fixed some common bugs.

2 - Example dataset

This page provides detailed instructions on how to run the example dataset included with HybSuite.


1. Download the example dataset

If you have downloaded the HybSuite source package, a directory named example_dataset is already included. In this case, no additional download is required.

Alternatively, you can download the repository on your server using:

git clone https://github.com/Yuxuanliu-HZAU/HybSuite
cd HybSuite/example_dataset

2. Configure inputs

The directory example_dataset contains two folders, Angiosperms353 and Arabidopsis100, each containing all inputs needed to run the HybSuite pipeline on the corresponding example dataset from our analyses.

Example dataset 1: Angiosperms353

Angiosperms353/
├── Input_list.txt
├── Target_file_Angiosperms353.fasta
└── Input_sequences/
    ├── Elaeagnus_pungens.fasta
    └── Hippophae_rhamnoides.fasta

Input_list.txt

This file documents taxon names (first column) and their corresponding sequence sources (second column), separated by tabs:

Elaeagnus_angustifolia	SRR12569928
Elaeagnus_bambusetorum	SRR27547630
Elaeagnus_henryi	SRR15533155
Elaeagnus_macrophylla	SRR23618743
Elaeagnus_mollis	SRR30566771
Hippophae_neurocarpa	SRR17549374
Hippophae_salicifolia	ERR7621632
Hippophae_tibetana	SRR17549370
Shepherdia_argentea	ERR7621633
Barbeya_oleoides	SRR16214280	Outgroup
Elaeagnus_oldhamii	A
Elaeagnus_pungens	B
Hippophae_rhamnoides	B
  • Identifiers prefixed with SRR or ERR: public raw NGS data for the corresponding samples (first column), to be downloaded by the HybSuite pipeline.
  • Identifier A: user-provided raw NGS data for the corresponding samples (first column), to be supplied as input to the HybSuite pipeline.
  • Identifier B: user-provided pre-assembled sequences for the corresponding samples (first column), to be supplied as input to the HybSuite pipeline.
  • Identifier Outgroup: specifies the outgroup taxon.
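As a quick sanity check before a run, the source types in such a list can be tallied with awk (a sketch; it assumes the tab-separated layout shown above):

```shell
# Tally source types in Input_list.txt (columns are tab-separated)
awk -F'\t' '
$2 ~ /^(SRR|ERR)/ { public++ }   # public accessions to download
$2 == "A"         { raw++ }      # user-provided raw reads
$2 == "B"         { asm++ }      # user-provided pre-assembled sequences
$3 == "Outgroup"  { out++ }      # outgroup taxa
END { printf "public=%d user_raw=%d pre_assembled=%d outgroups=%d\n", public, raw, asm, out }
' Input_list.txt
# For the Angiosperms353 list above: public=10 user_raw=1 pre_assembled=2 outgroups=1
```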

Input_sequences

This directory should contain either user-provided raw reads, pre-assembled sequences, or both, according to the information provided in Input_list.txt.

  • type1: user-provided raw reads
    In our analysis, only the data for the species Elaeagnus oldhamii are user-provided raw reads, which need to be downloaded separately prior to running the HybSuite pipeline. After downloading the raw data, convert them to FASTQ.GZ format and move them to this directory. The two paired-end files should be named:
Elaeagnus_oldhamii_1.fastq.gz
Elaeagnus_oldhamii_2.fastq.gz
  • type2: pre-assembled sequences
    Two taxa with pre-assembled sequences are provided: Elaeagnus_pungens and Hippophae_rhamnoides (the taxa marked with identifier B in the sample list file Input_list.txt). Their FASTA files are named Elaeagnus_pungens.fasta and Hippophae_rhamnoides.fasta, respectively (<taxon>.fasta).

Target_file_Angiosperms353.fasta

This file is the target sequence file for Angiosperms353.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the line:

>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
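In shell terms, the gene name can be recovered from such a header with parameter expansion, for example:

```shell
# The gene name is whatever follows the final hyphen in the header
header=">Elaeagnus-pungens-4471"
gene="${header##*-}"   # strip the longest prefix ending in "-"
echo "$gene"           # prints: 4471
```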

Example dataset 2: Arabidopsis100

Arabidopsis100/
├── Input_list.txt
└── Target_file_Arabidopsis_thaliana100.fasta

Input_list.txt

This file documents taxon names (first column) and their corresponding sequence sources (second column), separated by tabs:

Elaeagnus_angustifolia	SRR26705271
Elaeagnus_bambusetorum	SRR26757993
Elaeagnus_henryi	SRR26705270
Elaeagnus_macrophylla	SRR26753865
Elaeagnus_mollis	SRR26758012
Elaeagnus_oldhamii	SRR26705501
Elaeagnus_pungens	SRR26705285
Hippophae_neurocarpa	SRR26705287
Hippophae_rhamnoides	SRR26756417
Hippophae_salicifolia	SRR26705274
Hippophae_tibetana	SRR26704952
Shepherdia_argentea	SRR26756705
Barbeya_oleoides	SRR26756183	Outgroup

Target_file_Arabidopsis_thaliana100.fasta

This file is the target sequence file for Arabidopsis100.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the line:

>Locus-1
MAFRRVLTTVILFCYLLISSQSIEFKNSQKPHKIQGPIKTIVVVVMENRSFDHILGWLKSTRPEIDGLTGKESNPLNVSDPNSKKIFVSDDAVFVDMDPGHSFQAIREQIFGSNDTSGDPKMNGFAQQSESMEPGMAKNVMSGFKPEVLPVYTELANEFGVFDRWFASVPTSTQPNRFYVHSATSHGCSSNVKKDLVKGFPQKTIFDSLDENGLSFGIYYQNIPATFFFKSLRRLKHLVKFHSYALKFKLDAKLGKLPNYSVVEQRYFDIDLFPANDDHPSHDVAAGQRFVKEVYETLRSSPQWKEMALLITYDEHGGFYDHVPTPVKGVPNPDGIIGPDPFYFGFDRLGVRVPTFLISPWIEKGTVIHEPEGPTPHSQFEHSSIPATVKKLFNLKSHFLTKRDAWAGTFEKYFRIRDSPRQDCPEKLPEVKLSLRPWGAKEDSKLSEFQVELIQLASQLVGDHLLNSYPDIGKNMTVSEGNKYAEDAVQKFLEAGMAALEAGADENTIVTMRPSLTTRTSPSEGTNKYIGSY*
>Locus-2
MSDQQLETEINFWGETSEEDYFNLKGIIGSKSFFTSPRGLNLFTRSWLPSSSSPPRGLIFMVHGYGNDVSWTFQSTPIFLAQMGFACFALDIEGHGRSDGVRAYVPSVDLVVDDIISFFNSIKQNPKFQGLPRFLFGESMGGAICLLIQFADPLGFDGAVLVAPMCKISDKVRPKWPVDQFLIMISRFLPTWAIVPTEDLLEKSIKVEEKKPIAKRNPMRYNEKPRLGTVMELLRVTDYLGKKLKDVSIPFIIVHGSADAVTDPEVSRELYEHAKSKDKTLKIYDGMMHSMLFGEPDDNIEIVRKDIVSWLNDRCGGDKTKTQV*
>Locus-3
MSSRENPSGICKSIPKLISSFVDTFVDYSVSGIFLPQDPSSQNEILQTRFEKPERLVAIGDLHGDLEKSREAFKIAGLIDSSDRWTGGSTMVVQVGDVLDRGGEELKILYFLEKLKREAERAGGKILTMNGNHEIMNIEGDFRYVTKKGLEEFQIWADWYCLGNKMKTLCSGLDKPKDPYEGIPMSFPRMRADCFEGIRARIAALRPDGPIAKRFLTKNQTVAVVGDSVFVHGGLLAEHIEYGLERINEEVRGWINGFKGGRYAPAYCRGGNSVVWLRKFSEEMAHKCDCAALEHALSTIPGVKRMIMGHTIQDAGINGVCNDKAIRIDVGMSKGCADGLPEVLEIRRDSGVRIVTSNPLYKENLYSHVAPDSKTGLGLLVPVPKQVEVKA*

3. Run the pipeline

First of all, change your working directory to the downloaded example dataset directory:

cd <the path to the directory of "example_dataset">

Next, create output directories (or specify an existing directory when running HybSuite):

mkdir -p ./Angiosperms353/Output ./Arabidopsis100/Output

After setting the right working directory, run the following commands for the two example datasets:

Angiosperms353

hybsuite full_pipeline \
-input_list ./Angiosperms353/Input_list.txt \
-input_data ./Angiosperms353/Input_sequences \
-output_dir ./Angiosperms353/Output \
-nt 5 \
-process 5 \
-t ./Angiosperms353/Target_file_Angiosperms353.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14

Arabidopsis100

hybsuite full_pipeline \
-input_list ./Arabidopsis100/Input_list.txt \
-output_dir ./Arabidopsis100/Output \
-nt 5 \
-process 5 \
-t ./Arabidopsis100/Target_file_Arabidopsis_thaliana100.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14

3 - Extension tools

Apart from the main pipeline, we also offer some extension tools for results visualization and statistical analysis. This page tells you how to use them!


1. plot_paralog_heatmap.py

(1) Overview

Paralogs are homologous genes that arise from gene duplication within the same species. plot_paralog_heatmap.py is a Python script for analyzing and visualizing paralog distribution patterns across samples and loci. As part of the HybSuite toolkit, it processes an unaligned FASTA file for each locus to:

  • Count paralogous sequences for each sample at each locus and generate a TSV format data table recording the counts.
  • Generate heatmaps to visualize paralog distribution patterns, with auto-adjusted dimensions based on sample and locus counts.
  • Support multi-threading for improved efficiency.

(2) Dependencies

  • If you've already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install pandas seaborn matplotlib numpy
  • Key differences from HybPiper's paralog heatmap plotting function paralog_retriever:

    • (1) Input format

      • hybpiper paralog_retriever: Requires the sample folders generated by hybpiper assemble.
      • This script: Accepts a folder containing FASTA files, making it applicable to a wider range of datasets (e.g., those including pre-assembled sequences).
    • (2) Visualization features

      • Customizable heatmap colors.
      • Option to display the paralog count for each sample–locus cell directly on the heatmap.
      • Generates higher-quality, more publication-ready figures.

(3) Input file requirements

This script processes the following input files:

  • (1) Input Directory (required, specified by -i/--input_dir):
    A directory containing multiple FASTA files; each FASTA file represents a locus and contains sequences from multiple samples. Files should be named <locus_name>.fasta, <locus_name>_paralogs.fasta, or <locus_name>_paralogs_all.fasta.

For example:

>Sample1
ATGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sample2
ATGCTAGCTATCGATCGATCGATCGATCGA
>Sample3
ATGCTCGATCGATCGATCGATCGATCGATC

NOTE:
If a sample has only a single sequence at a locus, that sequence is treated as the ortholog. If a sample has multiple sequences at a locus, those sequences are putative paralogs.
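This per-sample tally is what the script computes for each locus file; it can be reproduced for a single file with standard Unix tools (a sketch; locus.fasta is a placeholder name, and the sed step also strips HybPiper-style .main/.0 suffixes and description fields):

```shell
# Count sequences per sample; 1 = putative ortholog, >=2 = putative paralogs
grep '^>' locus.fasta | sed 's/^>//; s/[. ].*//' | sort | uniq -c
```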

The paralog sequence FASTA files retrieved by hybpiper paralog_retriever can be used directly as input files for this script. For example, the 5942_paralogs_all.fasta in our dataset:

>Elaeagnus_angustifolia.main NODE_1_length_1285_cov_267.250828,Elaeagnus-pungens-Elaeagnus_pungens,0,201,98.51,(1),312,915
ACCTTCCTTGACCTCAAGACCGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATCATTACAGACCTGCAGAACAAACGCCGTGGCACCGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGCGGAAAGAAGCCCTACCCACAAAAGAAAACGGGTAGGGCTCGACAGGGGTCCAAGAGAACTCCACTCCGGCCAGGTGGAGGAGTCGTCTTTGGGCCTAAGCCCAGAGATTGGAGCATCAAGATCAATAGAAAGGAGAAAAGGTTGGCGATTTCGACAGCAATGTCTAGTGCAGCTGCGAATACGATCGTGGTGGAGGATTTTTGGGACAATATGGATAAACCCAGGACGAAGGATTTTATAGCTGCTATGAAGAGGTGGGGTTTAAATCCACCGGGAGAGAAAGCTATGTTTATGATGGACGAAATTTCGGATAACGTGAGGCTTTCAAGTAGAAATATTCCGAAAGTGAAGGTTTTGACCCCGAGGACTTTGAATTTGTTTGATATTTTAAATGCGGATAAGTTGGTGCTTACCCCTGCTGCTGTGGATTACTTGAATGGACGTTATGGTGTTAATTATGAGGGTGAGAGT
>Elaeagnus_angustifolia.0 NODE_2_length_1266_cov_276.023549,Elaeagnus-pungens-Elaeagnus_pungens,2,201,89.95,(-1),301,898
CTTGATCTCAAAACAGCACCACCCGAAACTGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCGTGGGACTGCCTCAACCCTAACCCGTGGTGAGGTTAGAGGTGGTGGGAAAAAACCTTACCCACAAAAGAAAACGGGTCGGGCCCGACAAGGGTCCAAGAGAACTCCACTCCGTCCCGGTGGAGGTGTCGTTTTTGGTCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTGGTGGAGGATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGAGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCGGAAGAGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCTAATTATGAGGGGGAGAAT
>Elaeagnus_macrophylla.main NODE_2_length_1269_cov_183.823826,Elaeagnus-pungens-Elaeagnus_pungens,0,201,88.06,(1),297,900
ACATACCTCGATCTCAAAACAGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCCTGGGACTGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGTGGAAAGAAACCTTACCCACAAAAGAAAACGGGTCGCGCTCGACAAGGGTCAAAAAGAACTCCACTCCGTCCAGGTGGAGGTGTCGTTTTTGGGCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTAGTGGAGCATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGGGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCTGAAGTGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCAAATTATGAGGGTGAGAAT
...
  • (2) Species List File (optional, specified by --species_list)
    If you want to analyze only specific species, you can provide a species list file:
Sample1
Sample2
Sample3

(4) Basic usage

python plot_paralog_heatmap.py \
    -i <input_dir> \
    -opr <counts.tsv> \
    -oph <heatmap.png> \
    [options] ...
  • Required parameters in basic usage
    • -i/--input_dir
      Directory containing FASTA files with paralogous sequences (formatted as <locus_name>_paralogs.fasta)
    • At least one output option must be specified:
      • -opr, --output_paralog_report
        Generate a TSV file containing paralog counts
      • -oph, --output_paralog_heatmap
        Generate a heatmap visualization (format determined by file extension)

(5) Full parameters:

General options:
  -t THREADS, --threads THREADS
                        Number of threads to use for processing (default: 1)
  --species_list SPECIES_LIST
                        File containing list of species to include in the analysis (one species per line)
  --output_species_list OUTPUT_SPECIES_LIST
                        Output file to save the list of processed species

Heatmap customization options:  
  --dpi DPI             DPI (dots per inch) for output image (default: 300)
  --fig_length FIG_LENGTH
                        Figure length in inches (default: auto-calculated based on number of loci)
  --fig_height FIG_HEIGHT
                        Figure height in inches (default: auto-calculated based on number of species)
  --sample_font SAMPLE_FONT
                        Font size for sample labels in points (default: 10)
  --gene_font GENE_FONT
                        Font size for gene labels in points (default: 10)
  --hide_xlabels        Hide x-axis labels (locus names)
  --hide_ylabels        Hide y-axis labels (sample names)
  --no_grid             Do not show grid lines in heatmap
  --color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient (default: black)
  --show_values         Show numerical values in heatmap cells (only for values >= 2)
  --grid_color GRID_COLOR
                        Color for grid lines in heatmap (default: grey)
  --add_markers         Add visual markers in cells (dots for 1s, diagonal lines for 0s)

(6) Output examples

  • TSV Report (paralog_counts.tsv):
    (use -opr, --output_paralog_report to specify the output filename and directory.)
Species   gene1   gene2   gene3
Sample1   2       1       0
Sample2   1       1       1
Sample3   0       2       1
  • Heatmap Visualization
    The heatmap uses color intensity to represent the number of recovered sequences:
    • White: No sequences (0)
    • Light color: Single sequence (1), representing single copy orthologs.
    • Darker color: Multiple sequences (≥2), representing putative paralogs.
  • For example, the following figure is the default output for our test dataset Arabidopsis100.
    You can find it in <output_dir>/02-All_paralogs/04-Filtered_paralog_reports_and_heatmap/Filtered_paralog_heatmap.png after running HybSuite by following our guide. In the HybSuite Stage 2 pipeline, this script is applied to generate the heatmaps for the original and filtered paralogs; by default, --show_values is used to display the number of recovered sequences at each locus for each sample.

test_dataset-paralog_heatmap_default

  • When running this script manually, the recovered sequence counts won't be displayed unless you use the --show_values option:

test_dataset-paralog_heatmap_no_value

  • To clearly show the type of sequence at each locus for each sample, it is advisable to combine --add_markers with --show_values to add markers and numbers to the figure:
    • X: No sequences (0)
    • Β·: Single sequence (1), representing single copy orthologs.
    • <number>: Multiple sequences (≥2), representing putative paralogs.
python plot_paralog_heatmap.py ... --add_markers --show_values

test_dataset-paralog_heatmap_markers

  • Besides, you can use --color to switch to a different color theme:
python plot_paralog_heatmap.py ... --color red 

test_dataset-paralog_heatmap_red_color

python plot_paralog_heatmap.py ... --color blue 

test_dataset-paralog_heatmap_blue_color

NOTE: Our script provides nine color themes: black (default), red, blue, purple, green, orange, yellow, brown, and pink.

(7) Use cases

  1. Paralog Distribution Analysis: Identify which species and genes tend to have more paralogs
  2. Data Quality Assessment: Evaluate completeness of sequencing and assembly
  3. Evolutionary Analysis: Study gene duplication events across different species
  4. Data Visualization: Generate high-quality visualizations for papers or reports

(8) Tips and tricks

  • For large datasets, increase the thread count (-t parameter) to speed up processing
  • If sample names are long, use a smaller sample font size (--sample_font)
  • This script can adjust the image dimensions automatically (which might be best for visualization in many cases). You can also use --fig_length and --fig_height to manually adjust your image.
  • Use --show_values to display specific paralog counts directly on the heatmap (for counts ≥2)

2. plot_recovery_heatmap_v2.py

(1) Overview

plot_recovery_heatmap_v2.py visualizes sequence recovery across samples and loci. It generates heatmaps showing the percentage of sequence length recovered for each gene in each sample, relative to reference sequences.

It highlights:

  • Well-recovered loci across samples
  • Samples with poor overall recovery
  • Recovery patterns indicating systematic biases

Key features:

  • Calculates sequence lengths from FASTA files
  • Generates comprehensive sequence length tables in TSV format
  • Creates customizable heatmaps with multiple color schemes
  • Supports comparison against mean or maximum reference lengths
  • Offers extensive visualization options including value display and grid customization
  • Provides multi-threading support for processing large datasets
  • Supports interactive Plotly HTML output

(2) Dependencies

  • If you've already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas seaborn matplotlib numpy plotly

The script requires:

  • Python 3.6 or higher
  • Biopython (for sequence parsing)
  • NumPy and Pandas (for data manipulation)
  • Matplotlib and Seaborn (for visualization)

The script automatically checks for required packages and will provide clear error messages if any are missing.

(3) Input file requirements

Required input:

  1. Directory of FASTA files - Each file should contain sequences for a single locus across multiple samples
  • Supported file extensions: .fna, .fasta, .fa.
  • Each sequence header should start with the species/sample name (e.g., >species_name rest_of_header).
  2. Target sequence file - A single FASTA file containing reference sequences for all target loci
  • Each sequence ID should include the locus name at the end, separated by a hyphen (e.g., >ref-locusnameA).
  • The script automatically detects whether references are nucleotide or protein sequences.

(4) Optional input:

  • Species list file - A simple text file with one species name per line (-s <FILE>)
  • If not provided, species names will be automatically extracted from the FASTA files

(5) Basic usage

The basic command requires only the input directory and reference file:

python plot_recovery_heatmap_v2.py -i /path/to/fasta_files -r /path/to/reference.fasta \
--output_heatmap /path/to/recovery_heatmap.html

This will:

  1. Calculate sequence lengths for each sample and locus.
  2. Generate a seq_lengths.tsv file in the current directory (use --output_seq_lengths to change the output path).
  3. Create an interactive heatmap named recovery_heatmap.html in the directory /path/to/.

(6) Example

If you have finished running our pipeline, open the directory 02-All_paralogs/03-Filtered_paralogs, which is one of the output directories for our example dataset Angiosperms353. You will find many FASTA files in it:

4471_paralogs_all.fasta
4527_paralogs_all.fasta
4691_paralogs_all.fasta
4724_paralogs_all.fasta
...
  • Locus names: 4471, 4527, 4691, 4724
  • Filename suffix: _paralogs_all
  • File extension: .fasta

In that case:

  • use the option -i/--input_dir to specify the path to this directory;
  • use the option -r to specify the path to the Angiosperms353 target file (locus names in the target file must correspond to those in your input directory);
  • use the option --filename_suffix to specify the filename suffix _paralogs_all, so that the script can extract the locus name from each filename.
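In shell terms, the locus-name extraction performed via --filename_suffix amounts to the following (illustrative):

```shell
f="4471_paralogs_all.fasta"
base="${f%.fasta}"             # drop the file extension  -> 4471_paralogs_all
locus="${base%_paralogs_all}"  # drop the filename suffix -> 4471
echo "$locus"                  # prints: 4471
```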

Run:

cd /path/to/02-All_paralogs/03-Filtered_paralogs/
python plot_recovery_heatmap_v2.py -i . -r /path/to/Target_file_Angiosperms353.fasta --filename_suffix "_paralogs_all" -output_heatmap ./recovery_heatmap.html -gw 0

Then you can obtain an interactive heatmap HTML file ./recovery_heatmap.html:

(interactive recovery heatmap)
  • The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
  • The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.
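The "keep the longest sequence per sample" reduction described above can be sketched with awk (an illustration, not the script's actual code; it assumes the sample name precedes the first "." or space in each header, as in HybPiper-style output):

```shell
# Longest recovered sequence length per sample in one locus file
awk '
/^>/ { if (s != "") if (l > max[s]) max[s] = l
       split(substr($1, 2), a, "."); s = a[1]; l = 0; next }
     { l += length($0) }                       # accumulate wrapped lines
END  { if (s != "") if (l > max[s]) max[s] = l
       for (k in max) print k, max[k] }
' 4471_paralogs_all.fasta
```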

Now, let's explore this interactive HTML file!

  • Set "Sort by" to "Descending" to sort samples and loci on the heatmap from high to low recovery.
  • Click the "Plus" (+) and "Minus" (-) icons in the upper right corner to zoom in and out of the heatmap.
  • Click the "AutoScale" icon in the upper right corner to auto-scale the heatmap.
  • Click the "Camera" (📷) icon in the upper right corner to download the current heatmap view as a PNG file.

(7) Full parameters

usage: plot_recovery_heatmap_v2.py [-h] -i INPUT_DIR -r REFERENCE [-s SPECIES_LIST] [--filename_suffix FILENAME_SUFFIX]
                                   [--output_species_list OUTPUT_SPECIES_LIST] [--output_heatmap OUTPUT_HEATMAP]
                                   [--output_seq_lengths OUTPUT_SEQ_LENGTHS] [-t THREADS]
                                   [--color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}] [--title TITLE]
                                   [--use_max] [--xlabel XLABEL] [--ylabel YLABEL] [-gw GRID_WIDTH]

plot_recovery_heatmap_v2.py - A visualization tool in HybSuite

This script is a component of the HybSuite toolkit, designed for visualizing sequence recovery 
rates across different taxa and loci. It generates heatmaps that display the percentage of 
sequence length recovered for each gene in each taxon, relative to either the average or 
maximum length of reference sequences.

Key features:
1. Calculates sequence lengths and generates a seq_lengths.tsv file
2. Calculates the percentage length recovered relative to reference sequences
3. Generates customizable heatmaps showing recovery rates
4. Supports both average and maximum reference length comparisons
5. Offers flexible visualization options

Both the seq_lengths.tsv file and heatmap generation are optional outputs.
Part of HybSuite

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIR, --input_dir INPUT_DIR
                        Directory containing FASTA files for each locus
  -r REFERENCE, --ref REFERENCE
                        Target sequence file (FASTA format)
  -s SPECIES_LIST, --species_list SPECIES_LIST
                        File containing list of species names (one per line). If not provided, species names will be extracted from FASTA files
  --filename_suffix FILENAME_SUFFIX
                        Suffix(es) to remove from input FASTA filenames to get locus names. Multiple suffixes can be separated by commas. Example: "_paralogs_all". If not specified, the input filenames will be recognized as loci names.
  --output_species_list OUTPUT_SPECIES_LIST, -osp OUTPUT_SPECIES_LIST
                        Output file for extracted species list (when species_list is not provided)
  --output_heatmap OUTPUT_HEATMAP, -oh OUTPUT_HEATMAP
                        Output path and filename for the heatmap (default: recovery_heatmap.html). Should end with .html extension.
  --output_seq_lengths OUTPUT_SEQ_LENGTHS, -osl OUTPUT_SEQ_LENGTHS
                        Output file for sequence lengths (TSV format). If not provided, sequence lengths will be written to seq_lengths.tsv in current directory
  -t THREADS, --threads THREADS
                        Number of threads to use (default: 1)
  --color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}
                        Color scheme for the HTML heatmap (default: blue). Available options: viridis, magma, inferno, plasma, cividis, turbo, purple, blue, green, black
  --title TITLE         Main title of the heatmap (default: "Percentage length recovery for each gene")
  --use_max             Use maximum length instead of average length from reference sequences
  --xlabel XLABEL       X-axis label (default: "Locus")
  --ylabel YLABEL       Y-axis label (default: "Sample")
  -gw GRID_WIDTH, --grid_width GRID_WIDTH
                        The value of grid width of the heatmap, recommended to set as "0" when the locus number is huge (default: 0.5)

3. RLWP.py

(1) Overview

RLWP.py (Remove Loci With Paralogs) is a Python script within the HybSuite toolkit designed to filter out genetic loci with excessive paralog occurrences. Paralogs are gene copies that arise from gene duplication events and can complicate phylogenetic analyses. This tool identifies and removes loci that exceed a user-defined threshold of paralog presence across samples, helping to improve the quality of downstream analyses by maintaining only single-copy orthologous markers.
Key features:

  • Filters loci based on paralog occurrence statistics
  • Supports multi-threading for improved performance
  • Provides detailed logging and reporting
  • Offers in-place filtering or non-destructive output to a separate directory
  • Works with various FASTA file extensions (.fa, .fasta, .fna, and their uppercase variants)

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:
conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

RLWP.py requires two main types of input:

  1. Sequence file directory: A directory containing nucleotide sequence files in FASTA format (.fa, .fasta, .fna, .fas or their uppercase variants). Each file should represent one locus.
  2. Paralog statistics file: A tab-separated values (TSV) file containing paralog counts per sample for each locus.
  • 0: No sequence was recovered for this locus in the sample.
  • 1: The only recovered sequence for this locus in the sample is a single-copy orthologous sequence.
  • more than 1: Putative paralogs exist for this locus in the sample.

This file should have:

  • Sample IDs in the first column
  • Locus names as column headers
  • Values representing the number of paralogs found for each sample-locus combination

NOTE: This file can be generated by running plot_paralog_heatmap.py (with the option -oph, see here)

Example paralog statistics file format:

Sample  Locus1  Locus2  Locus3
sample1 1       2       1
sample2 1       1       3
sample3 2       1       1
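Given the example table above, the removal rule can be sketched in plain Python (a hypothetical illustration, not the script's code; following the -s/--samples_threshold description below, a locus is dropped once at least that many samples show more than one copy):

```python
# counts[sample][locus] -> number of recovered copies (the paralog statistics table above)
counts = {
    "sample1": {"Locus1": 1, "Locus2": 2, "Locus3": 1},
    "sample2": {"Locus1": 1, "Locus2": 1, "Locus3": 3},
    "sample3": {"Locus1": 2, "Locus2": 1, "Locus3": 1},
}

def loci_to_remove(counts, samples_threshold):
    loci = next(iter(counts.values())).keys()
    removed = []
    for locus in loci:
        # samples whose count exceeds 1 carry putative paralogs at this locus
        n_paralog_samples = sum(1 for s in counts if counts[s][locus] > 1)
        if n_paralog_samples >= samples_threshold:
            removed.append(locus)
    return removed

print(loci_to_remove(counts, 2))  # [] : no locus has paralogs in 2+ samples
print(loci_to_remove(counts, 1))  # ['Locus1', 'Locus2', 'Locus3']
```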

(4) Basic usage

  • Remove loci in which at least 2 samples show putative paralogs:
python RLWP.py -i input_directory -p paralog_statistics.tsv -s 2 -or deletion_report.tsv

Required parameters:

  • -i, --input_dir: Directory containing sequence files
  • -p, --paralog_heatmap: Path to paralog statistics file (TSV format)
  • -s, --samples_threshold: Minimum number of samples with paralogs to trigger locus removal
  • -or, --output_report: Path for saving the deletion report

Optional parameters:

  • -o, --output_dir: Optional directory to output filtered files (preserves originals)
  • -t, --threads: Number of threads to use for parallel processing (default: 1)

(5) Output examples

Tips and tricks

  • Choosing the Right Threshold: Start with a conservative threshold (e.g., 5% of your total samples) and adjust based on your dataset characteristics.
  • Non-destructive Workflow: Use the -o option to create a filtered copy of your data without modifying original files:
python RLWP.py -i <input_directory> -p paralog_reports.tsv -s 3 -o filtered_data
  • Performance Optimization: For large datasets, increase thread count to speed up processing


4. filter_seqs_by_length.py

(1) Overview

  • filter_seqs_by_length.py is a Python script within the HybSuite package that filters DNA sequences based on length criteria. It allows filtering sequences using absolute minimum length or relative length compared to reference sequences.
  • It is particularly useful for removing short, potentially truncated sequences before downstream analyses, helping to ensure high-quality datasets for phylogenomic analysis.
  • It processes multiple FASTA files in parallel; the reference file can contain either DNA or protein sequences.
  • It provides detailed logging and reporting of filtered sequences, making it easy to track what was removed and why.
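The length criteria combine as a logical AND, which can be sketched as a predicate (a hypothetical helper mirroring the --min_length, --mean_length_ratio, and --max_length_ratio options described below; ratio filters set to 0 are treated as disabled):

```python
def keep_sequence(seq_len, ref_mean, ref_max,
                  min_length=0, mean_length_ratio=0.0, max_length_ratio=0.0):
    """A sequence is kept only if it passes every enabled criterion."""
    if seq_len < min_length:
        return False
    if mean_length_ratio and seq_len < mean_length_ratio * ref_mean:
        return False
    if max_length_ratio and seq_len < max_length_ratio * ref_max:
        return False
    return True

# Locus with mean reference length 800 bp and max reference length 1000 bp:
print(keep_sequence(500, 800, 1000, mean_length_ratio=0.7))  # False (500 < 560)
# Matches the combined example (--min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5):
print(keep_sequence(600, 800, 1000, min_length=200,
                    mean_length_ratio=0.6, max_length_ratio=0.5))  # True
```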

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

filter_seqs_by_length.py requires two types of input:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive)
  • Each file is assumed to contain sequences from a single locus
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName)
  2. Reference sequences
  • The reference file must follow the same format as required by the HybSuite main program (check here).

(4) Basic usage

  • Filter sequences by absolute minimum length:
python filter_seqs_by_length.py -i input_directory --min_length 300
  • Filter using reference sequences according to a length ratio (relative to the mean or maximum length of each locus in the reference file):
python filter_seqs_by_length.py -i input_directory -r reference.fasta --mean_length_ratio 0.7
  • Combine multiple criteria to filter:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5
  • Save output to a different directory rather than modifying the original files:
python filter_seqs_by_length.py -i input_directory --output_dir filtered_sequences
  • Generate report of removed sequences:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --mean_length_ratio 0.7 --output_report removed_seqs.tsv

(5) Output examples

Filtered FASTA Files

Filtered sequences are written either:

  1. To the original files (overwriting them)
  2. To a new directory if --output_dir is specified

Removed Sequences Report

When using --output_report, a TSV file is created with details of removed sequences:

File	Sequence_ID	Length	Mean_Length_Ratio	Max_Length_Ratio
gene1.fasta	Sample1_gene1	125	0.435	0.391
gene2.fasta	Sample3_gene2	78	0.213	0.185

(6) Use cases

Cleaning Assembled Data

  • Remove truncated sequences resulting from poorly assembled or incompletely captured loci:
python filter_seqs_by_length.py -i captured_exons -r reference.fasta \
    --mean_length_ratio 0.3 --output_report removed_sequences.tsv

Tips and Tricks

  • Locus Identification: Ensure filenames match locus IDs in reference sequences (part after last hyphen).
  • Preserving Originals: Always use --output_dir when testing filtering parameters to avoid overwriting original files.
  • Speed Optimization: Set -t to match available CPU cores for maximum performance.
  • Multiple Filters: Combining --min_length with ratio-based filters creates more stringent filtering.
  • Protein Sequences: The script automatically detects the type of reference file (DNA/protein) and adjusts length calculations appropriately.

5. filter_seqs_by_sample_and_locus_coverage.py

(1) Overview

filter_seqs_by_sample_and_locus_coverage.py is a Python script designed to remove samples with low locus coverage and loci with low sample coverage from a phylogenomic dataset, based on user-defined thresholds.

This tool can:

  • Filter samples and loci based on minimum coverage thresholds
  • Generate reports of removed samples and loci
  • Process files in parallel to improve performance
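The two coverage measures can be sketched as follows (a hypothetical helper, not the script's code; it assumes presence/absence of each sample at each locus has already been read from the FASTA files):

```python
def coverage_tables(presence):
    """presence[locus] = set of sample IDs with a sequence at that locus.
    Returns (sample coverage per locus, locus coverage per sample)."""
    samples = sorted({s for members in presence.values() for s in members})
    n_samples, n_loci = len(samples), len(presence)
    # fraction of samples represented at each locus
    sample_cov_per_locus = {loc: len(m) / n_samples for loc, m in presence.items()}
    # fraction of loci recovered for each sample
    locus_cov_per_sample = {
        s: sum(1 for m in presence.values() if s in m) / n_loci for s in samples
    }
    return sample_cov_per_locus, locus_cov_per_sample

s_cov, l_cov = coverage_tables({"Locus1": {"A", "B"}, "Locus2": {"A"}})
print(s_cov)  # {'Locus1': 1.0, 'Locus2': 0.5}
print(l_cov)  # {'A': 1.0, 'B': 0.5}
```

Loci and samples falling below --min_sample_coverage and --min_locus_coverage, respectively, would then be dropped.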

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

The tool processes FASTA files in a directory with the following requirements:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive).
  • Each file is assumed to contain sequences from a single locus.
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName).

(4) Basic usage

python filter_seqs_by_sample_and_locus_coverage.py -i input_directory --min_sample_coverage 0.5 --min_locus_coverage 0.7

Required parameters

  • -i, --input: Directory containing FASTA files.
  • --min_sample_coverage: Minimum sample coverage ratio (0-1) for each locus (default: 0.0)
  • --min_locus_coverage: Minimum locus coverage ratio (0-1) for each sample (default: 0.0)

Optional parameters

  • -o, --output_dir: Directory for filtered sequences (if not specified, original files are modified)
  • -t, --threads: Number of threads to use (default: 1)
  • --removed_samples_info: Output TSV file for removed samples coverage information
  • --removed_loci_info: Output TSV file for removed loci coverage information

(5) Output examples

  • Example of removed_samples.tsv (specified by --removed_samples_info):
Sample  Locus_Coverage
Species1    0.45
Species2    0.32
  • Example of removed_loci.tsv (specified by --removed_loci_info):
Locus   Sample_Coverage
Locus1  0.38
Locus2  0.42

(6) Use cases

  • The following command removes loci present in fewer than 60% of samples and samples that recover fewer than 50% of loci.
python filter_seqs_by_sample_and_locus_coverage.py \
-i assembled_loci/ -o filtered_loci/ \
--min_sample_coverage 0.6 --min_locus_coverage 0.5 \
--removed_samples_info removed_samples.tsv --removed_loci_info removed_loci.tsv

(7) Tips and tricks

  • Choosing Coverage Thresholds: Start with lower thresholds (e.g., 0.3-0.5) and gradually increase until you achieve the desired balance between data completeness and taxon/locus sampling.
  • Preserving Original Files: Always use the -o option to output to a new directory when experimenting with different thresholds.
  • Removing Problematic Samples Only: Set --min_locus_coverage without setting --min_sample_coverage to filter out only low-coverage samples while keeping all loci.
  • Tracking Removed Data: Always use the --removed_samples_info and --removed_loci_info options to keep records of what was filtered out for documentation and troubleshooting.
  • Performance Optimization: Use the -t option with a value close to your CPU core count for faster processing of large datasets.

6. modified_phypartspiecharts.py

(1) Overview

(2) Basic usage

The basic usage of modified_phypartspiecharts.py is nearly the same as that of phypartspiecharts.py. The only difference is that users must use --output to specify the path and filename of the output visualization results when running modified_phypartspiecharts.py, rather than --svg_name as in phypartspiecharts.py.

python modified_phypartspiecharts.py \
species_tree phyparts_prefix num_genes ...
  • Required Parameters in basic usage
    • species_tree: Path to species tree file (Newick format)
    • phyparts_prefix: Prefix of PhyParts output files
    • num_genes: Total number of gene trees

(3) Extended functionality

Compared to the original version, modified_phypartspiecharts.py offers the following extended functionality:

a. Running Efficiency Control

  • Multithreading Support: Use -nt/--threads <NUM> for multithreaded processing, significantly improving speed for large datasets

b. Output Files Control

  • Support for SVG and PDF format outputs: use --output <output_file> and give your output file the extension .pdf or .svg.

  • Additional Statistical Output: use the --stat parameter to export detailed node statistics to TSV files.
    The detailed node statistics table generated by --stat contains the following columns:

    • Node: Node ID
    • Support(blue): Number of genes supporting the species tree
    • TopConflict(green): Number of genes with main conflict
    • OtherConflict(red): Number of genes with other conflict
    • NoSignal(gray): Number of genes with no signal
    • Support/Total_Ratio: Ratio of supporting to conflicting genes

    The file also includes the average ratios for internal nodes including Support/Total Ratio, Conflict/Total Ratio, NoSignal/Total Ratio, Support/Signal_Ratio and Conflict/Signal_Ratio.

    For example:

Node	Support(blue)	TopConflict(green)	OtherConflict(red)	NoSignal(gray)	Support/Total_Ratio	Conflict/Total_Ratio	NoSignal/Total_Ratio	Support/Signal_Ratio	Conflict/Signal_Ratio
0	85	0	0	157	0.3512	0.0000	0.6488	1.0000	0.0000
1	202	7	31	2	0.8347	0.1570	0.0083	0.8417	0.1583
2	135	41	59	7	0.5579	0.4132	0.0289	0.5745	0.4255
3	205	6	20	11	0.8471	0.1074	0.0455	0.8874	0.1126
4	123	22	91	6	0.5083	0.4669	0.0248	0.5212	0.4788
5	164	33	36	9	0.6777	0.2851	0.0372	0.7039	0.2961
6	190	18	29	5	0.7851	0.1942	0.0207	0.8017	0.1983
7	91	35	112	4	0.3760	0.6074	0.0165	0.3824	0.6176
8	129	18	86	9	0.5331	0.4298	0.0372	0.5536	0.4464

Average ratios for internal nodes only:
Support/Total Ratio: 0.6079
Conflict/Total Ratio: 0.2957
NoSignal/Total Ratio: 0.0964
Support/Signal Ratio: 0.6963
Conflict/Signal Ratio: 0.3037
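The ratio columns can be reproduced from the four count columns. As a plain-Python sketch (not the script's code), using node 1 from the example table:

```python
def node_ratios(support, top_conflict, other_conflict, no_signal):
    """Compute the per-node ratio columns of the --stat table."""
    total = support + top_conflict + other_conflict + no_signal
    signal = total - no_signal              # genes carrying any signal
    conflict = top_conflict + other_conflict
    return {
        "Support/Total_Ratio": round(support / total, 4),
        "Conflict/Total_Ratio": round(conflict / total, 4),
        "NoSignal/Total_Ratio": round(no_signal / total, 4),
        "Support/Signal_Ratio": round(support / signal, 4),
        "Conflict/Signal_Ratio": round(conflict / signal, 4),
    }

# Node 1: 202 supporting, 7 + 31 conflicting, 2 with no signal
print(node_ratios(202, 7, 31, 2))
# {'Support/Total_Ratio': 0.8347, 'Conflict/Total_Ratio': 0.157,
#  'NoSignal/Total_Ratio': 0.0083, 'Support/Signal_Ratio': 0.8417,
#  'Conflict/Signal_Ratio': 0.1583}
```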

c. Extended Visualization Functionality

  • Support for controlling whether taxonomic names use italic font: use --no_italic
  • Support for flexible number displayed on branches: use --show_num_mode <NUM>
  • Support for controlling tree branch width display: use --line_width <NUM>
  • Support for controlling pie chart size: use --pie_size <NUM>
  • Support for controlling leaf node label size: use --tip_size <NUM>
  • Support for controlling node number label size: use --number_size <NUM>
  • Support for circular, cladogram, and phylogram display types: use --tree_type <circle|cladogram|phylo>

See the GitHub wiki page for modified_phypartspiecharts for more details.

(4) Full Options

options:
  -h, --help            show this help message and exit
  --taxon_subst TAXON_SUBST
                        Comma-delimited file to translate tip names.
  --output OUTPUT       Output filename with extension (.svg or .pdf)
  --output_node_tree    Generate an additional tree file with '_nodes' suffix showing:
                        - All node identifiers in the tree
                        - No pie charts
                        - No numerical annotations
  --no_ladderize        Don't ladderize tree
  --to_csv              Export data to CSV
  --tree_type {circle,cladogram,phylo}
                        Tree visualization type (circle, cladogram, or phylo; default: cladogram)
  --line_width VT_LINE_WIDTH
                        Width of tree branches (default: 0)
  --no_italic           Display species names in normal font style (default: italic)
  --tip_size TIP_SIZE_FACTOR
                        Scale factor for tip label font size (default: 1.0)
  --number_size NUMBER_SIZE_FACTOR
                        Scale factor for gene tree count font size (default: 1.0)
  --show_num_mode SHOW_NUM_MODE
                        Control what numbers to show on branches (specify 0-2 digits):
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes (red+green/(blue+red+green))
                        9: Original node support values from the input tree
                        Example: --show_num_mode 0  (hide all numbers)
                                --show_num_mode 1  (show only support number)
                                --show_num_mode 12 (default, show support and conflict numbers)
                                --show_num_mode 47 (show support number and support/conflict ratio)
                                --show_num_mode 9  (show original node support values)
  --pie_size PIE_SIZE_FACTOR
                        Scale factor for pie chart size (default: 1.0)
  --stat STAT_OUTPUT    Output file path for node statistics (TSV format)
  -nt THREADS, --threads THREADS
                        Number of threads to use (default: 1)

Citation

If using this tool, please cite:


7. Fasta_formatter.py

(1) Overview

Fasta_formatter.py is a Python script for reformatting FASTA sequences into either interleaved (60 characters per line) or single-line format. It supports multi-threading for faster processing of large files.
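The core rewrapping logic can be sketched as follows (a hypothetical, minimal reimplementation for illustration; the real script adds file handling and multi-threading):

```python
def reformat_fasta(text, width=60):
    """Rewrap a FASTA string: width=60 gives interleaved output (--inter);
    width=None joins each record's sequence onto one line (--single)."""
    out, seq = [], []

    def flush():
        # emit the buffered sequence for the previous record, if any
        if seq:
            joined = "".join(seq)
            if width is None:
                out.append(joined)
            else:
                out.extend(joined[i:i + width] for i in range(0, len(joined), width))
            seq.clear()

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line)
        elif line.strip():
            seq.append(line.strip())
    flush()
    return "\n".join(out) + "\n"

print(reformat_fasta(">s1\nACGT\nACGT\n", width=None), end="")  # >s1 / ACGTACGT
```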

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, no extra installation is needed: the script only uses pathlib and concurrent.futures, which are part of the Python 3 standard library.

(3) Basic usage

python Fasta_formatter.py \
    -i <input_fasta> \
    -o <output_fasta> \
    --inter|--single \
    [-nt <threads>]

Required parameters:

  • -i/--input: Input FASTA file
  • -o/--output: Output file path
  • --inter: Output in interleaved format (60 characters per line)
  • --single: Output in single-line format

Optional parameters:

  • -nt/--threads: Number of threads (default: 1)

(4) Example

Convert a FASTA file to interleaved format with 4 threads:

python Fasta_formatter.py -i sequences.fasta -o sequences_formatted.fasta --inter -nt 4

Convert to single-line format:

python Fasta_formatter.py -i sequences.fasta -o sequences_singleline.fasta --single

(5) Use cases

  • Data preprocessing: Prepare sequences for downstream analysis tools that require specific FASTA formatting
  • File standardization: Convert between different FASTA formats for compatibility
  • Large file processing: Use multi-threading to speed up formatting of big datasets

8. rename_assembled_data.py

(1) Overview

rename_assembled_data.py is a Python script in the HybSuite package designed to handle batch renaming operations for assembled data directories produced in HybSuite stage 2 and their contents. It provides a comprehensive solution for renaming directories, files, and file contents while maintaining data integrity and consistency.

Key features:

  • Recursively renames directory structures, file names, and file contents
  • Handles potential naming conflicts safely
  • Supports both single directory and batch renaming operations

(2) Basic usage

Single directory renaming

To rename a single directory and all its contents:

python rename_assembled_data.py -i /path/to/directory -n new_name

Parameters:

  • -i, --input: Path to the directory you want to rename
  • -n, --new_name: The new name to replace the old name

Example:

python rename_assembled_data.py -i ./sample_001 -n sample_002

Batch renaming

For batch renaming multiple directories, create a tab-delimited file containing old and new name pairs:

python rename_assembled_data.py --rename_list path/to/rename_list.txt -p /path/to/parent_directory

Parameters:

  • --rename_list: Path to a tab-delimited file containing old_name and new_name pairs
  • -p, --parent_dir: Path to the parent directory containing all the folders to be processed

The rename list file should be formatted as follows (tab-delimited):

old_name1   new_name1
old_name2   new_name2
old_name3   new_name3
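Parsing the rename list into old-to-new pairs can be sketched as follows (a hypothetical helper, not the script's code; blank lines are skipped and malformed lines are rejected):

```python
def parse_rename_list(text):
    """Parse a tab-delimited rename list into {old_name: new_name}."""
    pairs = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected two tab-separated names")
        old, new = (f.strip() for f in fields)
        pairs[old] = new
    return pairs

print(parse_rename_list("sample_001\tsample_002"))  # {'sample_001': 'sample_002'}
```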

Example:

python rename_assembled_data.py --rename_list rename_pairs.txt -p ./assembled_data

The script will:

  1. Process each directory listed in the rename file
  2. Rename all matching files and directories within each target directory
  3. Replace matching content within files
  4. Provide a summary of successful and failed operations

Note: The script includes safety checks and will skip operations that might cause conflicts or data loss.

4 - Full parameters

This page provides the full options and parameters for each subcommand, along with additional explanations and links where necessary. The available subcommands can be viewed using the command:

hybsuite -h/--help

or:

bash <the path to HybSuite.sh> -h/--help

Parameters for running hybsuite stage1

Stage 1 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage1 ...

Mandatory arguments: -input_list -input_data (required when including user-provided data) -output_dir

Essential arguments: -sra_maxsize -NGS_dir -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types. (Default: None)
  -input_data <DIR>     The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -NGS_dir <DIR>        The output directory containing raw and cleaned reads files (Default: <output_dir>/NGS_dataset).
                        Notes: Pre-existing cleaned reads will skip reads trimming steps.

General arguments:
  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_fasterq_dump <INT>               
                        fasterq-dump threads. (Default: 1)
  -nt_pigz <INT>        pigz compression threads. (Default: 1)
  -nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of public data downloads and raw read trimming jobs to run concurrently. (Default: 1)
                        "all" means running all samples concurrently. (be cautious when setting this option)
   
  === Public raw reads downloading control ===
  -rm_sra <TRUE/FALSE>  Whether to remove SRA files after conversion. (Default: TRUE)
  -download_format <fastq|fastq_gz>
                        Downloaded data format. (Default: fastq_gz)

  === Logfile Control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
  === SRAToolkit ===
  -sra_maxsize <NUM>    The maximum size of sra files to download. (Default: 20GB)

  === Trimmomatic ===
  -trimmomatic_leading_quality <3-40> 
                        Leading base quality cutoff. (Default: 3)
  -trimmomatic_trailing_quality <3-40> 
                        Trailing base quality cutoff. (Default: 3)
  -trimmomatic_min_length <36-100>     
                        Minimum read length. (Default: 36)
  -trimmomatic_sliding_window_s <4-10> 
                        Sliding window size. (Default: 4)
  -trimmomatic_sliding_window_q <15-30>
                        Window average quality. (Default: 15)

Command example:
  # Run HybSuite stage1 with 1 thread and 1 parallel processing
  $ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./
  
  # Run HybSuite stage1 with 5 threads and 5 parallel processing
  $ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./ -nt 5 -process 5

Parameters for running hybsuite stage2

Stage 2 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage2 ...

Mandatory arguments: -input_list -NGS_dir -t -output_dir

Essential arguments: -eas_dir -seqs_min_length -seqs_min_sample_coverage -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1. (Default: None)
  -input_data <DIR>     The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
  -NGS_dir <DIR>        The directory containing NGS raw and cleaned reads files (generated in stage 1). (Default: ./NGS_dir)
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
                        Note: Pre-existing data in this directory will skip redundant assembly steps.

General arguments:
  === Putative paralogs filtering control ===
  -seqs_min_length <INT>         
                        Minimum sequence length for filtered paralogs. (Default: 0)
                        Putative paralogs shorter than this value will be filtered.             
  -seqs_mean_length_ratio <0-1>    
                        Minimum sequence length ratio relative to the mean length per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the mean length will be filtered.
  -seqs_max_length_ratio <0-1>              
                        Minimum length ratio relative to the longest value per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the maximum length will be filtered.
  -seqs_min_sample_coverage <0-1>           
                        Minimum sample coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, HRS and RLWP sequences in stage 3, loci lower than this sample coverage will be filtered.
  -seqs_min_locus_coverage <0-1>            
                        Minimum locus coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, taxa (samples) with lower than this locus coverage will be filtered.
  
  === Heatmap control ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)
  
  === Threads control ===
  -nt <INT|AUTO>        Global thread setting (Default: 1)
  -nt_hybpiper <INT>    HybPiper threads (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of data assembly jobs ('hybpiper assemble') to run concurrently (Default: 1)
                        "all" means running all samples concurrently (be cautious when setting this option)
  
  === Logfile control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
   === HybPiper ===
  -hybpiper_mapping_tool <blast|diamond>     
                        The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
  -hybpiper_check_chimeric_contigs	<FALSE|TRUE>
                        Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: TRUE)
  -hybpiper_cov_cutoff <INT>
                        Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
                        Increasing this value may improve locus recovery efficiency but can introduce errors.

Command example:
  # Run HybSuite stage2 with filtering paralog sequences
  $ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5 -seqs_min_length 100 -seqs_min_sample_coverage 0.1

  # Run HybSuite stage2 without filtering paralog sequences
  $ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5

Parameters for running hybsuite stage3

Stage 3 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage3 ...

Mandatory arguments: -input_list -eas_dir -paralogs_dir -t -output_dir

Essential arguments: -PH -prefix -run_phyparts -aln_min_sample -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
  -input_data <DIR>     The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences (generated in stage 2). (Default: <output_dir>/01-Assembled_data)
  -paralogs_dir <DIR>   The directory containing all paralog sequences generated in stage 2 or by users themselves. (Default: None)
                        It's advisable to set this parameter as '<output_dir>/02-All_paralogs/03-Filtered_paralogs'.
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Paralog handling control ===
  -PH <1-7|a|b|all>     Paralog handling methods to execute: (one or more of them can be chosen)
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)
  
  === Sequences and alignments filtering control ===
  -seqs_min_length <INT>
                        Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
                        HRS and RLWP sequences shorter than this value will be removed.
  -aln_min_length <INT> 
                        Minimum sequence bp length for filtering HRS and RLWP final alignments. (Default: 4)
  -aln_min_sample <INT>
                        Minimum sample number for final alignments. (Default: 0)
                        Final alignments (aligned and trimmed) with sample number below this threshold will be removed.

  === Gene tree builder control ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)

  === Alignments trimming tool control ===
  -trim_tool <1/2>      Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1) 

  === Nucleotide ambiguity character replacement ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_paragone <INT>    ParaGone threads. (Default: 1)
  -nt_phylopypruner <INT>              
                        PhyloPyPruner threads. (Default: 1)
  -nt_mafft <INT>       MAFFT threads. (Default: 1)
  -nt_amas <INT>        AMAS.py threads. (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads. (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads. (Default: 1)
  -nt_fasttree <INT>    FastTree threads. (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of sequence alignment, alignment trimming, and gene tree inference jobs to run concurrently. (Default: 1)
                        "all" means running all samples concurrently. (be cautious when setting this option)

  === Heatmap control ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)

Arguments for integrated tools :
  === PhyloPyPruner ===
  -pp_min_taxa <INT>    Minimum taxa per cluster. (Default: 4)
  -pp_min_support <0-1> Minimum support value. (Default: 0=auto)
  -pp_trim_lb <INT>     Trim long branches. (Default: 5)

  === ParaGone ===  
  -paragone_pool <INT>  Parallel alignment tasks. (Default: 1, same as the option '-process')
  -treeshrink_q_value <0-1>        
                        TreeShrink quantile threshold (Default: 0.05)
  -paragone_cutoff_value <FLOAT>       
                        Branch length cutoff (Default: 0.3)
  -paragone_minimum_taxa <INT>         
                        Minimum taxa per alignment (Default: 4)
  -paragone_min_tips <INT>             
                        Minimum tips per tree (Default: 4)
  
  === HybPiper ===
  -hybpiper_skip_chimeric_genes <FALSE|TRUE>
                        Whether to skip recovering sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
  -hybpiper_retrieved_seqs_type <dna|intron|supercontig>
                        The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3. (Default: dna, i.e., extract coding sequences)

  === MAFFT ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)

Command examples:
  # Run HybSuite stage3 without alignments filtering
  $ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5
  
  # Run HybSuite stage3 with alignments filtering
  $ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353 -PH 124567b -output_dir ./ -nt 5 -process 5 -aln_min_length 100 -aln_min_sample 0.1

Parameters for running hybsuite stage4

Stage 4 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage4 ...

Mandatory arguments: -input_list -aln_dir -output_dir

Essential arguments: -PH -sp_tree -prefix -run_phyparts -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
  -aln_dir <DIR>        The directory containing the orthogroup alignments generated in stage 3. (Default: <output_dir>/06-Final_alignments)
                        It's advisable to keep this parameter as '<output_dir>/06-Final_alignments'.
  -PH <1-7|a|b|all>     Choose alignments generated via paralog handling methods as input:
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1 (one or more of them can be chosen)
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)

Arguments for outputs:
  -output_dir <DIR>     Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Species tree builder control ===
  -sp_tree <1-5|all>    Species tree inference method:
                        1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL
  
  === Steps control ===
  -run_coalescent_step <INT> 
                        Control which coalescent analysis steps to run:
                        1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
                        (Default: 1234)
  -run_concatenated_step <INT> 
                        Control which concatenated analysis steps to run:
                        1: Construct concatenated alignment, 2: Infer species tree
                        (Default: 12)
  
  === Gene tree builder control ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)
  
  === Gene trees collapse threshold ===
  -collapse_threshold <VALUE>
                        Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
                        Nodes with support values ≤ this threshold will be collapsed into polytomies.

  === Nucleotide ambiguity character replacement ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_amas <INT>        AMAS.py threads (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads (Default: 1)
  -nt_fasttree <INT>    FastTree threads (Default: 1)
  -nt_raxml_ng <INT>    RAxML-NG threads (Default: 1)
  -nt_raxml <INT>       RAxML threads (Default: 1)
  -nt_astral4 <INT>     ASTRAL-IV threads (Default: 1)
  -nt_wastral <INT>     wASTRAL threads (Default: 1)
  -nt_astral_pro <INT>  ASTRAL-Pro3 threads (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of gene tree inference tasks in coalescent analysis to run concurrently. (Default: 1)
                        "all" means running all gene trees concurrently. (use this option with caution)

Arguments for integrated tools:
  === IQ-TREE (concatenated analysis) ===
  -iqtree_bb <INT>      IQ-TREE bootstrap replicates (Default: 1000)
  -iqtree_alrt <INT>    SH-aLRT replicates (Default: 1000)
  -iqtree_run_option <str>      
                        IQ-TREE run mode [standard|undo] (Default: undo)
  -iqtree_partition <TRUE/FALSE>       
                        Whether to use partition models in IQ-TREE (Default: TRUE)
  -iqtree_constraint_tree <Treefile>           
                        The path to the constraint tree for running IQ-TREE (Default: none)

  === ModelTest-NG ===
  -run_modeltest_ng <TRUE/FALSE>       
                        Whether to run ModelTest-NG (Default: TRUE)

  === RAxML ===
  -raxml_m <str>        RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
  -raxml_bb <INT>       RAxML bootstrap replicates (Default: 1000)
  -raxml_constraint_tree <Treefile>              
                        The path to the constraint tree for running RAxML (Default: no constraint tree)

  === RAxML-NG ===
  -rng_bs_trees <INT>   RAxML-NG bootstrap replicates (Default: 1000)
  -rng_force <TRUE/FALSE>              
                        Ignore thread warnings (Default: FALSE)
  -rng_constraint_tree <Treefile>                
                        The path to the constraint tree for running RAxML-NG (Default: no constraint tree)

  === ASTRAL-IV ===
  -astral4_root <STRING>
                        Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
                        (Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)  
  -astral_r <INT>       ASTRAL-IV rounds of search. (Default: 4)
  -astral_s <INT>       ASTRAL-IV rounds of subsampling. (Default: 4)

  === wASTRAL ===
  -wastral_mode <1-4>   wASTRAL mode [1|2|3|4] (Default: 1)
                        1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
  -wastral_r <INT>      wASTRAL rounds of search. (Default: 4)
  -wastral_s <INT>      wASTRAL rounds of subsampling. (Default: 4)

  === ASTRAL-Pro ===
  -astral_pro_r <INT>   ASTRAL-Pro rounds of search. (Default: 4)
  -astral_pro_s <INT>   ASTRAL-Pro rounds of subsampling. (Default: 4)

  === MAFFT (only for paralogs inclusion method -> ASTRAL-Pro) ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl (only for paralogs inclusion method -> ASTRAL-Pro) ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner (only for paralogs inclusion method -> ASTRAL-Pro) ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)

  === PhyPartsPieCharts & modified_phypartspiecharts ===
  -run_phyparts <TRUE|FALSE>
                        Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
                        Note: Requires successful completion of previous coalescent analysis.
  -phypartspiecharts_tree_type <cladogram/circle>
                        The tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
  -phypartspiecharts_num_mode <num>
                        Control what numbers to show on branches (specify 0-2 digits) (Default: 12)
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
                        9: Original node support values from the input tree

Command examples:
  # Run HybSuite stage4 with IQ-TREE
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 1
  
  # Run HybSuite stage4 with ASTRAL-IV
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 4

  # Run HybSuite stage4 with ASTRAL-IV and PhyParts
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 4 -run_phyparts TRUE

Parameters for running hybsuite full_pipeline

HybSuite full pipeline Manual
--------------------------------------------------------------------------------
Usage: hybsuite full_pipeline ...

Mandatory arguments: -input_list -input_data (required when including user-provided data) -t -output_dir

Essential arguments: -PH -sp_tree -seqs_min_length -aln_min_sample -prefix -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types. (Default: None)
  -input_data <DIR>     The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results. (Default: None)
  -NGS_dir <DIR>        The output directory containing raw and cleaned reads files (see GitHub documentation).
                        Note: If cleaned reads already exist, the read-trimming steps will be skipped.
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
                        Note: If assembled data already exist in this directory, the corresponding assembly steps will be skipped.
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Stages running control ===
  -skip_stage <1|2|3|12|123>
                        Specify pipeline stages to skip during execution. (Default: None, running all stages)
                        Note: Particularly useful for re-running specific HybSuite pipeline stages.
                        (e.g., '-skip_stage 1' for skipping stage 1)
  -run_to_stage <1|2|3> Specify pipeline stages to run up to (Default: None, running all stages)
                        (e.g., '-run_to_stage 3' for stopping before stage 4)

  === Public raw reads downloading control (Stage 1) ===
  -rm_sra <TRUE/FALSE>  Whether to remove SRA files after conversion. (Default: TRUE)
  -download_format <fastq|fastq_gz>
                        Downloaded data format. (Default: fastq_gz)

  === Putative paralogs filtering control (Stage 2) ===
  -seqs_min_length <INT>         
                        Minimum sequence length for filtered paralogs. (Default: 0)
                        Putative paralogs shorter than this value will be filtered.             
  -seqs_mean_length_ratio <0-1>
                        Minimum sequence length ratio relative to the mean length per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the mean length will be filtered.
  -seqs_max_length_ratio <0-1>
                        Minimum length ratio relative to the longest sequence per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the maximum length will be filtered.
  -seqs_min_sample_coverage <0-1>           
                        Minimum sample coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2 and HRS/RLWP sequences in stage 3, loci with sample coverage below this value will be filtered.
  -seqs_min_locus_coverage <0-1>            
                        Minimum locus coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, taxa (samples) with locus coverage below this value will be filtered.

  === Heatmap control (Stage 2&3) ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)

  === Paralog handling control (Stage 3) ===
  -PH <1-7|a|b|all>     Paralog handling methods to execute: (one or more of them can be chosen)
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)
  
  === Sequences and alignments filtering control (Stage 3) ===
  -seqs_min_length <INT>
                        Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
                        HRS and RLWP sequences shorter than this value will be removed.
  -aln_min_length <INT> 
                        Minimum alignment length (bp) for HRS and RLWP final alignments. (Default: 4)
  -aln_min_sample <INT>
                        Minimum sample number for final alignments. (Default: 5)
                        Final alignments (aligned and trimmed) with sample number below this threshold will be removed.

  === Alignments trimming tool control (Stage 3) ===
  -trim_tool <1/2>      Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1)
  
  === Gene trees builder control (Stage 3&4) ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)

  === Species tree builder control (Stage 4) ===
  -sp_tree <1-6|all>    Species tree inference method: (Default: 1)
                        1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL, 6: ASTRAL-Pro
  
  === Steps control in stage 4 ===
  -run_coalescent_step  <INT> 
                        Control which coalescent analysis steps to run:
                        1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
                        (Default: 1234)
  -run_concatenated_step <INT> 
                        Control which concatenated analysis steps to run:
                        1: Construct concatenated alignment, 2: Infer species tree
                        (Default: 12)
  
  === Nucleotide ambiguity character replacement (Stage 3&4) ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Gene trees collapse threshold ===
  -collapse_threshold <VALUE>
                        Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
                        Nodes with support values ≤ this threshold will be collapsed into polytomies.
  
  === Threads Control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_fasterq_dump <INT>               
                        fasterq-dump threads. (Default: 1)
  -nt_pigz <INT>        pigz compression threads. (Default: 1)
  -nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)
  -nt_hybpiper <INT>    HybPiper threads (Default: 1)
  -nt_paragone <INT>    ParaGone threads. (Default: 1)
  -nt_phylopypruner <INT>              
                        PhyloPyPruner threads. (Default: 1)
  -nt_mafft <INT>       MAFFT threads. (Default: 1)
  -nt_amas <INT>        AMAS.py threads. (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads. (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads. (Default: 1)
  -nt_fasttree <INT>    FastTree threads. (Default: 1)
  -nt_raxml_ng <INT>    RAxML-NG threads (Default: 1)
  -nt_raxml <INT>       RAxML threads (Default: 1)
  -nt_astral4 <INT>     ASTRAL-IV threads (Default: 1)
  -nt_wastral <INT>     wASTRAL threads (Default: 1)
  -nt_astral_pro <INT>  ASTRAL-Pro3 threads (Default: 1)

  === Parallel Control ===
  -process <INT|all>    Number of subprocesses to run concurrently. (Default: 1)
                        "all" means running all subprocesses concurrently. (use this option with caution)
                        The related steps are:
                        Stage 1: public data downloading and raw reads trimming;
                        Stage 2: data assembly ('hybpiper assemble');
                        Stage 3: multiple sequence alignment, alignment trimming, and gene tree inference;
                        Stage 4: gene tree inference in coalescent analysis.
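As a rough sizing guide, the peak core usage of a parallel step is the product of -process and the relevant per-tool thread option. The sketch below uses hypothetical values (4 concurrent MAFFT jobs, 2 threads each) purely to illustrate the arithmetic; it is not part of HybSuite:

```shell
# Hypothetical sizing for the Stage 3 alignment step:
PROCESS=4      # value passed to -process
NT_MAFFT=2     # value passed to -nt_mafft
PEAK=$((PROCESS * NT_MAFFT))
echo "peak cores used while aligning: $PEAK"
# Keep PEAK at or below the machine's core count, e.g.:
#   hybsuite full_pipeline ... -process 4 -nt_mafft 2
```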

  === Logfile Control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
  === SRAToolkit (Stage 1) ===
  -sra_maxsize <NUM>    The maximum size of SRA files to download. (Default: 20GB)

  === Trimmomatic (Stage 1) ===
  -trimmomatic_leading_quality <3-40>
                        Leading base quality cutoff. (Default: 3)
  -trimmomatic_trailing_quality <3-40> 
                        Trailing base quality cutoff. (Default: 3)
  -trimmomatic_min_length <36-100>
                        Minimum read length. (Default: 36)
  -trimmomatic_sliding_window_s <4-10> 
                        Sliding window size. (Default: 4)
  -trimmomatic_sliding_window_q <15-30>
                        Window average quality. (Default: 15)

  === HybPiper (Stage 2 & 3) ===
  -hybpiper_mapping_tool <blast|diamond>     
                        The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
  -hybpiper_check_chimeric_contigs	<FALSE|TRUE>
                        Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: FALSE)
  -hybpiper_cov_cutoff <INT>
                        Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
                        Increasing this value may increase loci recovery efficiency but may also introduce errors.
  -hybpiper_skip_chimeric_genes <FALSE|TRUE>
                        Whether to skip recovering sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
  -hybpiper_retrieved_seqs_type <dna|intron|supercontig>
                        The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3. (Default: dna, i.e., extract coding sequences)
  
  === PhyloPyPruner (Stage 3) ===
  -pp_min_taxa <INT>    Minimum taxa per cluster. (Default: 4)
  -pp_min_support <0-1> Minimum support value. (Default: 0=auto)
  -pp_trim_lb <INT>     Trim long branches. (Default: 5)

  === ParaGone (Stage 3) ===  
  -paragone_pool <INT>  Parallel alignment tasks. (Default: 1, same as the option '-process')
  -treeshrink_q_value <0-1>        
                        TreeShrink quantile threshold (Default: 0.05)
  -paragone_cutoff_value <FLOAT>       
                        Branch length cutoff (Default: 0.3)
  -paragone_minimum_taxa <INT>         
                        Minimum taxa per alignment (Default: 4)
  -paragone_min_tips <INT>             
                        Minimum tips per tree (Default: 4)
  
  === MAFFT (Stage 3) ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl (Stage 3) ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner (Stage 3) ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)
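The ordering constraint can be checked mechanically before launching a long run. The snippet below is a hypothetical stand-alone validation of a -hmmcleaner_cost string (shown with the default values); it is not part of HybSuite:

```shell
# Validate that a -hmmcleaner_cost string satisfies NUM1 < NUM2 < 0 < NUM3 < NUM4.
COST="-0.15_-0.08_0.15_0.45"   # the HybSuite default
printf '%s\n' "$COST" |
  awk -F'_' '{ if ($1 < $2 && $2 < 0 && 0 < $3 && $3 < $4) print "valid"; else print "invalid" }'
```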
  
  === IQ-TREE (Stage 4) ===
  -iqtree_bb <INT>      IQ-TREE bootstrap replicates (Default: 1000)
  -iqtree_alrt <INT>    SH-aLRT replicates (Default: 1000)
  -iqtree_run_option <str>      
                        IQ-TREE run mode [standard|undo] (Default: undo)
  -iqtree_partition <TRUE/FALSE>       
                        Whether to use partition models in IQ-TREE (Default: TRUE)
  -iqtree_constraint_tree <Treefile>           
                        The path to the constraint tree for running IQ-TREE (Default: none)

  === ModelTest-NG (Stage 4) ===
  -run_modeltest_ng <TRUE/FALSE>       
                        Whether to run ModelTest-NG (Default: TRUE)

  === RAxML (Stage 4) ===
  -raxml_m <str>        RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
  -raxml_bb <INT>       RAxML bootstrap replicates (Default: 1000)
  -raxml_constraint_tree <Treefile>              
                        The path to the constraint tree for running RAxML (Default: no constraint tree)

  === RAxML-NG (Stage 4) ===
  -rng_bs_trees <INT>   RAxML-NG bootstrap replicates (Default: 1000)
  -rng_force <TRUE/FALSE>              
                        Ignore thread warnings (Default: FALSE)
  -rng_constraint_tree <Treefile>                
                        The path to the constraint tree for running RAxML-NG (Default: no constraint tree)

  === ASTRAL-IV (Stage 4) ===
  -astral4_root <STRING>
                        Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
                        (Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)
  -astral_r <INT>       ASTRAL-IV rounds of search. (Default: 4)
  -astral_s <INT>       ASTRAL-IV rounds of subsampling. (Default: 4)

  === wASTRAL (Stage 4) ===
  -wastral_mode <1-4>   wASTRAL mode [1|2|3|4] (Default: 1)
                        1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
  -wastral_r <INT>      wASTRAL rounds of search. (Default: 4)
  -wastral_s <INT>      wASTRAL rounds of subsampling. (Default: 4)

  === ASTRAL-Pro (Stage 4) ===
  -astral_pro_r <INT>   ASTRAL-Pro rounds of search. (Default: 4)
  -astral_pro_s <INT>   ASTRAL-Pro rounds of subsampling. (Default: 4)

  === PhyPartsPieCharts & modified_phypartspiecharts (Stage 4) ===
  -run_phyparts <TRUE|FALSE>
                        Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
                        Note: Requires successful completion of previous coalescent analysis.
  -phypartspiecharts_tree_type <cladogram/circle>
                        The tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
  -phypartspiecharts_num_mode <num>
                        Control what numbers to show on branches (specify 0-2 digits) (Default: 12)
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
                        9: Original node support values from the input tree
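The proportion modes (4-8) are simple ratios over the per-branch gene counts. The snippet below works through modes 4 and 7 with made-up counts (120 supporting, 30 conflicting, 50 no-signal) to make the denominators explicit; the numbers are purely illustrative:

```shell
# Hypothetical per-branch gene counts (not real PhyParts output):
BLUE=120        # genes supporting the species tree
RED_GREEN=30    # genes conflicting with the species tree
GRAY=50         # genes with no signal
TOTAL=$((BLUE + RED_GREEN + GRAY))
# Mode 4: proportion of supporting genes = blue / total
awk -v b="$BLUE" -v t="$TOTAL" 'BEGIN{printf "mode 4: %.2f\n", b/t}'
# Mode 7: supporting over all signal genes = blue / (blue + red + green)
awk -v b="$BLUE" -v s="$((BLUE + RED_GREEN))" 'BEGIN{printf "mode 7: %.2f\n", b/s}'
```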

Command examples:
  === Run the full pipeline with all paralog-handling methods and all species tree inference approaches ===
  hybsuite full_pipeline \
  -input_list ./Input_list.txt \
  -input_data ./Input_data \
  -t Angiosperms353.fasta \
  -PH 1234567 \
  -sp_tree 12345 \
  -output_dir ./ \
  -nt 5 -process 5
  
  === Run the full pipeline with only tree-based orthology inference methods (MO/MI/RT/1to1) in ParaGone and ASTRAL-IV ===
  hybsuite full_pipeline \
  -input_list ./Input_list.txt \
  -input_data ./Input_data \
  -t Angiosperms353.fasta \
  -PH 4567b \
  -sp_tree 4 \
  -output_dir ./ \
  -nt 5 -process 5

5 - Tutorial

This page helps you prepare input files, configure parameters, and run HybSuite.


1. Prepare input files

(1) The sample list file

This file should list sample names along with their corresponding sequence-type identifiers (separated by \t). The HybSuite pipeline supports three input sequence types:

Type1: public raw reads from NCBI SRA

To download NGS raw reads from NCBI for phylogenetic analysis, you should:

Format the sample list file as:

  • Column 1: Sample names
  • Column 2: Corresponding accession numbers

(Tab-delimited, one sample per line)

Taxon1	SRR...
Taxon2	SRR...
...

Type2: user-provided raw reads

To prepare your own existing NGS raw reads as input, you should:

1. Format the sample list file as:

  • Column 1: Sample names
  • Column 2: The character A

(Tab-delimited, one sample per line)

Taxon3	A
Taxon4	A
...

2. Place the raw data files (paired-end/single-end, in FASTQ or FASTQ.GZ format) with corresponding names in the -input_data directory (see naming rules here).


Type3: pre-assembled sequences

To prepare pre-assembled sequences as input, you should:

1. Format the sample list file as:

  • Column 1: Sample names
  • Column 2: The character B

(Tab-delimited, one sample per line)

Taxon5	B
Taxon6	B
...

2. Place the corresponding pre-assembled sequence files in the -input_data directory (see naming rules here).


Combine sequence types together

To include multiple sequence types shown above as pipeline inputs, simply combine all entries in the sample list file:

Taxon1	SRR...
Taxon2	SRR...
Taxon3	A
Taxon4	A
Taxon5	B
Taxon6	B

Outgroup Specification

This step is not necessary if you only run Stage 1 and Stage 2, but it is required for Stages 3-4 (orthology inference & species tree inference):

To specify outgroups in the sample list file, mark each outgroup sample with the word Outgroup in column 3 (tab-separated from column 2). Example (outgroup = Taxon5):

Taxon1	SRR...
Taxon2	SRR...
Taxon3	A
Taxon4	A
Taxon5	B	Outgroup
Taxon6	B
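Shell one-liners can help build and sanity-check such a list; the sketch below is illustrative only (the file name, accessions, and taxa are placeholders), not part of HybSuite:

```shell
# Illustrative only: build a tab-delimited sample list like the one above
printf 'Taxon1\tSRR000001\nTaxon3\tA\nTaxon5\tB\tOutgroup\n' > Input_list.txt

# Sanity check: every row must hold 2 or 3 tab-separated fields
awk -F'\t' 'NF < 2 || NF > 3 {print "bad row " NR ": " $0; bad=1} END {exit bad}' Input_list.txt \
  && echo "sample list OK"
# prints: sample list OK
```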

(2) The directory containing sequence files

Required if:

Your sample list includes existing raw reads or pre-assembled sequences.

1. Naming rules for existing raw reads files in the input directory:

If <Taxon> is listed in the sample list file, its raw data file should be named as follows:

  • Paired-end data: <Taxon>_1.<suffix> + <Taxon>_2.<suffix>
  • Single-end data: <Taxon>.<suffix>

2. Naming rules for pre-assembled sequences in the input directory:

If <Taxon> is listed in the sample list file, its pre-assembled sequence file should be named as follows:

  • <Taxon>.fasta
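As an illustration of these rules, a mixed -input_data directory could be laid out like this (the taxon names and the .fq.gz suffix are placeholder assumptions):

```shell
# Illustrative only: an -input_data layout matching the naming rules above
mkdir -p Input_data
touch Input_data/Taxon3_1.fq.gz Input_data/Taxon3_2.fq.gz   # paired-end raw reads
touch Input_data/Taxon4.fq.gz                               # single-end raw reads
touch Input_data/Taxon5.fasta                               # pre-assembled sequences
ls Input_data
```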

(3) The target sequence file

  • The required format for the target sequences file is nearly identical to that used in HybPiper.
  • The only difference is that in HybSuite, the gene name for a target sequence must be placed immediately after the final hyphen (-) in the header line (see the example shown below).
  • For example, see the Reference.fasta file in the example dataset.
  • For more details, refer to HybPiper’s documentation to edit your target file.
>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
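Because the locus name sits after the final hyphen, it can be extracted from each header with a simple awk one-liner (illustrative only, not part of HybSuite):

```shell
# The locus name is the token after the final hyphen in each header line
printf '>Elaeagnus-pungens-4471\nAATG\n>Hippophae-rhamnoides-4527\nGAAG\n' \
  | awk -F'-' '/^>/ {print $NF}'
# prints:
# 4471
# 4527
```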

2. Construct command

(1) Basic command pattern

Conda version:

hybsuite <subcommand> [options] ...

Local version:

bash <path to HybSuite.sh> <subcommand> [options] ...

(2) Available subcommands

HybSuite provides modular subcommands for flexible workflow execution:

Run individual stages:

hybsuite stage1 [options]...    # run Stage 1: NGS dataset construction
hybsuite stage2 [options]...    # run Stage 2: Data assembly and filtering
hybsuite stage3 [options]...    # run Stage 3: Paralog handling
hybsuite stage4 [options]...    # run Stage 4: Species tree inference

Run the full pipeline:

hybsuite full_pipeline [options]...    # Execute stages 1-4 in one go

Retrieve results:

After completing the full pipeline from stage 1 to 4, retrieve key output files:

hybsuite retrieve_results -i <hybsuite_comprehensive_output_dir> -o <results_dir>

This subcommand collects all final trees and summary statistics from the HybSuite comprehensive output directory.


3. Configure your parameters

This section guides you through running each stage sequentially and how to configure related parameters.

Assuming you have prepared all required input files (the sample list file, the input sequence directory, and the target sequence file) in the formats described above, you can move forward and run each stage sequentially or run the full pipeline in one go.


HybSuite checking

Before running each stage or the full pipeline, HybSuite automatically checks all dependencies, configured parameters, and sample information. If any invalid parameters or incorrectly formatted input files are detected, the program notifies you and exits.

After the checks, the program prompts you to confirm whether to proceed with running HybSuite (y) or not (n).

To skip checking, specify -check as FALSE when running the pipeline.


Run hybsuite stage1

Purpose: run HybSuite Stage 1: "NGS dataset construction": download public raw reads, integrate user-provided data, and perform adapter trimming.

Step 1: Configure mandatory parameters

The following parameters must be specified when running hybsuite stage1 (failing to do so will cause HybSuite to exit during execution):

(1) Configure input file parameters:

  • -input_list <FILE>
    Specify the sample list file (as described in the section (1) The sample list file)
  • -input_data <DIR>
    Specify the directory containing all user-provided raw reads and pre-assembled sequences (not needed if the sample list includes no user-provided raw reads or pre-assembled sequences).

(2) Configure output file parameters:

  • -output_dir <DIR>
    Specify the directory for storing HybSuite’s comprehensive output files. Using the same directory across all stages is recommended so that all outputs end up in one folder.
Step 2: Configure essential parameters

The following parameters are essential for running this stage; check whether to configure them depending on your analysis.

  • -nt <NUM>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of samples processed in parallel (Default: 1). This applies to the “raw read downloading” and “adapter trimming” steps in Stage 1.
  • -sra_maxsize <?GB>
    Specify the maximum SRA file size to download (Default: 20GB). If a sample’s raw read files on NCBI exceed this size, downloading is skipped for that sample.
  • -NGS_dir <DIR>
    Specify the output directory for raw and clean read files (Default: <output_dir>/NGS_dataset, where <output_dir> is set by -output_dir)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage1 here for more customizable settings.

Example command:

# basic common pattern:
hybsuite stage1 -input_list <FILE> -input_data <DIR> -output_dir <DIR> -nt <NUM|AUTO> -process <NUM>

# command for our example dataset (Angiosperms353) (8 threads; 5 samples in parallel)
cd <Path_to_"HybSuite-master/example_datasets/Angiosperms353/">
hybsuite stage1 -input_list ./Input_list.txt -input_data ./Input_data -output_dir ./Output -nt 8 -process 5
Step 4: Check output files

After running stage 1, you can check the output files for your analysis.
See the output files for hybsuite stage1 here for more details.


Run hybsuite stage2

Purpose: run HybSuite Stage 2: "Data assembly and paralog retrieval": assemble reads using HybPiper, retrieve paralog sequences, and filter them by length and by sample or locus coverage.

Step 1: Configure mandatory parameters

(1) Input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stage 1; see here).
  • -NGS_dir <DIR>
    NGS dataset output by Stage 1.
    (default: <output_dir>/NGS_dataset)
  • -input_data <DIR>
    The same parameter as specified in Stage 1.
    This option is required only when pre-assembled sequences are provided as input.
  • -t <FILE>
    Target sequence file in HybPiper format.

For example, if you already have clean paired-end data for <taxon1> and clean single-end data for <taxon2>, place them with the file names shown below so that HybSuite skips processing these two samples:

<NGS_dir>/
β”œβ”€β”€01-Downloaded_raw_data
β”œβ”€β”€02-Downloaded_clean_data
└──03-My_clean_data
    β”œβ”€β”€<taxon1>_1_clean.paired.fq.gz
    β”œβ”€β”€<taxon1>_2_clean.paired.fq.gz
    └──<taxon2>_clean.single.fq.gz

(2) Output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.

  • -eas_dir <DIR>
    Specify the output directory for storing assembled results (one sample per subdirectory) generated by hybpiper assemble (see the HybPiper manual).
    (default: <output_dir>/01-Assembled_data)

For example, if you have directories of assembled sequences produced by hybpiper assemble for <taxon1> and <taxon2>:

  • Create a directory, place the two directories named <taxon1> and <taxon2> in it, and specify -eas_dir as the path to this new directory (shown below; <eas_dir> is the directory specified by -eas_dir).
<eas_dir>/
β”œβ”€β”€<taxon1>
└──<taxon2>
  • Then, include the names of <taxon1> and <taxon2> in the sample list file (specified by -input_list).
Step 2: Configure essential parameters

The following parameters are optional but essential for this stage; check whether to configure them depending on your analysis.

(1) Thread and parallel:

  • -nt <NUM|AUTO>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of samples processed in parallel (Default: 1). This applies to the “data assembly” step in Stage 2.
  • -eas_dir <DIR>
    Specify existing assembled data to skip redundant assembly.

(2) Paralog sequence filtering:

  • -seqs_min_sample_coverage <NUM:0-1>
    Specify the minimum sample coverage of recovered loci. Loci with sample coverage below this threshold are removed.
    (default: 0, recommended value: 0.1)
  • -seqs_min_locus_coverage <NUM:0-1>
    Specify the minimum locus coverage of samples. Samples with locus coverage below this threshold are removed.
    (default: 0)
  • -seqs_min_length <NUM>
    Specify the minimum sequence length for filtering paralog sequences. Sequences shorter than this threshold are removed.
    (default: 0, recommended value: 100)
  • -seqs_mean_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the mean length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)
  • -seqs_max_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the maximum length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)
  • -seqs_min_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the minimum length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)

(3) Parameters related to hybpiper assemble:

  • -hybpiper_mapping_tool <blast|diamond>
    Specify the read mapping tool used in data assembly via hybpiper assemble.
    (default: blast)
  • -hybpiper_mapping_tool <TRUE|FALSE>
    Specify whether to check chimeric contigs.
    (default: TRUE)
  • -hybpiper_cov_cutoff <INT>
    Coverage cutoff for SPAdes when running “hybpiper assemble” in Stage 2.
    Increasing this value may improve locus recovery efficiency but can also introduce errors.
    (Default: 8)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage2 here for more customizable settings.

Example command

# Recommended command mode
hybsuite stage2 \
  -input_list <FILE> \
  -NGS_dir <DIR> \
  -t <FILE> \
  -output_dir <DIR> \
  -seqs_min_length <NUM> \
  -seqs_min_sample_coverage <NUM> \
  -nt <NUM> -process <NUM>

# Command for the example dataset (Angiosperms353)
hybsuite stage2 \
  -input_list ./Input_list.txt \
  -NGS_dir ./NGS_dataset \
  -t ./Target_file_Angiosperm353.fasta \
  -output_dir ./Output \
  -seqs_min_length 100 \
  -seqs_min_sample_coverage 0.1 \
  -nt 8 -process 5
Step 4: Check output files

After running stage 2, you can check the output files for your analysis.
See the output files for hybsuite stage2 here for more details.


Run hybsuite stage3

Purpose: run HybSuite Stage 3: "Paralog handling": optionally apply seven paralog-handling methods to infer orthology groups and generate final alignments for Stage 4 (species tree inference).

Step 1: Configure mandatory parameters:

(1) Input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stage 1; see here).
  • -eas_dir <DIR>
    Directory containing assembled sequences (one sample per subdirectory) generated by hybpiper assemble (see the HybPiper manual) in Stage 2 or provided by users. (default: <output_dir>/01-Assembled_data)
  • -paralogs_dir <DIR>
    Directory containing all paralog sequences generated in Stage 2 or provided by users. If Stage 2 has been executed, set this parameter to <output_dir>/02-All_paralogs/03-Filtered_paralogs, where <output_dir> refers to the comprehensive output directory specified by -output_dir.
    (default: none)
  • -input_data <DIR>
    The same parameter as specified in Stage 1.
    This option is required only when pre-assembled sequences are provided as input and the “HRS” or “RLWP” methods are selected.
  • -t <FILE>
    Target sequence file in HybPiper format.

(2) Output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
  • -prefix <STRING>
    Output file prefix. (default: HybSuite)
Step 2: Configure essential parameters:

The following parameters are optional but essential for this stage; check whether to configure them based on your analysis.

(1) Thread and parallel:

  • -nt <NUM|AUTO>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of parallel processes for this stage (Default: 1).

(2) Paralog-handling methods applied in this stage:

  • -PH <1-7|a|b|all>
    Paralog handling methods (1=HRS, 2=RLWP, 3=LS, 4=MI, 5=MO, 6=RT, 7=1to1, a=PhyloPyPruner, b=ParaGone)
    (One or several methods can be specified; default: 1a)

(3) Sequence filtering:

  • -seqs_min_length <INT>
    Minimum HRS/RLWP sequence length (Default: 0)
    Only the HRS and RLWP methods filter sequences in this stage; the other paralog-handling methods filter sequences in Stage 2.
  • -aln_min_length <INT>
    Minimum alignment length (Default: 4)
  • -aln_min_sample <INT>
    Minimum sample number per alignment (Default: 0)

(4) Gene tree construction:

  • -gene_tree <1/2>
    Gene tree builder: 1=IQ-TREE, 2=FastTree (Default: 1)
  • -gene_tree_bb <INT>
    Bootstrap value (Default: 1000)
  • -trim_tool <1/2>
    Trimming tool: 1=trimAl, 2=HMMCleaner (Default: 1)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage3 here for more customizable settings.

Example command:

# Recommended command
# (choose -PH based on your data size and time budget; trying all methods,
# including "4567b" with ParaGone, is still suggested)
hybsuite stage3 \
  -input_list <FILE> \
  -eas_dir ./01-Assembled_data \
  -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
  -t <FILE> \
  -output_dir <DIR> \
  -PH 1234567a \
  -nt 8 -process 5

# Command for the example dataset (Angiosperms353)
hybsuite stage3 \
  -input_list ./Input_list.txt \
  -eas_dir ./01-Assembled_data \
  -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
  -t Target_file_Angiosperm353.fasta \
  -output_dir ./Output \
  -PH 1234567a \
  -mafft_algorithm linsi \
  -nt 8 -process 5
Step 4: Check output files

After running stage 3, you can check the output files for your analysis.
See the output files for hybsuite stage3 here for more details.


Run hybsuite stage4

Purpose: run HybSuite Stage 4: "Species tree inference": infer species trees using concatenation-based and/or coalescent-based approaches.

Step 1: Configure mandatory parameters:

(1) Configure input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stages 1-3; see here).
  • -aln_dir <DIR>
    Path to the 06-Final_alignments directory containing the final alignments generated in Stage 3.

(2) Configure output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
Step 2: Configure essential parameters:

The following parameters are optional but essential for this stage; check whether to configure them based on your analysis.

  • -PH <1-7|a|b|all>
    Select alignments from one or more specific paralog-handling methods. Keep this consistent with the setting used in Stage 3. Default: 1.
  • -sp_tree <1-6|all>
    Species tree inference methods. Default: 1.
    1=IQ-TREE, 2=RAxML, 3=RAxML-NG, 4=ASTRAL-IV, 5=wASTRAL, 6=ASTRAL-Pro
  • -nt <INT|AUTO>
    Thread setting.

Gene tree construction (for coalescent methods):

  • -gene_tree <1/2>
    Gene tree builder. Default: 1.
  • -gene_tree_bb <INT>
    Bootstrap value. Default: 1000.
  • -collapse_threshold <VALUE>
    Collapse weakly supported branches. Default: 0.

Concatenation-based methods:

  • -run_modeltest_ng <TRUE/FALSE>
    Run ModelTest-NG. Default: TRUE.
  • -iqtree_bb <INT>
    IQ-TREE bootstrap replicates. Default: 1000.
  • -iqtree_partition <TRUE/FALSE>
    Use partition models. Default: TRUE.

Coalescent-based methods:

  • -astral_r <INT>
    ASTRAL-IV search rounds. Default: 4.
  • -wastral_mode <1-4>
    wASTRAL weighting mode. Default: 1.
  • -run_phyparts <TRUE/FALSE>
    Run PhyParts analysis. Default: TRUE.

Example command:

# Concatenation-based method (IQ-TREE)
hybsuite stage4 \
  -input_list <FILE> \
  -aln_dir <output_dir>/06-Final_alignments \
  -output_dir <DIR> \
  -PH 1234567a \
  -sp_tree 1 \
  -nt 8

# Coalescent-based method (ASTRAL-IV)
hybsuite stage4 \
  -input_list sample_list.txt \
  -aln_dir <output_dir>/06-Final_alignments \
  -output_dir <DIR> \
  -PH 1a \
  -sp_tree 4 \
  -run_phyparts TRUE \
  -nt 8 -process 5

Run hybsuite full_pipeline

Purpose: run stages 1-4 sequentially in a single command.

Step 1: Configure mandatory parameters:
  • -input_list <FILE>
    Sample list file.
  • -t <FILE>
    Target file.
  • -output_dir <DIR>
    Output directory.

-input_data is required when including user-provided data.

Step 2: Configure essential parameters:

Methods control:

  • -PH <1-7|a|b|all>
    Paralog handling methods. Default: 1a.
  • -sp_tree <1-6|all>
    Species tree inference methods. Default: 1.

Threading:

  • -nt <INT|AUTO>
    Global thread setting. Default: 1.
  • -process <INT|all>
    Parallel processing. Default: 1.

Workflow control:

  • -skip_stage <1|12|123>
    Skip completed stages. Default: none.
  • -run_to_stage <1|2|3|4>
    Stop at specific stage. Default: 4.

Logging control:

  • -log_mode <simple|cmd|full>
    Logging verbosity. Default: cmd.
    • simple: Only log key information
    • cmd: Log key information + command history
    • full: Log detailed information + command history

Example command:

# Complete pipeline with all paralog-handling methods
hybsuite full_pipeline \
  -input_list sample_list.txt \
  -input_data Input_data \
  -t Angiosperms353.fasta \
  -output_dir ./ \
  -PH 1234567ab \
  -sp_tree 12345 \
  -seqs_min_length 100 \
  -aln_min_sample 4 \
  -nt AUTO -process 10

# Pipeline with existing NGS data, skip stage 1
hybsuite full_pipeline \
  -input_list sample_list.txt \
  -NGS_dir ./NGS_dataset \
  -t Angiosperms353.fasta \
  -output_dir ./ \
  -skip_stage 1 \
  -PH 1a \
  -sp_tree 14 \
  -nt 8 -process 5
Step 3: Configure other parameters

See the full parameters for running hybsuite full_pipeline here for more customizable settings.


4. Tips for rerunning the pipeline

If the initial results of HybSuite are not satisfactory, you can rerun the pipeline with modified parameters. The following strategies can help improve results or reduce runtime. These methods can be used individually or combined.

(1) Remove or add samples

You can remove or add samples by editing the sample_list.txt file and rerunning HybSuite.

If the same -output_dir is used, HybSuite automatically detects completed samples and skips the following steps:

  • Public data downloading (stage 1)
  • Raw reads trimming (stage 1)
  • Data assembly (stage 2)

This allows you to update the dataset without repeating previously completed computations.


(2) Reuse existing intermediate data

HybSuite allows reuse of intermediate results from previous runs.

Use the following options:

  • -NGS_dir
    Use the NGS_dataset directory generated by a previous run to skip data downloading and adapter trimming.

  • -eas_dir
    Use the 01-Assembled_data directory generated by a previous run to skip the data assembly step.

Reusing intermediate data can significantly reduce runtime when rerunning analyses.


(3) Adjust sequence filtering thresholds

You can improve dataset quality by adjusting filtering thresholds when rerunning the pipeline.

Common parameters include:

  • -seqs_min_length
    Minimum sequence length retained in stage 2.
  • -seqs_min_sample_coverage
    Minimum proportion of samples containing a sequence.
  • -aln_min_sample
    Minimum number of samples required for an alignment in stage 3.

Increasing these thresholds can improve alignment quality and downstream analyses.

(4) Steps control in concatenation-based and coalescent-based analysis

HybSuite allows selective execution of steps in both concatenation-based and coalescent-based analyses.

Concatenation-based analysis

Use -run_concatenated_step to specify which steps to run.

For example, if a concatenated alignment has already been generated in a previous run, you can skip the concatenation step and directly infer the species tree by setting:

-run_concatenated_step 2

Coalescent-based analysis

Use -run_coalescent_step to control the execution of coalescent-based analysis.

For example, if gene trees are already available from a previous run, you can skip step 1 and directly infer the species tree by setting:

-run_coalescent_step 234

6 - Installation

This page tells you how to install HybSuite step by step.


1. Install HybSuite via conda

(1) Prerequisites

  • Conda installation is required. If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.

(2) Step-by-step installation

To avoid dependency conflicts, creating a new conda environment for HybSuite is recommended:

conda create -n hybsuite

Then activate the newly created conda environment and install hybsuite directly from the yuxuanliu channel:

conda activate hybsuite
conda install yuxuanliu::hybsuite

Before installing hybsuite, you can edit your ~/.condarc file as follows to avoid channel-related installation issues:

channels:
  - conda-forge
  - bioconda
  - yuxuanliu
  - defaults

(3) Verification

After installation, you can check the help menu of HybSuite to confirm successful installation by running:

hybsuite -h

2. Install HybSuite manually

(1) Prerequisites

  • Conda installation is required. If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.

(2) Package installation

Clone the GitHub repository directly:

git clone https://github.com/Yuxuanliu-HZAU/HybSuite.git

(3) Verification

After installation, you can check the help menu of HybSuite to confirm successful installation by running:

bash <absolute or relative path to HybSuite.sh> -h

(4) Dependencies installation

The most convenient way to install all dependencies for HybSuite is to run the script HybSuite-master/Install_all_dependencies.sh.
Before running this script, activate your target conda environment.

conda activate <conda_environment_name>
bash HybSuite-master/Install_all_dependencies.sh

Method 2: Install dependencies manually

If some dependencies fail to install when running Install_all_dependencies.sh, it is advisable to install them manually. Follow these steps:

conda create -n <conda_environment_name>
conda activate <conda_environment_name>
conda install conda-forge::mamba -y
mamba install python=3.9.15 -y
mamba install bioconda::hybpiper -y
mamba install bioconda::paragone -y
mamba install bioconda::amas -y
mamba install bioconda::sra-tools -y
mamba install conda-forge::pigz -y
conda install conda-forge::plotly -y
mamba install bioconda::newick_utils -y
mamba install bioconda::mafft -y
mamba install bioconda::trimal -y
mamba install bioconda::iqtree -y
mamba install bioconda::raxml -y
mamba install bioconda::raxml-ng -y
mamba install bioconda::aster -y
mamba install r
pip install ete3
pip install PyQt5
pip install phylopypruner
pip install phykit
R
install.packages("phytools")
install.packages("ape")

3. Dependencies

7 - Output files


Output File Naming Conventions

Placeholder       Represents
<PH>              Any of the 7 orthology inference methods: HRS, RLWP, LS, MI, MO, RT, 1to1
<taxon>           Taxon name (e.g., <taxon1>, <taxon2>, etc.) from your sample list file
<prefix>          User-specified output prefix (via the -prefix option)
<locus_name>      Target sequence locus (e.g., <locus_name1>, <locus_name2>, etc.)

Each output file is explained in detail below.


Stage1 output

<NGS_dataset> (specified by -NGS_dir)

A directory containing next-generation raw sequencing data downloaded from public databases, existing raw reads provided by the user, and clean data produced by Trimmomatic-0.39.

<NGS_dataset>/
β”œβ”€β”€ 01-Downloaded_raw_data/
β”œβ”€β”€ 02-Downloaded_clean_data/
└── 03-My_clean_data/

<NGS_dataset> -> 01-Downloaded_raw_data

A directory containing next-generation raw sequencing data downloaded from public databases.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    β”œβ”€β”€ 01-Raw-reads_sra/
    └── 02-Raw-reads_fastq_gz/

<NGS_dataset> -> 01-Downloaded_raw_data -> 01-Raw-reads_sra

A directory containing raw sequencing data downloaded from NCBI in .sra format.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 01-Raw-reads_sra/
        β”œβ”€β”€ <taxon>.sra
        ...
  • <taxon>.sra: File with raw sequencing data in SRA format.

By default, all *.sra files in this directory will be removed after converting them into fastq format to save space, unless you specify the option -rm_sra as FALSE to keep them.

<NGS_dataset> -> 01-Downloaded_raw_data -> 02-Raw-reads_fastq_gz

A directory containing raw sequencing data in .fastq or .fastq.gz format.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 02-Raw-reads_fastq_gz/
        β”œβ”€β”€ <taxon>.fastq.gz or <taxon>.fastq
        ...
  • <taxon>.fastq.gz or <taxon>.fastq:

If -download_format is set to fastq, pigz is not used to compress the original .fastq files, so this folder contains <taxon>.fastq.
If -download_format is set to fastq.gz (the default), pigz compresses the original .fastq files, so this folder contains <taxon>.fastq.gz.

<NGS_dataset> -> 02-Downloaded_clean_data

A directory containing sequencing data cleaned from downloaded public raw reads.

<NGS_dataset>/
└── 02-Downloaded_clean_data/
    β”œβ”€β”€ <taxon>_1_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_1_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_clean.single.fq.gz
    ...    
  • <taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
    Files with compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
  • <taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
    Files with compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
  • <taxon>_clean.single.fq.gz
    File with compressed cleaned sequencing data (single-end type) in fq.gz format (these files will be used for downstream analysis).

<NGS_dataset> -> 03-My_clean_data

A directory containing user-provided cleaned sequencing data or sequencing data cleaned from user-provided raw data.

<NGS_dataset>/
└── 03-My_clean_data/
    β”œβ”€β”€ <taxon>_1_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_1_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_clean.single.fq.gz
    ...   
  • <taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
    Files with user-provided compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
  • <taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
    Files with user-provided compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
  • <taxon>_clean.single.fq.gz
    File with user-provided compressed cleaned sequencing data (single-end type) in fq.gz format (these files will be used for downstream analysis).

Stage2 output

01-Assembled_data

A directory containing assembled sequence data produced by hybpiper assemble command in HybPiper.

01-Assembled_data/
β”œβ”€β”€ Assembled_data_namelist.txt    
β”œβ”€β”€ Old_assembled_data_namelist_<current_time>.log
β”œβ”€β”€ <taxon>/
    ...
  • Assembled_data_namelist.txt A file containing sample names used as input to run the hybpiper assemble command.
  • Old_assembled_data_namelist_<current_time>.log A file containing previous sample names used as input to run the hybpiper assemble command.
  • <taxon> More details can be found here.

02-All_paralogs

A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper, filtered paralogs, along with their paralog heatmaps and related statistical results.

02-All_paralogs/
β”œβ”€β”€ 01-Original_paralogs
β”œβ”€β”€ 02-Original_paralog_reports_and_heatmap
β”œβ”€β”€ 03-Filtered_paralogs
└── 04-Filtered_paralog_reports_and_heatmap

02-All_paralogs -> 01-Original_paralogs

A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs/
└── 01-Original_paralogs/
    └── <locus_name>_paralogs_all.fasta
  • <locus_name>_paralogs_all.fasta: One FASTA file per locus, containing all putative paralog sequences recovered by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs -> 02-Original_paralog_reports_and_heatmap

A directory containing all original reports and heatmaps.

02-All_paralogs/
└── 02-Original_paralog_reports_and_heatmap/
    β”œβ”€β”€ Original_paralog_heatmap.png
    β”œβ”€β”€ Original_paralog_report.tsv
    β”œβ”€β”€ Original_recovered_seqs_length.tsv
    └── Original_recovery_heatmap.html
  • Original_paralog_heatmap.png
    A heatmap image file in PNG format, depicting the number of original putative paralog sequences for each locus/sample.
  • Original_paralog_report.tsv
    A TSV file recording the number of original putative paralog sequences for each locus/sample.
  • Original_recovered_seqs_length.tsv
    A TSV file recording the length of original recovered sequences for each locus/sample.
  • Original_recovery_heatmap.html
    An interactive HTML file for visualizing target locus recovery across all original paralogs (including both single-copy and multi-copy genes).

Here is an example recovery heatmap you can play with: it shows the recovery of Angiosperms353 (Johnson et al., 2019) loci from 10 Elaeagnaceae species in our example dataset.

  • The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
  • The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.

Now, let’s explore this interactive HTML file:

  • Choose the button “Sort by” as “Descending” to sort samples and loci on the heatmap from high to low recovery.
  • Click on the “Plus” (+) and “Minus” (-) icons in the upper right corner to zoom in and out of the heatmap.
  • Click on the “AutoScale” icon in the upper right corner to auto-scale the heatmap.
  • Click the “Camera” (πŸ“·) icon in the upper right corner to download the current heatmap view as a PNG file.

02-All_paralogs -> 03-Filtered_paralogs

A directory containing the putative paralogs that remain after filtering the original sequences retrieved by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs/
└── 03-Filtered_paralogs/
    └── <locus_name>_paralogs_all.fasta

02-All_paralogs -> 04-Filtered_paralog_reports_and_heatmap

A directory containing all filtered reports and heatmaps.

02-All_paralogs/
└── 04-Filtered_paralog_reports_and_heatmap/
    ├── Filtered_paralog_heatmap.png
    ├── Filtered_paralog_report.tsv
    ├── Filtered_recovered_seqs_length.tsv
    └── Filtered_recovery_heatmap.html
  • Filtered_paralog_heatmap.png
    A heatmap image file in PNG format, depicting the number of filtered putative paralog sequences for each locus/sample.
  • Filtered_paralog_report.tsv
    A TSV file recording the number of filtered putative paralog sequences for each locus/sample.
  • Filtered_recovered_seqs_length.tsv
    A TSV file recording the length of filtered recovered sequences for each locus/sample.
  • Filtered_recovery_heatmap.html
    An interactive HTML file for visualizing target locus recovery across all filtered paralogs (including both single-copy and multi-copy genes).
    The layout is identical to that of Original_recovery_heatmap.html, but it reflects the occupancy of the filtered sequences rather than the original ones.

Stage3 output

03-Paralog_handling

03-Paralog_handling/
├── HRS/ (optional)
├── RLWP/ (optional)
├── ParaGone/ (optional)
└── PhyloPyPruner/ (optional)
  • Different arguments for the option -PH will lead to different subdirectories in this output folder:

    • HRS/: created when -PH includes number 1 (the user applies the HRS orthology inference method)
    • RLWP/: created when -PH includes number 2 (the user applies the RLWP orthology inference method)
    • PhyloPyPruner/: created when -PH includes any of the numbers 4, 5, 6, and 7 (the MI/MO/RT/1to1 orthology inference methods) together with “a” (the default, running PhyloPyPruner), or includes the number “3” (running PhyloPyPruner directly to carry out the LS method). More details about these orthology inference methods can be found here
    • ParaGone/: created when -PH includes any of the numbers 4, 5, 6, and 7 (the MI/MO/RT/1to1 orthology inference methods) together with “b” (running ParaGone rather than PhyloPyPruner). More details about these orthology inference methods can be found here
  • For example:

    • Using -PH 12 will create HRS and RLWP directories.
    • Using -PH 1234b will create HRS, RLWP, and ParaGone directories.

03-Paralog_handling -> HRS

A directory containing original and filtered HRS sequences, including the recovery heatmap and filtering reports.

03-Paralog_handling/
└── HRS/
    ├── 01-Original_HRS_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_HRS_sequences_reports_and_heatmap
    │   ├── Original_HRS_heatmap.png
    │   └── Original_HRS_seq_lengths.tsv
    ├── 03-Filtered_HRS_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_HRS_sequences_reports_and_heatmap
        ├── Filtered_HRS_heatmap.png
        ├── Filtered_HRS_seq_lengths.tsv
        ├── Removed_HRS_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_HRS_sequences
  • <locus_name>.FNA
    Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as HRS sequences in the following).

Notes:

  • In the HybSuite pipeline, supercontigs (containing both exons and introns) are automatically retrieved for downstream analysis. HybSuite doesn’t support retrieving only introns or only exons.
  • Since downstream analysis requires DNA sequences, only DNA sequences are retrieved; protein sequences are not supported in the next stage.
02-Original_HRS_sequences_reports_and_heatmap
  • Original_HRS_heatmap.png
    A heatmap image file in PNG format, depicting the length of the original HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Original_HRS_seq_lengths.tsv
    A TSV file recording all original HRS sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
03-Filtered_HRS_sequences
04-Filtered_HRS_sequences_reports_and_heatmap
  • Filtered_HRS_heatmap.png
    A heatmap image file in PNG format, depicting the length of the filtered HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Filtered_HRS_seq_lengths.tsv
    A TSV file recording all filtered HRS sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
  • Removed_HRS_seqs_with_low_length_info.tsv
    A TSV file recording the information of the HRS sequences with low bp length/length ratio that have been filtered out from the dataset.
  • Removed_samples_with_low_locus_coverage_info.tsv
    A TSV file recording the information of the samples with low locus coverage that have been filtered out from the dataset.
  • Removed_loci_with_low_sample_coverage_info.tsv
    A TSV file recording the information of the loci with low sample coverage that have been filtered out from the dataset.
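To illustrate the kind of filtering these reports document, here is a minimal sketch of a length-ratio filter. The function name and the threshold value are illustrative, not HybSuite defaults:

```python
def filter_by_length(seq_lengths, mean_target_lengths, min_ratio=0.3):
    """Split sequences into kept and removed by length ratio.

    seq_lengths: {(sample, locus): bp length of the recovered sequence}
    mean_target_lengths: {locus: mean bp length in the target file}
    min_ratio: illustrative threshold on length / mean target length
    """
    kept, removed = {}, {}
    for (sample, locus), length in seq_lengths.items():
        ratio = length / mean_target_lengths[locus]
        (kept if ratio >= min_ratio else removed)[(sample, locus)] = ratio
    return kept, removed

seqs = {("sp1", "4757"): 900, ("sp2", "4757"): 120}
targets = {"4757": 1000}
kept, removed = filter_by_length(seqs, targets)
print(sorted(removed))  # [('sp2', '4757')] -- ratio 0.12 < 0.3
```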

03-Paralog_handling -> RLWP

A directory containing original and filtered RLWP sequences, including the recovery heatmap and filtering reports.

03-Paralog_handling/
└── RLWP/
    ├── 01-Original_RLWP_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_RLWP_sequences_reports_and_heatmap
    │   ├── Original_RLWP_heatmap.png
    │   └── Original_RLWP_seq_lengths.tsv
    ├── 03-Filtered_RLWP_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_RLWP_sequences_reports_and_heatmap
        ├── Filtered_RLWP_heatmap.png
        ├── Filtered_RLWP_seq_lengths.tsv
        ├── Removed_RLWP_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_RLWP_sequences
  • <locus_name>.FNA
    Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as RLWP sequences in the following).

Notes:

  • In the HybSuite pipeline, supercontigs (containing both exons and introns) are automatically retrieved for downstream analysis. HybSuite doesn’t support retrieving only introns or only exons.
  • Since downstream analysis requires DNA sequences, only DNA sequences are retrieved; protein sequences are not supported in the next stage.
02-Original_RLWP_sequences_reports_and_heatmap
  • Original_RLWP_heatmap.png
    A heatmap image file in PNG format, depicting the length of the original RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Original_RLWP_seq_lengths.tsv
    A TSV file recording all original RLWP sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
03-Filtered_RLWP_sequences
04-Filtered_RLWP_sequences_reports_and_heatmap
  • Filtered_RLWP_heatmap.png
    A heatmap image file in PNG format, depicting the length of the filtered RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Filtered_RLWP_seq_lengths.tsv
    A TSV file recording all filtered RLWP sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
  • Removed_RLWP_seqs_with_low_length_info.tsv
    A TSV file recording the information of the RLWP sequences with low bp length/length ratio that have been filtered out from the dataset.
  • Removed_samples_with_low_locus_coverage_info.tsv
    A TSV file recording the information of the samples with low locus coverage that have been filtered out from the dataset.
  • Removed_loci_with_low_sample_coverage_info.tsv
    A TSV file recording the information of the loci with low sample coverage that have been filtered out from the dataset.

03-Paralog_handling -> ParaGone

03-Paralog_handling/
└── ParaGone/
    ├── 00_logs_and_reports
    ...
    ├── 28_RT_final_alignments_trimmed
    └── HybSuite_1to1_final_alignments
  • From 00_logs_and_reports to 28_RT_final_alignments_trimmed: More details about these output folders can be found on this wiki page of ParaGone.
  • If the user sets the -paragone_keep_files option in HybSuite to TRUE, the intermediate folders from 01_input_paralog_fasta to 22_RT_stripped_names are kept; if it is set to FALSE, these intermediate folders are removed.
  • HybSuite_1to1_final_alignments: A directory containing orthology group alignments produced via the 1to1 algorithm, which were retrieved from results produced by ParaGone.

03-Paralog_handling -> PhyloPyPruner

03-Paralog_handling/
└── PhyloPyPruner/
    ├── Input
    ├── Output_LS
    ├── Output_MI
    ├── Output_MO
    ├── Output_RT
    └── Output_1to1
  • Input
    A directory containing trimmed alignments of each locus and their gene trees (input files for running PhyloPyPruner).
  • <locus_name>_paralogs_all.aln.trimmed.fasta
    The trimmed alignment of locus <locus_name> from 02-All_paralogs/03-Filtered_paralogs/<locus_name>_paralogs_all.fasta, generated by MAFFT and TrimAl.
  • <locus_name>_paralogs_all.aln.trimmed.fasta.tre
    The gene tree of locus <locus_name>, constructed by FastTree.
  • Output_<PH>
    A directory containing PhyloPyPruner output files for the <PH> algorithm (<PH> includes LS, MI, MO, RT, 1to1; more details can be found here).

04-Alignments

A directory containing alignments produced by the different paralog-handling methods specified by the user. These alignments are subsequently trimmed and filtered in stage 3.

04-Alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.fasta
  • <PH>/<ortholog_group_name>.*.aln.fasta
    The alignments inferred via the <PH> paralog-handling method and aligned using MAFFT.

NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.


05-Trimmed_alignments

A directory containing trimmed alignments inferred via different <PH> paralog-handling methods.

05-Trimmed_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
  • <PH>/<ortholog_group_name>.*.aln.trimmed.fasta
    The alignments which are inferred via the <PH> paralog-handling method, aligned using MAFFT, and trimmed via TrimAl or cleaned via HMMCleaner.

NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.


06-Final_alignments

A directory containing final <PH> orthogroup alignments ready for downstream species tree inference.

06-Final_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
  • <PH>/<ortholog_group_name>.*.aln.trimmed.fasta Final <PH> orthogroup alignments for downstream species tree inference (stage4).

Stage4 output

07-Concatenated_analysis

A directory containing concatenated analysis results.

07-Concatenated_analysis/
└── <PH>/
    ├── 01-Supermatrix
    │   ├── partition.txt
    │   └── <prefix>_<PH>.fasta
    └── 02-Species_tree
        ├── IQ-TREE
        │   └── IQ-TREE*
        ├── RAxML
        │   └── RAxML*
        ├── RAxML-NG
        │   └── RAxML-NG*
        ├── <prefix>_<PH>_ModelTest_NG.txt.tree
        ├── <prefix>_<PH>_ModelTest_NG.txt.log
        ├── <prefix>_<PH>_ModelTest_NG.txt.out
        └── <prefix>_<PH>_ModelTest_NG.txt.ckp

07-Concatenated_analysis -> <PH> -> 01-Supermatrix

A directory containing the supermatrix concatenated from <PH> orthogroup alignments and the partition file.

  • <prefix>_<PH>.fasta
    The concatenated supermatrix file for orthology groups inferred by the <PH> method.
  • partition.txt
    The partition file for concatenation.
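To illustrate what these two files contain, here is a minimal sketch of supermatrix concatenation with a RAxML-style partition file. HybSuite performs this step internally; the data and the function name are illustrative:

```python
def concatenate(alignments):
    """alignments: {locus: {sample: aligned_seq}} -> (supermatrix, partitions).

    Samples missing a locus are padded with gap characters, and each locus
    contributes one "DNA, <locus> = start-end" partition line.
    """
    samples = sorted({s for aln in alignments.values() for s in aln})
    supermatrix = {s: "" for s in samples}
    partitions, start = [], 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        length = len(next(iter(aln.values())))
        for s in samples:
            supermatrix[s] += aln.get(s, "-" * length)  # pad missing samples
        partitions.append(f"DNA, {locus} = {start}-{start + length - 1}")
        start += length
    return supermatrix, partitions

aln = {"4757": {"sp1": "ATGC", "sp2": "ATGG"},
       "5064": {"sp1": "GG--TT"}}
matrix, parts = concatenate(aln)
print(matrix["sp2"])  # ATGG------
print(parts)          # ['DNA, 4757 = 1-4', 'DNA, 5064 = 5-10']
```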

07-Concatenated_analysis -> <PH> -> 02-Species_tree

IQ-TREE/

A directory containing IQ-TREE results and final rooted trees (created only when IQ-TREE is applied by setting -sp_tree 1).

  • IQ-TREE_<prefix>_<PH>.*
    IQ-TREE intermediate output files.
  • IQ-TREE_<prefix>_<PH>.treefile
    The tree file with branch lengths and bootstrap values, generated by IQ-TREE.
  • IQ-TREE_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from IQ-TREE results.
RAxML/

A directory containing RAxML results and final rooted trees (created only when RAxML is applied by setting -sp_tree 2).

  • RAxML_*.<prefix>_<PH>.*
    RAxML intermediate output files.
  • RAxML_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from RAxML results.
RAxML-NG/

A directory containing RAxML-NG results and final rooted trees (created only when RAxML-NG is applied by setting -sp_tree 3).

  • RAxML-NG_<prefix>_<PH>.raxml.*
    RAxML-NG intermediate output files.
  • RAxML-NG_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from RAxML-NG results.

07-Concatenated_analysis -> <PH> -> 02-Species_tree -> <prefix>_<PH>_ModelTest_NG.txt.*

The output files generated by ModelTest-NG.


08-Coalescent_analysis

A directory containing coalescent analysis results.

08-Coalescent_analysis/
├── <PH>
│   ├── 01-Gene_trees
│   ├── 02-Combined_gene_trees
│   ├── 03-Species_tree
│   ├── 04-Rerooted_gene_trees
│   └── 05-PhyParts_PieCharts
└── ASTRAL-Pro
    ├── 01-Gene_trees
    ├── 02-Combined_gene_trees
    ├── 03-Species_tree
    └── 04-Rerooted_gene_trees

08-Coalescent_analysis -> <PH>

A directory containing coalescent-based phylogenetic tree results for a specific dataset generated by the <PH> paralog-handling method.

08-Coalescent_analysis -> <PH> -> 01-Gene_trees

A directory containing gene trees inferred from final <PH> alignments.

  • <ortholog_group_name>.tre: The gene tree for locus/orthogroup <ortholog_group_name>.

NOTE: <ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, ortholog group names 4757_1 and 4757_2 are inferred from locus 4757.

08-Coalescent_analysis -> <PH> -> 02-Combined_gene_trees

A directory containing combined gene trees generated from <PH> alignments.

  • Combined_gene_trees.tre: File containing all gene trees combined into a single file.
  • Combined_gene_trees.tre.collapsed: File containing all gene trees with low-support branches collapsed.
08-Coalescent_analysis -> <PH> -> 03-Species_tree

A directory containing species trees inferred from <PH> alignments.

ASTRAL-IV/

A directory containing the final species tree for the <PH> dataset, generated by ASTRAL-IV.

  • ASTRAL-IV_<prefix>_<PH>.log
    The log file generated by ASTRAL-IV.
  • ASTRAL-IV_<prefix>_<PH>.tre
    The species tree inferred by ASTRAL-IV from the combined gene trees.
  • ASTRAL-IV_<prefix>_<PH>.bootstrap.tre
    The species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III, following the ASTER protocol.
  • ASTRAL-IV_<prefix>_<PH>.bootstrap.rr.tre
    The rerooted species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III.
  • ASTRAL-III_LPP.log
    The ASTRAL-III log file which documents the bootstrapping process performed by ASTRAL-III.
wASTRAL/

A directory containing the final species tree for the <PH> dataset, generated by wASTRAL.

  • wASTRAL_<prefix>_<PH>.tre
    The species tree inferred by wASTRAL from the combined gene trees.
  • wASTRAL_<prefix>_<PH>.log
    The log file generated by wASTRAL.
  • wASTRAL_<prefix>_<PH>.rr.tre
    The rerooted species tree generated by wASTRAL.
08-Coalescent_analysis -> <PH> -> 04-Rerooted_gene_trees

A directory containing rerooted gene trees from the <PH> dataset.

  • <ortholog_group_name>.rr.tre
    File containing the rerooted gene tree for <ortholog_group_name> alignments in the <PH> dataset, generated using Phyx or the MAD method.
08-Coalescent_analysis -> <PH> -> 05-PhyParts_PieCharts

A directory containing phylogenetic concordance analysis results using rerooted gene trees and species trees.

ASTRAL-IV/

A directory containing ASTRAL-IV species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run ASTRAL-IV by setting -sp_tree 4).

  • ASTRAL_PhyParts.*
    Files containing the PhyParts output (more details can be found here).
  • ASTRAL_PhyPartsPieCharts_<prefix>_<PH>.svg
    Visualization of concordance and conflict between gene trees and the species tree, generated by our modified_phypartspiecharts.py script.
wASTRAL/

A directory containing wASTRAL species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run wASTRAL by setting -sp_tree 5).

  • wASTRAL_<prefix>_<PH>.tre The species tree inferred by wASTRAL from rerooted <PH> gene trees.
  • wASTRAL_<prefix>_<PH>_sorted_rr.tre
    The final rerooted species tree, rerooted by Phyx and sorted by Newick_Utilities.

Comprehensive output

hybsuite_logs

A directory containing the comprehensive log file generated by HybSuite.

hybsuite_logs/
└── hybsuite_<current_time>.log   
  • hybsuite_<current_time>.log
    The log file produced when running the HybSuite pipeline (running the extension tools will not produce this logfile).

hybsuite_checklists

A directory containing checklist files, including species checklists and locus checklists.

hybsuite_checklists/
├── All_Spname_list.txt
├── My_Spname.txt
├── Outgroup.txt
├── Pre-assembled_Spname.txt
├── Public_Spname.txt
├── Public_Spname_SRR.txt
├── Recovered_locus_num_for_samples.tsv
├── Recovered_sample_num_for_loci.tsv
└── Ref_gene_name_list.txt
  • All_Spname_list.txt
    A file containing all sample names from your research.
  • My_Spname.txt
    A file containing all sample names for user-provided raw data in your research.
  • Outgroup.txt
    A file containing all outgroup taxa specified by the user.
  • Pre-assembled_Spname.txt
    A file containing the names of all pre-assembled samples specified by the user.
  • Public_Spname.txt
    A file containing all sample names whose Next Generation Sequencing (NGS) raw data was downloaded from NCBI.
  • Public_Spname_SRR.txt
    A file containing all Sequence Read Archive (SRA) IDs used to download NGS raw data from NCBI. These SRA IDs correspond to the sample names listed in Public_Spname.txt.
  • Recovered_locus_num_for_samples.tsv
    A file recording the number of loci recovered by HybPiper for each sample.
  • Recovered_sample_num_for_loci.tsv
    A file recording the number of samples in which each locus was recovered by HybPiper.
  • Ref_gene_name_list.txt
    A file containing the names of all genes in the target sequences (specified by the -t option).

hybsuite_reports

A directory containing comprehensive statistical summaries of the results generated by the pipeline.

hybsuite_reports/
├── Alignments_stats
│   ├── <PH>-01_Alignments_stats_AMAS.tsv
│   ├── <PH>-02_Trimmed_alignments_stats_AMAS.tsv
│   ├── <PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
│   ├── <PH>-04_Removed_alignments_with_length_less_than_4.txt
│   ├── <PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
│   ├── <PH>-06_Final_alignments_list.txt
│   └── <PH>-07_Final_alignments_stats_AMAS.tsv
└── Supermatrix_stats
    └── <PH>-Supermatrix_stats_AMAS.tsv

hybsuite_reports -> Alignments_stats

  • <PH>-01_Alignments_stats_AMAS.tsv
    Summary table of orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).

  • <PH>-02_Trimmed_alignments_stats_AMAS.tsv
    Summary table of trimmed orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).

  • <PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
    List of alignments without parsimony-informative sites. These alignments are excluded from downstream species tree inference.

  • <PH>-04_Removed_alignments_with_length_less_than_4.txt
    List of alignments shorter than 4 bp. These alignments are excluded from downstream species tree inference.

  • <PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
    List of alignments with fewer than 5 samples. These alignments are excluded from downstream species tree inference.

  • <PH>-06_Final_alignments_list.txt
    List of final <PH> alignments selected for downstream species tree inference.

  • <PH>-07_Final_alignments_stats_AMAS.tsv
    Summary table of final <PH> alignments for downstream species tree inference (generated by AMAS.py).

Filtering process: alignments without parsimony-informative sites, with very short length, or with too few samples are removed.
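These criteria can be sketched as follows; the thresholds follow the file names above, and the parsimony-informative test uses the standard definition (at least two states each present in at least two sequences):

```python
def is_informative_site(column):
    """A site is parsimony-informative if at least two states
    each occur in at least two sequences (gaps/missing ignored)."""
    counts = {}
    for base in column:
        if base not in "-?":
            counts[base] = counts.get(base, 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2

def keep_alignment(seqs, min_len=4, min_samples=5):
    """seqs: {sample: aligned_seq}; True if the alignment passes all filters."""
    length = len(next(iter(seqs.values())))
    if length < min_len or len(seqs) < min_samples:
        return False
    # require at least one parsimony-informative site
    return any(is_informative_site([s[i] for s in seqs.values()])
               for i in range(length))

aln = {"sp1": "AAAA", "sp2": "AAAA", "sp3": "AATA",
       "sp4": "AATA", "sp5": "AACA"}
print(keep_alignment(aln))  # True (site 3 is parsimony-informative)
```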

hybsuite_reports -> Supermatrix_stats

  • <PH>-Supermatrix_stats_AMAS.tsv
    Summary table of final <PH> supermatrix for downstream concatenation-based species tree inference (generated by AMAS.py).