
Introduction

This page offers a detailed introduction to HybSuite. Feel free to explore!


🧬 Pipeline overview

HybSuite performs end-to-end hybrid capture (Hyb-Seq) phylogenomic analysis from raw reads (Hyb-Seq preferred; compatible with RNA-seq, WGS, and genome skimming data) to phylogenetic trees.

The full pipeline is composed of 4 stages:

HybSuite workflow

  • Stage 1: NGS dataset construction

    • (1) Optionally download public raw reads from NCBI (via SRA Toolkit);
    • (2) Integrate user-provided raw reads (if provided);
    • (3) Trim raw reads (via Trimmomatic);

  • Stage 2: Data assembly and paralog retrieval

    • (1) Assemble target loci and retrieve putative paralogs (via HybPiper);
    • (2) Integrate pre-assembled sequences (if provided);
    • (3) Filter putative paralogs;
    • (4) Plot recovery and paralog heatmaps for the original and filtered sequences;

  • Stage 3: Paralog handling

    • Optionally execute seven paralog-handling methods (HRS, RLWP, LS, MO, MI, RT, 1to1; see our Tutorial) and generate filtered alignments for downstream analysis:
      • HRS:
        (1) Retrieve sequences via the command hybpiper retrieve_sequences in HybPiper;
        (2) Integrate pre-assembled sequences (if provided);
        (3) Filter sequences by length to remove potentially mis-assembled sequences;
        (4) Align sequences (via MAFFT) and trim alignments (via trimAl or HMMCleaner);
        (5) Filter trimmed alignments to generate final alignments.
      • RLWP:
        (1) Retrieve sequences via hybpiper retrieve_sequences in HybPiper;
        (2) Integrate pre-assembled sequences (if provided);
        (3) Filter sequences by length to remove potentially mis-assembled sequences;
        (4) Remove loci with putative paralogs in more than a specified number of samples;
        (5) Align sequences (via MAFFT) and trim alignments (via trimAl or HMMCleaner);
        (6) Filter trimmed alignments to generate final alignments.
      • PhyloPypruner pipeline (LS, MI, MO, RT, 1to1):
        (1) Align (via MAFFT) and trim (via trimAl or HMMCleaner) all putative paralogs;
        (2) Infer gene trees for all putative paralogs;
        (3) Obtain orthogroup alignments using tree-based orthology inference algorithms (via PhyloPypruner);
        (4) Realign (via MAFFT) and trim (via trimAl or HMMCleaner) the orthogroup alignments;
        (5) Filter trimmed orthogroup alignments to generate final alignments.
      • ParaGone pipeline (MI, MO, RT, 1to1):
        (1) Use the directory containing all putative paralogs generated in Stage 2 as input;
        (2) Obtain orthogroup alignments using tree-based orthology inference algorithms via ParaGone;
        (3) Filter trimmed orthogroup alignments to generate final alignments.

  • Stage 4: Species tree inference
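For orientation, the stages above chain standard tools. A rough manual equivalent for one sample via the HRS route might look like the following (an illustrative sketch only: the accession, file names, and trimming thresholds are placeholders, and the exact commands HybSuite runs are archived automatically per run):

```shell
# --- Stage 1: fetch and trim reads (placeholder accession) ---
prefetch SRR12569928
fasterq-dump SRR12569928 --split-files -O raw_reads/
gzip raw_reads/SRR12569928_1.fastq raw_reads/SRR12569928_2.fastq
trimmomatic PE \
  raw_reads/SRR12569928_1.fastq.gz raw_reads/SRR12569928_2.fastq.gz \
  trimmed/sample_1P.fastq.gz trimmed/sample_1U.fastq.gz \
  trimmed/sample_2P.fastq.gz trimmed/sample_2U.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:36

# --- Stage 2: assemble target loci with HybPiper ---
hybpiper assemble -t_dna target.fasta \
  -r trimmed/sample_1P.fastq.gz trimmed/sample_2P.fastq.gz \
  --prefix Sample1

# --- Stage 3 (HRS): retrieve, align, and trim each locus ---
hybpiper retrieve_sequences dna -t_dna target.fasta --sample_names namelist.txt
for fna in *.FNA; do
  locus=$(basename "$fna" .FNA)          # e.g. 4471.FNA -> 4471
  mafft --auto "$fna" > "${locus}.aln.fasta"
  trimal -in "${locus}.aln.fasta" -out "${locus}.trimmed.fasta" -automated1
done
```

HybSuite wraps all of this (plus filtering, paralog handling, and tree inference) behind a single command, as shown in the example dataset section.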


✨ Features

🔄 Transparent: Full workflow visibility with real-time progress logging at each step
📝 Reproducible: Automatically archives exact software commands & parameters for every run
🧩 Modular: Execute individual stages or the complete pipeline in one command
⚡ Flexible: 7 paralog handling methods & 5+ species tree inference options
🚀 Scalable: Built-in parallelization for large-scale phylogenomic datasets


πŸ† Advantages

1. End-to-end pipeline from reads to trees

  • Processes data from raw reads to phylogenetic trees with single-command workflows
  • Supports both full pipeline execution and modular stage-specific operations
  • Minimizes manual intervention while maintaining flexibility

2. Unique functionality of integrating pre-assembled sequences

  • Allows integrating pre-assembled locus sequences into the working dataset (see the tutorial for details).

3. Customizable sequences filtering strategies

  • Dual filtering strategies for both loci and samples
  • Configurable thresholds for read depth, missing data, and sequence quality
  • Enables dataset optimization for different study goals

4. Advanced paralog-handling methods

  • Implements 7 distinct methods for paralog detection and processing
  • Includes both similarity-based and topology-based approaches
  • Improves orthology assessment accuracy

5. Multi-method phylogenetic tree inference

6. Integrated visualization tools

  • plot_paralog_heatmap.py (see usage guide below);
  • plot_recovery_heatmap.py (see usage guide below);
  • modified_phypartspiecharts.py (see its usage guide).

7. High-Performance Computing

  • Parallel processing across samples and loci (option -process), which can significantly improve computational efficiency.

1 - Changelog

1.1.7 January, 2026

  • New function: Steps control in stage 4
    • Added support for controlling individual steps within Stage 4, allowing users to selectively run specific steps (e.g., gene tree inference, alignment trimming, species tree inference) rather than executing the entire stage in one go. See the documentation for details.

1.1.6 September, 2025

  • New dependency: Plotly
    Plotly has been integrated into the new script plot_recovery_heatmap_v2.py to generate an interactive HTML heatmap visualizing target locus recovery in Stage 2. This heatmap provides useful guidance for parameter selection in Stages 3-4.

  • TreeShrink integrated into Stage 3
    TreeShrink has been incorporated into Stage 3 as an optional processing step. Users can enable it by setting the option -run_treeshrink to TRUE. TreeShrink removes genes with excessively long branches and is available for all seven paralog-handling pipelines except ParaGone, as TreeShrink is already implemented within the ParaGone pipeline.

1.1.5 September, 2025 - MAJOR UPDATE!

  • Pipeline restructuring

    • Stage consolidation: Combined previous stages 3 and 4, simplifying the pipeline from 5 to 4 stages.
    • Stagewise execution: Added flexible stage-by-stage execution capability.
  • Enhanced functionality
    Gene tree inference:

    Alignment trimming:

    • New alternative: Integrated HMMCleaner
    • Maintained trimAl as the default setting.

    Species tree inference:

    • Added ASTRAL-pro3 for multi-copy gene aware coalescent analysis.

1.1.3-1.1.4 August, 2025

Fixed some bugs in stage control. These versions have been abandoned.

1.1.2 August, 2025

Integrated ASTRAL-IV into the pipeline stage 4.

Usage Update:

New dependency:

1.1.1 August, 2025

Fixed some common bugs.

2 - Example dataset

This page provides detailed instructions on how to run the example dataset included with HybSuite.


1. Download the example dataset

If you have downloaded the HybSuite source package, a directory named example_dataset is already included. In this case, no additional download is required.

Alternatively, you can download the repository on your server using:

git clone https://github.com/Yuxuanliu-HZAU/HybSuite
cd HybSuite/example_dataset

2. Configure inputs

The directory example_dataset contains two folders, Angiosperms353 and Arabidopsis100, each containing all inputs needed to run the HybSuite pipeline on the corresponding example dataset from our analyses.

Example dataset 1: Angiosperms353

Angiosperms353/
├── Input_list.txt
├── Target_file_Angiosperms353.fasta
└── Input_sequences/
    ├── Elaeagnus_pungens.fasta
    └── Hippophae_rhamnoides.fasta

Input_list.txt

This file documents taxon names (first column) and their corresponding sequence sources (second column), separated by tabs:

Elaeagnus_angustifolia	SRR12569928
Elaeagnus_bambusetorum	SRR27547630
Elaeagnus_henryi	SRR15533155
Elaeagnus_macrophylla	SRR23618743
Elaeagnus_mollis	SRR30566771
Hippophae_neurocarpa	SRR17549374
Hippophae_salicifolia	ERR7621632
Hippophae_tibetana	SRR17549370
Shepherdia_argentea	ERR7621633
Barbeya_oleoides	SRR16214280	Outgroup
Elaeagnus_oldhamii	A
Elaeagnus_pungens	B
Hippophae_rhamnoides	B
  • Identifiers prefixed with SRR or ERR: public raw NGS data for the corresponding samples (first column), to be downloaded by the HybSuite pipeline.
  • Identifier A: user-provided raw NGS data for the corresponding samples (first column), to be supplied as input to the HybSuite pipeline.
  • Identifier B: user-provided pre-assembled sequences for the corresponding samples (first column), to be supplied as input to the HybSuite pipeline.
  • Identifier Outgroup: specifies the outgroup taxon.
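As a quick sanity check before a run, the source types in such a list can be tallied with awk (a sketch; it assumes the tab-separated layout shown above):

```shell
# Tally source types in Input_list.txt (columns are tab-separated)
awk -F'\t' '
$2 ~ /^(SRR|ERR)/ { public++ }   # public accessions to download
$2 == "A"         { raw++ }      # user-provided raw reads
$2 == "B"         { asm++ }      # user-provided pre-assembled sequences
$3 == "Outgroup"  { out++ }      # outgroup taxa
END { printf "public=%d user_raw=%d pre_assembled=%d outgroups=%d\n", public, raw, asm, out }
' Input_list.txt
# For the Angiosperms353 list above: public=10 user_raw=1 pre_assembled=2 outgroups=1
```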

Input_sequences

This directory should contain either user-provided raw reads, pre-assembled sequences, or both, according to the information provided in Input_list.txt.

  • type1: user-provided raw reads
    In our analysis, only the data for the species Elaeagnus oldhamii are user-provided raw reads, which need to be downloaded separately prior to running the HybSuite pipeline. After downloading the raw data, convert them to FASTQ.GZ format and move them to this directory. The two paired-end files should be named:
Elaeagnus_oldhamii_1.fastq.gz
Elaeagnus_oldhamii_2.fastq.gz
  • type2: pre-assembled sequences
    Two taxa with pre-assembled sequences are provided: Elaeagnus_pungens and Hippophae_rhamnoides (the taxa marked with identifier B in the sample list file Input_list.txt). Their FASTA files are named Elaeagnus_pungens.fasta and Hippophae_rhamnoides.fasta, respectively (<taxon>.fasta).

Target_file_Angiosperms353.fasta

This file is the target sequence file for Angiosperms353.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the line:

>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
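In shell terms, the gene name can be recovered from such a header with parameter expansion, for example:

```shell
# The gene name is whatever follows the final hyphen in the header
header=">Elaeagnus-pungens-4471"
gene="${header##*-}"   # strip the longest prefix ending in "-"
echo "$gene"           # prints: 4471
```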

Example dataset 2: Arabidopsis100

Arabidopsis100/
├── Input_list.txt
└── Target_file_Arabidopsis_thaliana100.fasta

Input_list.txt

This file documents taxon names (first column) and their corresponding sequence sources (second column), separated by tabs:

Elaeagnus_angustifolia	SRR26705271
Elaeagnus_bambusetorum	SRR26757993
Elaeagnus_henryi	SRR26705270
Elaeagnus_macrophylla	SRR26753865
Elaeagnus_mollis	SRR26758012
Elaeagnus_oldhamii	SRR26705501
Elaeagnus_pungens	SRR26705285
Hippophae_neurocarpa	SRR26705287
Hippophae_rhamnoides	SRR26756417
Hippophae_salicifolia	SRR26705274
Hippophae_tibetana	SRR26704952
Shepherdia_argentea	SRR26756705
Barbeya_oleoides	SRR26756183	Outgroup

Target_file_Arabidopsis_thaliana100.fasta

This file is the target sequence file for Arabidopsis100.
The gene name for a sequence should be placed immediately after the final hyphen (-) in the line:

>Locus-1
MAFRRVLTTVILFCYLLISSQSIEFKNSQKPHKIQGPIKTIVVVVMENRSFDHILGWLKSTRPEIDGLTGKESNPLNVSDPNSKKIFVSDDAVFVDMDPGHSFQAIREQIFGSNDTSGDPKMNGFAQQSESMEPGMAKNVMSGFKPEVLPVYTELANEFGVFDRWFASVPTSTQPNRFYVHSATSHGCSSNVKKDLVKGFPQKTIFDSLDENGLSFGIYYQNIPATFFFKSLRRLKHLVKFHSYALKFKLDAKLGKLPNYSVVEQRYFDIDLFPANDDHPSHDVAAGQRFVKEVYETLRSSPQWKEMALLITYDEHGGFYDHVPTPVKGVPNPDGIIGPDPFYFGFDRLGVRVPTFLISPWIEKGTVIHEPEGPTPHSQFEHSSIPATVKKLFNLKSHFLTKRDAWAGTFEKYFRIRDSPRQDCPEKLPEVKLSLRPWGAKEDSKLSEFQVELIQLASQLVGDHLLNSYPDIGKNMTVSEGNKYAEDAVQKFLEAGMAALEAGADENTIVTMRPSLTTRTSPSEGTNKYIGSY*
>Locus-2
MSDQQLETEINFWGETSEEDYFNLKGIIGSKSFFTSPRGLNLFTRSWLPSSSSPPRGLIFMVHGYGNDVSWTFQSTPIFLAQMGFACFALDIEGHGRSDGVRAYVPSVDLVVDDIISFFNSIKQNPKFQGLPRFLFGESMGGAICLLIQFADPLGFDGAVLVAPMCKISDKVRPKWPVDQFLIMISRFLPTWAIVPTEDLLEKSIKVEEKKPIAKRNPMRYNEKPRLGTVMELLRVTDYLGKKLKDVSIPFIIVHGSADAVTDPEVSRELYEHAKSKDKTLKIYDGMMHSMLFGEPDDNIEIVRKDIVSWLNDRCGGDKTKTQV*
>Locus-3
MSSRENPSGICKSIPKLISSFVDTFVDYSVSGIFLPQDPSSQNEILQTRFEKPERLVAIGDLHGDLEKSREAFKIAGLIDSSDRWTGGSTMVVQVGDVLDRGGEELKILYFLEKLKREAERAGGKILTMNGNHEIMNIEGDFRYVTKKGLEEFQIWADWYCLGNKMKTLCSGLDKPKDPYEGIPMSFPRMRADCFEGIRARIAALRPDGPIAKRFLTKNQTVAVVGDSVFVHGGLLAEHIEYGLERINEEVRGWINGFKGGRYAPAYCRGGNSVVWLRKFSEEMAHKCDCAALEHALSTIPGVKRMIMGHTIQDAGINGVCNDKAIRIDVGMSKGCADGLPEVLEIRRDSGVRIVTSNPLYKENLYSHVAPDSKTGLGLLVPVPKQVEVKA*

3. Run the pipeline

First of all, change your working directory to the downloaded example dataset directory:

cd <the path to the directory of "example_dataset">

Next, create output directories (or specify an existing directory when running HybSuite):

mkdir -p ./Angiosperms353/Output ./Arabidopsis100/Output

After setting the right working directory, run the following commands for the two example datasets:

Angiosperms353

hybsuite full_pipeline \
-input_list ./Angiosperms353/Input_list.txt \
-input_data ./Angiosperms353/Input_sequences \
-output_dir ./Angiosperms353/Output \
-nt 5 \
-process 5 \
-t ./Angiosperms353/Target_file_Angiosperms353.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14

Arabidopsis100

hybsuite full_pipeline \
-input_list ./Arabidopsis100/Input_list.txt \
-output_dir ./Arabidopsis100/Output \
-nt 5 \
-process 5 \
-t ./Arabidopsis100/Target_file_Arabidopsis_thaliana100.fasta \
-seqs_min_length 100 \
-seqs_min_sample_coverage 0.1 \
-PH 1234567 \
-sp_tree 14

3 - Extension tools

Apart from the main pipeline, we also offer some extension tools for results visualization and statistical analysis. This page tells you how to use them!


1. plot_paralog_heatmap.py

(1) Overview

Paralogs are homologous genes that arise from gene duplication within the same species. plot_paralog_heatmap.py is a Python script for analyzing and visualizing paralog distribution patterns across samples and loci. As part of the HybSuite toolkit, it processes an unaligned FASTA file for each locus to:

  • Count paralogous sequences for each sample at each locus and generate a TSV format data table recording the counts.
  • Generate heatmaps to visualize paralog distribution patterns, with auto-adjusted dimensions based on sample and locus counts.
  • Support multi-threading for improved efficiency.

(2) Dependencies

  • If you've already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install pandas seaborn matplotlib numpy
  • Key differences from HybPiper's paralog heatmap plotting function paralog_retriever:

    • (1) Input format

      • hybpiper paralog_retriever: Requires the sample folders generated by hybpiper assemble.
      • This script: Accepts a folder containing FASTA files, making it applicable to a wider range of datasets (e.g., those including pre-assembled sequences).
    • (2) Visualization features

      • Customizable heatmap colors.
      • Option to display the paralog count for each sample–locus cell directly on the heatmap.
      • Generates higher-quality, more publication-ready figures.

(3) Input file requirements

This script processes the following input files:

  • (1) Input Directory (required, specified by -i/--input_dir):
    A directory containing multiple FASTA files; each FASTA file represents a locus and contains sequences from multiple samples. Files should be named <locus_name>.fasta, <locus_name>_paralogs.fasta, or <locus_name>_paralogs_all.fasta.

For example:

>Sample1
ATGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sample2
ATGCTAGCTATCGATCGATCGATCGATCGA
>Sample3
ATGCTCGATCGATCGATCGATCGATCGATC

NOTE:
If a sample has only a single sequence at a locus, that sequence is treated as the ortholog. If a sample has multiple sequences at a locus, those sequences are putative paralogs.
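This per-sample tally is what the script computes for each locus file; it can be reproduced for a single file with standard Unix tools (a sketch; locus.fasta is a placeholder name, and the sed step also strips HybPiper-style .main/.0 suffixes and description fields):

```shell
# Count sequences per sample; 1 = putative ortholog, >=2 = putative paralogs
grep '^>' locus.fasta | sed 's/^>//; s/[. ].*//' | sort | uniq -c
```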

The paralog sequence FASTA files retrieved by hybpiper paralog_retriever can be used directly as input files for this script. For example, the 5942_paralogs_all.fasta in our dataset:

>Elaeagnus_angustifolia.main NODE_1_length_1285_cov_267.250828,Elaeagnus-pungens-Elaeagnus_pungens,0,201,98.51,(1),312,915
ACCTTCCTTGACCTCAAGACCGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATCATTACAGACCTGCAGAACAAACGCCGTGGCACCGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGCGGAAAGAAGCCCTACCCACAAAAGAAAACGGGTAGGGCTCGACAGGGGTCCAAGAGAACTCCACTCCGGCCAGGTGGAGGAGTCGTCTTTGGGCCTAAGCCCAGAGATTGGAGCATCAAGATCAATAGAAAGGAGAAAAGGTTGGCGATTTCGACAGCAATGTCTAGTGCAGCTGCGAATACGATCGTGGTGGAGGATTTTTGGGACAATATGGATAAACCCAGGACGAAGGATTTTATAGCTGCTATGAAGAGGTGGGGTTTAAATCCACCGGGAGAGAAAGCTATGTTTATGATGGACGAAATTTCGGATAACGTGAGGCTTTCAAGTAGAAATATTCCGAAAGTGAAGGTTTTGACCCCGAGGACTTTGAATTTGTTTGATATTTTAAATGCGGATAAGTTGGTGCTTACCCCTGCTGCTGTGGATTACTTGAATGGACGTTATGGTGTTAATTATGAGGGTGAGAGT
>Elaeagnus_angustifolia.0 NODE_2_length_1266_cov_276.023549,Elaeagnus-pungens-Elaeagnus_pungens,2,201,89.95,(-1),301,898
CTTGATCTCAAAACAGCACCACCCGAAACTGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCGTGGGACTGCCTCAACCCTAACCCGTGGTGAGGTTAGAGGTGGTGGGAAAAAACCTTACCCACAAAAGAAAACGGGTCGGGCCCGACAAGGGTCCAAGAGAACTCCACTCCGTCCCGGTGGAGGTGTCGTTTTTGGTCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTGGTGGAGGATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGAGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCGGAAGAGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCTAATTATGAGGGGGAGAAT
>Elaeagnus_macrophylla.main NODE_2_length_1269_cov_183.823826,Elaeagnus-pungens-Elaeagnus_pungens,0,201,88.06,(1),297,900
ACATACCTCGATCTCAAAACAGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCCTGGGACTGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGTGGAAAGAAACCTTACCCACAAAAGAAAACGGGTCGCGCTCGACAAGGGTCAAAAAGAACTCCACTCCGTCCAGGTGGAGGTGTCGTTTTTGGGCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTAGTGGAGCATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGGGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCTGAAGTGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCAAATTATGAGGGTGAGAAT
...
  • (2) Species List File (optional, specified by --species_list)
    If you want to analyze only specific species, you can provide a species list file:
Sample1
Sample2
Sample3

(4) Basic usage

python plot_paralog_heatmap.py \
    -i <input_dir> \
    -opr <counts.tsv> \
    -oph <heatmap.png> \
    [options] ...
  • Required parameters in basic usage
    • -i/--input_dir
      Directory containing FASTA files with paralogous sequences (formatted as <locus_name>_paralogs.fasta)
    • At least one output option must be specified:
      • -opr, --output_paralog_report
        Generate a TSV file containing paralog counts
      • -oph, --output_paralog_heatmap
        Generate a heatmap visualization (format determined by file extension)

(5) Full parameters:

General options:
  -t THREADS, --threads THREADS
                        Number of threads to use for processing (default: 1)
  --species_list SPECIES_LIST
                        File containing list of species to include in the analysis (one species per line)
  --output_species_list OUTPUT_SPECIES_LIST
                        Output file to save the list of processed species

Heatmap customization options:  
  --dpi DPI             DPI (dots per inch) for output image (default: 300)
  --fig_length FIG_LENGTH
                        Figure length in inches (default: auto-calculated based on number of loci)
  --fig_height FIG_HEIGHT
                        Figure height in inches (default: auto-calculated based on number of species)
  --sample_font SAMPLE_FONT
                        Font size for sample labels in points (default: 10)
  --gene_font GENE_FONT
                        Font size for gene labels in points (default: 10)
  --hide_xlabels        Hide x-axis labels (locus names)
  --hide_ylabels        Hide y-axis labels (sample names)
  --no_grid             Do not show grid lines in heatmap
  --color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient (default: black)
  --show_values         Show numerical values in heatmap cells (only for values >= 2)
  --grid_color GRID_COLOR
                        Color for grid lines in heatmap (default: grey)
  --add_markers         Add visual markers in cells (dots for 1s, diagonal lines for 0s)

(6) Output examples

  • TSV Report (paralog_counts.tsv):
    (use -opr, --output_paralog_report to specify the output filename and directory.)
Species   gene1   gene2   gene3
Sample1   2       1       0
Sample2   1       1       1
Sample3   0       2       1
  • Heatmap Visualization
    The heatmap uses color intensity to represent the number of recovered sequences:
    • White: No sequences (0)
    • Light color: Single sequence (1), representing single copy orthologs.
    • Darker color: Multiple sequences (≥2), representing putative paralogs.
  • For example, the following figure is the default output for our test dataset Arabidopsis100.
    You can find it in <output_dir>/02-All_paralogs/04-Filtered_paralog_reports_and_heatmap/Filtered_paralog_heatmap.png after running HybSuite by following our guide. In the HybSuite Stage 2 pipeline, this script is applied to generate the heatmaps for the original and filtered paralogs; by default, --show_values is used to display the number of recovered sequences at each locus for each sample.

test_dataset-paralog_heatmap_default

  • When running this script manually, the recovered sequence counts won't be displayed unless you use the --show_values option:

test_dataset-paralog_heatmap_no_value

  • To clearly show the type of sequence at each locus for each sample, it is advisable to combine --add_markers with --show_values to add markers and numbers to the figure:
    • X: No sequences (0)
    • Β·: Single sequence (1), representing single copy orthologs.
    • <number>: Multiple sequences (≥2), representing putative paralogs.
python plot_paralog_heatmap.py ... --add_markers --show_values

test_dataset-paralog_heatmap_markers

  • Besides, you can use --color to switch to a different color theme:
python plot_paralog_heatmap.py ... --color red 

test_dataset-paralog_heatmap_red_color

python plot_paralog_heatmap.py ... --color blue 

test_dataset-paralog_heatmap_blue_color

NOTE: Our script provides nine color themes: black (default), red, blue, purple, green, orange, yellow, brown, and pink.

(7) Use cases

  1. Paralog Distribution Analysis: Identify which species and genes tend to have more paralogs
  2. Data Quality Assessment: Evaluate completeness of sequencing and assembly
  3. Evolutionary Analysis: Study gene duplication events across different species
  4. Data Visualization: Generate high-quality visualizations for papers or reports

(8) Tips and tricks

  • For large datasets, increase the thread count (-t parameter) to speed up processing
  • If sample names are long, use a smaller sample font size (--sample_font)
  • This script can adjust the image dimensions automatically (which might be best for visualization in many cases). You can also use --fig_length and --fig_height to manually adjust your image.
  • Use --show_values to display specific paralog counts directly on the heatmap (for counts ≥2)

2. plot_recovery_heatmap_v2.py

(1) Overview

plot_recovery_heatmap_v2.py visualizes sequence recovery across samples and loci. It generates heatmaps showing the percentage of sequence length recovered for each gene in each sample, relative to reference sequences.

It highlights:

  • Well-recovered loci across samples
  • Samples with poor overall recovery
  • Recovery patterns indicating systematic biases

Key features:

  • Calculates sequence lengths from FASTA files
  • Generates comprehensive sequence length tables in TSV format
  • Creates customizable heatmaps with multiple color schemes
  • Supports comparison against mean or maximum reference lengths
  • Offers extensive visualization options including value display and grid customization
  • Provides multi-threading support for processing large datasets
  • Supports interactive Plotly HTML output

(2) Dependencies

  • If you've already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas seaborn matplotlib numpy plotly

The script requires:

  • Python 3.6 or higher
  • Biopython (for sequence parsing)
  • NumPy and Pandas (for data manipulation)
  • Matplotlib and Seaborn (for visualization)

The script automatically checks for required packages and will provide clear error messages if any are missing.

(3) Input file requirements

Required input:

  1. Directory of FASTA files - Each file should contain sequences for a single locus across multiple samples
  • Supported file extensions: .fna, .fasta, .fa.
  • Each sequence header should start with the species/sample name (e.g., >species_name rest_of_header).
  2. Target sequence file - A single FASTA file containing reference sequences for all target loci
  • Each sequence ID should include the locus name at the end, separated by a hyphen (e.g., >ref-locusnameA).
  • The script automatically detects whether references are nucleotide or protein sequences.

(4) Optional input:

  • Species list file - A simple text file with one species name per line (-s <FILE>)
  • If not provided, species names will be automatically extracted from the FASTA files

(5) Basic usage

The basic command requires only the input directory and reference file:

python plot_recovery_heatmap_v2.py -i /path/to/fasta_files -r /path/to/reference.fasta \
--output_heatmap /path/to/recovery_heatmap.html

This will:

  1. Calculate sequence lengths for each sample and locus.
  2. Generate a seq_lengths.tsv file in the current directory (use --output_seq_lengths to change the output path).
  3. Create an interactive heatmap named recovery_heatmap.html in the directory /path/to/.

(6) Example

If you have finished running our pipeline, open the directory 02-All_paralogs/03-Filtered_paralogs, which is one of the output directories for our example dataset Angiosperms353. You will find many FASTA files in it:

4471_paralogs_all.fasta
4527_paralogs_all.fasta
4691_paralogs_all.fasta
4724_paralogs_all.fasta
...
  • Locus names: 4471, 4527, 4691, 4724
  • Filename suffix: _paralogs_all
  • File extension: .fasta

In that case:

  • use the option -i/--input_dir to specify the path to this directory;
  • use the option -r to specify the path to the Angiosperms353 target file (locus names in the target file must correspond to those in your input directory);
  • use the option --filename_suffix to specify the filename suffix _paralogs_all, so that the script can extract the locus name from each filename.
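In shell terms, the locus-name extraction performed via --filename_suffix amounts to the following (illustrative):

```shell
f="4471_paralogs_all.fasta"
base="${f%.fasta}"             # drop the file extension  -> 4471_paralogs_all
locus="${base%_paralogs_all}"  # drop the filename suffix -> 4471
echo "$locus"                  # prints: 4471
```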

Run:

cd /path/to/02-All_paralogs/03-Filtered_paralogs/
python plot_recovery_heatmap_v2.py -i . -r /path/to/Target_file_Angiosperms353.fasta --filename_suffix "_paralogs_all" -output_heatmap ./recovery_heatmap.html -gw 0

Then you can obtain an interactive heatmap HTML file ./recovery_heatmap.html:

(interactive recovery heatmap)
  • The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
  • The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.
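The "keep the longest sequence per sample" reduction described above can be sketched with awk (an illustration, not the script's actual code; it assumes the sample name precedes the first "." or space in each header, as in HybPiper-style output):

```shell
# Longest recovered sequence length per sample in one locus file
awk '
/^>/ { if (s != "") if (l > max[s]) max[s] = l
       split(substr($1, 2), a, "."); s = a[1]; l = 0; next }
     { l += length($0) }                       # accumulate wrapped lines
END  { if (s != "") if (l > max[s]) max[s] = l
       for (k in max) print k, max[k] }
' 4471_paralogs_all.fasta
```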

Now, let's explore this interactive HTML file!

  • Set "Sort by" to "Descending" to sort samples and loci on the heatmap from high to low recovery.
  • Click the "Plus" (+) and "Minus" (-) icons in the upper right corner to zoom in and out of the heatmap.
  • Click the "AutoScale" icon in the upper right corner to auto-scale the heatmap.
  • Click the "Camera" (📷) icon in the upper right corner to download the current heatmap view as a PNG file.

(7) Full parameters

usage: plot_recovery_heatmap_v2.py [-h] -i INPUT_DIR -r REFERENCE [-s SPECIES_LIST] [--filename_suffix FILENAME_SUFFIX]
                                   [--output_species_list OUTPUT_SPECIES_LIST] [--output_heatmap OUTPUT_HEATMAP]
                                   [--output_seq_lengths OUTPUT_SEQ_LENGTHS] [-t THREADS]
                                   [--color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}] [--title TITLE]
                                   [--use_max] [--xlabel XLABEL] [--ylabel YLABEL] [-gw GRID_WIDTH]

plot_recovery_heatmap_v2.py - A visualization tool in HybSuite

This script is a component of the HybSuite toolkit, designed for visualizing sequence recovery 
rates across different taxa and loci. It generates heatmaps that display the percentage of 
sequence length recovered for each gene in each taxon, relative to either the average or 
maximum length of reference sequences.

Key features:
1. Calculates sequence lengths and generates a seq_lengths.tsv file
2. Calculates the percentage length recovered relative to reference sequences
3. Generates customizable heatmaps showing recovery rates
4. Supports both average and maximum reference length comparisons
5. Offers flexible visualization options

Both the seq_lengths.tsv file and heatmap generation are optional outputs.
Part of HybSuite

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIR, --input_dir INPUT_DIR
                        Directory containing FASTA files for each locus
  -r REFERENCE, --ref REFERENCE
                        Target sequence file (FASTA format)
  -s SPECIES_LIST, --species_list SPECIES_LIST
                        File containing list of species names (one per line). If not provided, species names will be extracted from FASTA files
  --filename_suffix FILENAME_SUFFIX
                        Suffix(es) to remove from input FASTA filenames to get locus names. Multiple suffixes can be separated by commas. Example: "_paralogs_all". If not specified, the input filenames will be recognized as loci names.
  --output_species_list OUTPUT_SPECIES_LIST, -osp OUTPUT_SPECIES_LIST
                        Output file for extracted species list (when species_list is not provided)
  --output_heatmap OUTPUT_HEATMAP, -oh OUTPUT_HEATMAP
                        Output path and filename for the heatmap (default: recovery_heatmap.html). Should end with .html extension.
  --output_seq_lengths OUTPUT_SEQ_LENGTHS, -osl OUTPUT_SEQ_LENGTHS
                        Output file for sequence lengths (TSV format). If not provided, sequence lengths will be written to seq_lengths.tsv in current directory
  -t THREADS, --threads THREADS
                        Number of threads to use (default: 1)
  --color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}
                        Color scheme for the HTML heatmap (default: blue). Available options: viridis, magma, inferno, plasma, cividis, turbo, purple, blue, green, black
  --title TITLE         Main title of the heatmap (default: "Percentage length recovery for each gene")
  --use_max             Use maximum length instead of average length from reference sequences
  --xlabel XLABEL       X-axis label (default: "Locus")
  --ylabel YLABEL       Y-axis label (default: "Sample")
  -gw GRID_WIDTH, --grid_width GRID_WIDTH
                        The value of grid width of the heatmap, recommended to set as "0" when the locus number is huge (default: 0.5)

3. RLWP.py

(1) Overview

RLWP.py (Remove Loci With Paralogs) is a Python script within the HybSuite toolkit designed to filter out genetic loci with excessive paralog occurrences. Paralogs are gene copies that arise from gene duplication events and can complicate phylogenetic analyses. This tool identifies and removes loci that exceed a user-defined threshold of paralog presence across samples, helping to improve the quality of downstream analyses by maintaining only single-copy orthologous markers.
Key features:

  • Filters loci based on paralog occurrence statistics
  • Supports multi-threading for improved performance
  • Provides detailed logging and reporting
  • Offers in-place filtering or non-destructive output to a separate directory
  • Works with various FASTA file extensions (.fa, .fasta, .fna, and their uppercase variants)

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:
conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

RLWP.py requires two main types of input:

  1. Sequence file directory: A directory containing nucleotide sequence files in FASTA format (.fa, .fasta, .fna, .fas or their uppercase variants). Each file should represent one locus.
  2. Paralog statistics file: A tab-separated values (TSV) file containing paralog counts per sample for each locus.
  • 0: No sequence was recovered for this locus in the sample.
  • 1: The only recovered sequence for this locus in the sample is a single-copy orthologous sequence.
  • more than 1: Putative paralogs exist for this locus in the sample.

This file should have:

  • Sample IDs in the first column
  • Locus names as column headers
  • Values representing the number of paralogs found for each sample-locus combination

NOTE: This file can be generated by running plot_paralog_heatmap.py (with the option -oph, see here)

Example paralog statistics file format:

Sample  Locus1  Locus2  Locus3
sample1 1       2       1
sample2 1       1       3
sample3 2       1       1
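Given the example table above, the removal rule can be sketched in plain Python (a hypothetical illustration, not the script's code; following the -s/--samples_threshold description below, a locus is dropped once at least that many samples show more than one copy):

```python
# counts[sample][locus] -> number of recovered copies (the paralog statistics table above)
counts = {
    "sample1": {"Locus1": 1, "Locus2": 2, "Locus3": 1},
    "sample2": {"Locus1": 1, "Locus2": 1, "Locus3": 3},
    "sample3": {"Locus1": 2, "Locus2": 1, "Locus3": 1},
}

def loci_to_remove(counts, samples_threshold):
    loci = next(iter(counts.values())).keys()
    removed = []
    for locus in loci:
        # samples whose count exceeds 1 carry putative paralogs at this locus
        n_paralog_samples = sum(1 for s in counts if counts[s][locus] > 1)
        if n_paralog_samples >= samples_threshold:
            removed.append(locus)
    return removed

print(loci_to_remove(counts, 2))  # [] : no locus has paralogs in 2+ samples
print(loci_to_remove(counts, 1))  # ['Locus1', 'Locus2', 'Locus3']
```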

(4) Basic usage

  • Remove loci in which at least 2 samples show putative paralogs:
python RLWP.py -i input_directory -p paralog_statistics.tsv -s 2 -or deletion_report.tsv

Required parameters:

  • -i, --input_dir: Directory containing sequence files
  • -p, --paralog_heatmap: Path to paralog statistics file (TSV format)
  • -s, --samples_threshold: Minimum number of samples with paralogs to trigger locus removal
  • -or, --output_report: Path for saving the deletion report

Optional parameters:

  • -o, --output_dir: Optional directory to output filtered files (preserves originals)
  • -t, --threads: Number of threads to use for parallel processing (default: 1)

(5) Output examples

Tips and tricks

  • Choosing the Right Threshold: Start with a conservative threshold (e.g., 5% of your total samples) and adjust based on your dataset characteristics.
  • Non-destructive Workflow: Use the -o option to create a filtered copy of your data without modifying original files:
python RLWP.py -i <input_directory> -p paralog_reports.tsv -s 3 -o filtered_data
  • Performance Optimization: For large datasets, increase thread count to speed up processing


4. filter_seqs_by_length.py

(1) Overview

  • filter_seqs_by_length.py is a Python script within the HybSuite package that filters DNA sequences based on length criteria. It allows filtering sequences using absolute minimum length or relative length compared to reference sequences.
  • It is particularly useful for removing short, potentially truncated sequences before downstream analyses, helping to ensure high-quality datasets for phylogenomic analysis.
  • It processes multiple FASTA files in parallel; the reference file can contain either DNA or protein sequences.
  • It provides detailed logging and reporting of filtered sequences, making it easy to track what was removed and why.
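The length criteria combine as a logical AND, which can be sketched as a predicate (a hypothetical helper mirroring the --min_length, --mean_length_ratio, and --max_length_ratio options described below; ratio filters set to 0 are treated as disabled):

```python
def keep_sequence(seq_len, ref_mean, ref_max,
                  min_length=0, mean_length_ratio=0.0, max_length_ratio=0.0):
    """A sequence is kept only if it passes every enabled criterion."""
    if seq_len < min_length:
        return False
    if mean_length_ratio and seq_len < mean_length_ratio * ref_mean:
        return False
    if max_length_ratio and seq_len < max_length_ratio * ref_max:
        return False
    return True

# Locus with mean reference length 800 bp and max reference length 1000 bp:
print(keep_sequence(500, 800, 1000, mean_length_ratio=0.7))  # False (500 < 560)
# Matches the combined example (--min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5):
print(keep_sequence(600, 800, 1000, min_length=200,
                    mean_length_ratio=0.6, max_length_ratio=0.5))  # True
```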

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

filter_seqs_by_length.py requires two types of input:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive)
  • Each file is assumed to contain sequences from a single locus
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName)
  2. Reference sequences
  • The reference file must follow the same format as required by the HybSuite main program (check here).

(4) Basic usage

  • Filter sequences by absolute minimum length:
python filter_seqs_by_length.py -i input_directory --min_length 300
  • Filter using reference sequences according to a length ratio (relative to the mean or maximum length of each locus in the reference file):
python filter_seqs_by_length.py -i input_directory -r reference.fasta --mean_length_ratio 0.7
  • Combine multiple criteria to filter:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5
  • Save output to a different directory rather than modifying the original files:
python filter_seqs_by_length.py -i input_directory --output_dir filtered_sequences
  • Generate report of removed sequences:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --mean_length_ratio 0.7 --output_report removed_seqs.tsv

(5) Output examples

Filtered FASTA Files

Filtered sequences are written either:

  1. To the original files (overwriting them)
  2. To a new directory if --output_dir is specified

Removed Sequences Report

When using --output_report, a TSV file is created with details of removed sequences:

File	Sequence_ID	Length	Mean_Length_Ratio	Max_Length_Ratio
gene1.fasta	Sample1_gene1	125	0.435	0.391
gene2.fasta	Sample3_gene2	78	0.213	0.185

(6) Use cases

Cleaning Assembled Data

  • Remove truncated sequences resulting from poorly assembled or incompletely captured loci:
python filter_seqs_by_length.py -i captured_exons -r reference.fasta \
    --mean_length_ratio 0.3 --output_report removed_sequences.tsv

Tips and Tricks

  • Locus Identification: Ensure filenames match locus IDs in reference sequences (part after last hyphen).
  • Preserving Originals: Always use --output_dir when testing filtering parameters to avoid overwriting original files.
  • Speed Optimization: Set -t to match available CPU cores for maximum performance.
  • Multiple Filters: Combining --min_length with ratio-based filters creates more stringent filtering.
  • Protein Sequences: The script automatically detects the type of reference file (DNA/protein) and adjusts length calculations appropriately.

5. filter_seqs_by_sample_and_locus_coverage.py

(1) Overview

filter_seqs_by_sample_and_locus_coverage.py is a Python script designed to remove samples with low locus coverage and loci with low sample coverage from a phylogenomic dataset, based on user-defined thresholds.

This tool can:

  • Filter samples and loci based on minimum coverage thresholds
  • Generate reports of removed samples and loci
  • Process files in parallel to improve performance
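The two coverage measures can be sketched as follows (a hypothetical helper, not the script's code; it assumes presence/absence of each sample at each locus has already been read from the FASTA files):

```python
def coverage_tables(presence):
    """presence[locus] = set of sample IDs with a sequence at that locus.
    Returns (sample coverage per locus, locus coverage per sample)."""
    samples = sorted({s for members in presence.values() for s in members})
    n_samples, n_loci = len(samples), len(presence)
    # fraction of samples represented at each locus
    sample_cov_per_locus = {loc: len(m) / n_samples for loc, m in presence.items()}
    # fraction of loci recovered for each sample
    locus_cov_per_sample = {
        s: sum(1 for m in presence.values() if s in m) / n_loci for s in samples
    }
    return sample_cov_per_locus, locus_cov_per_sample

s_cov, l_cov = coverage_tables({"Locus1": {"A", "B"}, "Locus2": {"A"}})
print(s_cov)  # {'Locus1': 1.0, 'Locus2': 0.5}
print(l_cov)  # {'A': 1.0, 'B': 0.5}
```

Loci and samples falling below --min_sample_coverage and --min_locus_coverage, respectively, would then be dropped.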

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

The tool processes FASTA files in a directory with the following requirements:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive).
  • Each file is assumed to contain sequences from a single locus.
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName).

(4) Basic usage

python filter_seqs_by_sample_and_locus_coverage.py -i input_directory --min_sample_coverage 0.5 --min_locus_coverage 0.7

Required parameters

  • -i, --input: Directory containing FASTA files.
  • --min_sample_coverage: Minimum sample coverage ratio (0-1) for each locus (default: 0.0)
  • --min_locus_coverage: Minimum locus coverage ratio (0-1) for each sample (default: 0.0)

Optional parameters

  • -o, --output_dir: Directory for filtered sequences (if not specified, original files are modified)
  • -t, --threads: Number of threads to use (default: 1)
  • --removed_samples_info: Output TSV file for removed samples coverage information
  • --removed_loci_info: Output TSV file for removed loci coverage information

(5) Output examples

  • Example of removed_samples.tsv (specified by --removed_samples_info):
Sample  Locus_Coverage
Species1    0.45
Species2    0.32
  • Example of removed_loci.tsv (specified by --removed_loci_info):
Locus   Sample_Coverage
Locus1  0.38
Locus2  0.42

(6) Use cases

  • The following command removes loci present in fewer than 60% of samples and samples that recover fewer than 50% of loci.
python filter_seqs_by_sample_and_locus_coverage.py \
-i assembled_loci/ -o filtered_loci/ \
--min_sample_coverage 0.6 --min_locus_coverage 0.5 \
--removed_samples_info removed_samples.tsv --removed_loci_info removed_loci.tsv

(7) Tips and tricks

  • Choosing Coverage Thresholds: Start with lower thresholds (e.g., 0.3-0.5) and gradually increase until you achieve the desired balance between data completeness and taxon/locus sampling.
  • Preserving Original Files: Always use the -o option to output to a new directory when experimenting with different thresholds.
  • Removing Problematic Samples Only: Set --min_locus_coverage without setting --min_sample_coverage to filter out only low-coverage samples while keeping all loci.
  • Tracking Removed Data: Always use the --removed_samples_info and --removed_loci_info options to keep records of what was filtered out for documentation and troubleshooting.
  • Performance Optimization: Use the -t option with a value close to your CPU core count for faster processing of large datasets.

6. modified_phypartspiecharts.py

(1) Overview

(2) Basic usage

The basic usage of modified_phypartspiecharts.py is nearly the same as that of phypartspiecharts.py. The only difference is that users must use --output to specify the path and filename of the output visualization results when running modified_phypartspiecharts.py, rather than --svg_name as in phypartspiecharts.py.

python modified_phypartspiecharts.py \
species_tree phyparts_prefix num_genes ...
  • Required Parameters in basic usage
    • species_tree: Path to species tree file (Newick format)
    • phyparts_prefix: Prefix of PhyParts output files
    • num_genes: Total number of gene trees

(3) Extended functionality

Compared to the original version, modified_phypartspiecharts.py offers the following extended functionality:

a. Running Efficiency Control

  • Multithreading Support: Use -nt/--threads <NUM> for multithreaded processing, significantly improving speed for large datasets

b. Output Files Control

  • Support for SVG and PDF format outputs: use --output <output_file> and give your output file the extension .pdf or .svg.

  • Additional Statistical Output: use the --stat parameter to export detailed node statistics to TSV files.
    The detailed node statistics table generated by --stat contains the following columns:

    • Node: Node ID
    • Support(blue): Number of genes supporting the species tree
    • TopConflict(green): Number of genes with main conflict
    • OtherConflict(red): Number of genes with other conflict
    • NoSignal(gray): Number of genes with no signal
    • Support/Total_Ratio: Ratio of supporting to conflicting genes

    The file also includes the average ratios for internal nodes including Support/Total Ratio, Conflict/Total Ratio, NoSignal/Total Ratio, Support/Signal_Ratio and Conflict/Signal_Ratio.

    For example:

Node	Support(blue)	TopConflict(green)	OtherConflict(red)	NoSignal(gray)	Support/Total_Ratio	Conflict/Total_Ratio	NoSignal/Total_Ratio	Support/Signal_Ratio	Conflict/Signal_Ratio
0	85	0	0	157	0.3512	0.0000	0.6488	1.0000	0.0000
1	202	7	31	2	0.8347	0.1570	0.0083	0.8417	0.1583
2	135	41	59	7	0.5579	0.4132	0.0289	0.5745	0.4255
3	205	6	20	11	0.8471	0.1074	0.0455	0.8874	0.1126
4	123	22	91	6	0.5083	0.4669	0.0248	0.5212	0.4788
5	164	33	36	9	0.6777	0.2851	0.0372	0.7039	0.2961
6	190	18	29	5	0.7851	0.1942	0.0207	0.8017	0.1983
7	91	35	112	4	0.3760	0.6074	0.0165	0.3824	0.6176
8	129	18	86	9	0.5331	0.4298	0.0372	0.5536	0.4464

Average ratios for internal nodes only:
Support/Total Ratio: 0.6079
Conflict/Total Ratio: 0.2957
NoSignal/Total Ratio: 0.0964
Support/Signal Ratio: 0.6963
Conflict/Signal Ratio: 0.3037
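The ratio columns can be reproduced from the four count columns. As a plain-Python sketch (not the script's code), using node 1 from the example table:

```python
def node_ratios(support, top_conflict, other_conflict, no_signal):
    """Compute the per-node ratio columns of the --stat table."""
    total = support + top_conflict + other_conflict + no_signal
    signal = total - no_signal              # genes carrying any signal
    conflict = top_conflict + other_conflict
    return {
        "Support/Total_Ratio": round(support / total, 4),
        "Conflict/Total_Ratio": round(conflict / total, 4),
        "NoSignal/Total_Ratio": round(no_signal / total, 4),
        "Support/Signal_Ratio": round(support / signal, 4),
        "Conflict/Signal_Ratio": round(conflict / signal, 4),
    }

# Node 1: 202 supporting, 7 + 31 conflicting, 2 with no signal
print(node_ratios(202, 7, 31, 2))
# {'Support/Total_Ratio': 0.8347, 'Conflict/Total_Ratio': 0.157,
#  'NoSignal/Total_Ratio': 0.0083, 'Support/Signal_Ratio': 0.8417,
#  'Conflict/Signal_Ratio': 0.1583}
```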

c. Extended Visualization Functionality

  • Support for controlling whether taxonomic names use italic font: use --no_italic
  • Support for flexible number displayed on branches: use --show_num_mode <NUM>
  • Support for controlling tree branch width display: use --line_width <NUM>
  • Support for controlling pie chart size: use --pie_size <NUM>
  • Support for controlling leaf node label size: use --tip_size <NUM>
  • Support for controlling node number label size: use --number_size <NUM>
  • Support for circular, cladogram, and phylogram display types: use --tree_type <circle|cladogram|phylo>

See the GitHub wiki page for modified_phypartspiecharts for more details.

(4) Full Options

options:
  -h, --help            show this help message and exit
  --taxon_subst TAXON_SUBST
                        Comma-delimited file to translate tip names.
  --output OUTPUT       Output filename with extension (.svg or .pdf)
  --output_node_tree    Generate an additional tree file with '_nodes' suffix showing:
                        - All node identifiers in the tree
                        - No pie charts
                        - No numerical annotations
  --no_ladderize        Don't ladderize tree
  --to_csv              Export data to CSV
  --tree_type {circle,cladogram,phylo}
                        Tree visualization type (circle, cladogram, or phylo; default: cladogram)
  --line_width VT_LINE_WIDTH
                        Width of tree branches (default: 0)
  --no_italic           Display species names in normal font style (default: italic)
  --tip_size TIP_SIZE_FACTOR
                        Scale factor for tip label font size (default: 1.0)
  --number_size NUMBER_SIZE_FACTOR
                        Scale factor for gene tree count font size (default: 1.0)
  --show_num_mode SHOW_NUM_MODE
                        Control what numbers to show on branches (specify 0-2 digits):
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes (red+green/(blue+red+green))
                        9: Original node support values from the input tree
                        Example: --show_num_mode 0  (hide all numbers)
                                --show_num_mode 1  (show only support number)
                                --show_num_mode 12 (default, show support and conflict numbers)
                                --show_num_mode 47 (show support number and support/conflict ratio)
                                --show_num_mode 9  (show original node support values)
  --pie_size PIE_SIZE_FACTOR
                        Scale factor for pie chart size (default: 1.0)
  --stat STAT_OUTPUT    Output file path for node statistics (TSV format)
  -nt THREADS, --threads THREADS
                        Number of threads to use (default: 1)

Citation

If using this tool, please cite:


7. Fasta_formatter.py

(1) Overview

Fasta_formatter.py is a Python script for reformatting FASTA sequences into either interleaved (60 characters per line) or single-line format. It supports multi-threading for faster processing of large files.
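The core rewrapping logic can be sketched as follows (a hypothetical, minimal reimplementation for illustration; the real script adds file handling and multi-threading):

```python
def reformat_fasta(text, width=60):
    """Rewrap a FASTA string: width=60 gives interleaved output (--inter);
    width=None joins each record's sequence onto one line (--single)."""
    out, seq = [], []

    def flush():
        # emit the buffered sequence for the previous record, if any
        if seq:
            joined = "".join(seq)
            if width is None:
                out.append(joined)
            else:
                out.extend(joined[i:i + width] for i in range(0, len(joined), width))
            seq.clear()

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line)
        elif line.strip():
            seq.append(line.strip())
    flush()
    return "\n".join(out) + "\n"

print(reformat_fasta(">s1\nACGT\nACGT\n", width=None), end="")  # >s1 / ACGTACGT
```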

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, no extra installation is needed: the script only uses pathlib and concurrent.futures, which are part of the Python 3 standard library.

(3) Basic usage

python Fasta_formatter.py \
    -i <input_fasta> \
    -o <output_fasta> \
    --inter|--single \
    [-nt <threads>]

Required parameters:

  • -i/--input: Input FASTA file
  • -o/--output: Output file path
  • --inter: Output in interleaved format (60 characters per line)
  • --single: Output in single-line format

Optional parameters:

  • -nt/--threads: Number of threads (default: 1)

(4) Example

Convert a FASTA file to interleaved format with 4 threads:

python Fasta_formatter.py -i sequences.fasta -o sequences_formatted.fasta --inter -nt 4

Convert to single-line format:

python Fasta_formatter.py -i sequences.fasta -o sequences_singleline.fasta --single

(5) Use cases

  • Data preprocessing: Prepare sequences for downstream analysis tools that require specific FASTA formatting
  • File standardization: Convert between different FASTA formats for compatibility
  • Large file processing: Use multi-threading to speed up formatting of big datasets

8. rename_assembled_data.py

(1) Overview

rename_assembled_data.py is a Python script in the HybSuite package designed to handle batch renaming operations for assembled data directories produced in HybSuite stage 2 and their contents. It provides a comprehensive solution for renaming directories, files, and file contents while maintaining data integrity and consistency.

Key features:

  • Recursively renames directory structures, file names, and file contents
  • Handles potential naming conflicts safely
  • Supports both single directory and batch renaming operations

(2) Basic usage

Single directory renaming

To rename a single directory and all its contents:

python rename_assembled_data.py -i /path/to/directory -n new_name

Parameters:

  • -i, --input: Path to the directory you want to rename
  • -n, --new_name: The new name to replace the old name

Example:

python rename_assembled_data.py -i ./sample_001 -n sample_002

Batch renaming

For batch renaming multiple directories, create a tab-delimited file containing old and new name pairs:

python rename_assembled_data.py --rename_list path/to/rename_list.txt -p /path/to/parent_directory

Parameters:

  • --rename_list: Path to a tab-delimited file containing old_name and new_name pairs
  • -p, --parent_dir: Path to the parent directory containing all the folders to be processed

The rename list file should be formatted as follows (tab-delimited):

old_name1   new_name1
old_name2   new_name2
old_name3   new_name3
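Parsing the rename list into old-to-new pairs can be sketched as follows (a hypothetical helper, not the script's code; blank lines are skipped and malformed lines are rejected):

```python
def parse_rename_list(text):
    """Parse a tab-delimited rename list into {old_name: new_name}."""
    pairs = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected two tab-separated names")
        old, new = (f.strip() for f in fields)
        pairs[old] = new
    return pairs

print(parse_rename_list("sample_001\tsample_002"))  # {'sample_001': 'sample_002'}
```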

Example:

python rename_assembled_data.py --rename_list rename_pairs.txt -p ./assembled_data

The script will:

  1. Process each directory listed in the rename file
  2. Rename all matching files and directories within each target directory
  3. Replace matching content within files
  4. Provide a summary of successful and failed operations

Note: The script includes safety checks and will skip operations that might cause conflicts or data loss.

4 - Full parameters

This page provides the full options and parameters for each subcommand, along with additional explanations and links where necessary. The available subcommands can be viewed using the command:

hybsuite -h/--help

or:

bash <the path to HybSuite.sh> -h/--help

Parameters for running hybsuite stage1

Stage 1 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage1 ...

Mandatory arguments: -input_list -input_data (required when including user-provided data) -output_dir

Essential arguments: -sra_maxsize -NGS_dir -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types. (Default: None)
  -input_data <DIR>     The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -NGS_dir <DIR>        The output directory containing raw and cleaned reads files (Default: <output_dir>/NGS_dataset).
                        Notes: Pre-existing cleaned reads will skip reads trimming steps.

General arguments:
  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_fasterq_dump <INT>               
                        fasterq-dump threads. (Default: 1)
  -nt_pigz <INT>        pigz compression threads. (Default: 1)
  -nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of public data downloads and raw read trimming jobs to run concurrently. (Default: 1)
                        "all" means running all samples concurrently. (be cautious when setting this option)
   
  === Public raw reads downloading control ===
  -rm_sra <TRUE/FALSE>  Whether to remove SRA files after conversion. (Default: TRUE)
  -download_format <fastq|fastq_gz>
                        Downloaded data format. (Default: fastq_gz)

  === Logfile Control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
  === SRAToolkit ===
  -sra_maxsize <NUM>    The maximum size of sra files to download. (Default: 20GB)

  === Trimmomatic ===
  -trimmomatic_leading_quality <3-40> 
                        Leading base quality cutoff. (Default: 3)
  -trimmomatic_trailing_quality <3-40> 
                        Trailing base quality cutoff. (Default: 3)
  -trimmomatic_min_length <36-100>     
                        Minimum read length. (Default: 36)
  -trimmomatic_sliding_window_s <4-10> 
                        Sliding window size. (Default: 4)
  -trimmomatic_sliding_window_q <15-30>
                        Window average quality. (Default: 15)

Command example:
  # Run HybSuite stage1 with 1 thread and 1 parallel processing
  $ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./
  
  # Run HybSuite stage1 with 5 threads and 5 parallel processing
  $ hybsuite stage1 -input_list ./input_list.txt -input_data ./Input_data -NGS_dir ./NGS_dir -output_dir ./ -nt 5 -process 5

Parameters for running hybsuite stage2

Stage 2 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage2 ...

Mandatory arguments: -input_list -NGS_dir -t -output_dir

Essential arguments: -eas_dir -seqs_min_length -seqs_min_sample_coverage -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1. (Default: None)
  -input_data <DIR>     The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
  -NGS_dir <DIR>        The directory containing NGS raw and cleaned reads files (generated in stage 1). (Default: ./NGS_dir)
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
                        Note: Pre-existing data in this directory will skip redundant assembly steps.

General arguments:
  === Putative paralogs filtering control ===
  -seqs_min_length <INT>         
                        Minimum sequence length for filtered paralogs. (Default: 0)
                        Putative paralogs shorter than this value will be filtered.             
  -seqs_mean_length_ratio <0-1>    
                        Minimum sequence length ratio relative to the mean length per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the mean length will be filtered.
  -seqs_max_length_ratio <0-1>              
                        Minimum length ratio relative to the longest value per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the maximum length will be filtered.
  -seqs_min_sample_coverage <0-1>           
                        Minimum sample coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, HRS and RLWP sequences in stage 3, loci lower than this sample coverage will be filtered.
  -seqs_min_locus_coverage <0-1>            
                        Minimum locus coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, taxa (samples) with lower than this locus coverage will be filtered.
  
  === Heatmap control ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)
  
  === Threads control ===
  -nt <INT|AUTO>        Global thread setting (Default: 1)
  -nt_hybpiper <INT>    HybPiper threads (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of data assembly jobs ('hybpiper assemble') to run concurrently (Default: 1)
                        "all" means running all samples concurrently (be cautious when setting this option)
  
  === Logfile control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
   === HybPiper ===
  -hybpiper_mapping_tool <blast|diamond>     
                        The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
  -hybpiper_check_chimeric_contigs	<FALSE|TRUE>
                        Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: TRUE)
  -hybpiper_cov_cutoff <INT>
                        Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
                        Increasing this value may improve locus recovery efficiency but can introduce errors.

Command example:
  # Run HybSuite stage2 with filtering paralog sequences
  $ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5 -seqs_min_length 100 -seqs_min_sample_coverage 0.1

  # Run HybSuite stage2 without filtering paralog sequences
  $ hybsuite stage2 -NGS_dir ./NGS_dir -t ./Angiosperms353.fasta -output_dir ./ -nt 5 -process 5

Parameters for running hybsuite stage3

Stage 3 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage3 ...

Mandatory arguments: -input_list -eas_dir -paralogs_dir -t -output_dir

Essential arguments: -PH -prefix -run_phyparts -aln_min_sample -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
  -input_data <DIR>     The directory containing all input data (in this stage, only required when the inputs include pre-assembled data). (Default: None).
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences (generated in stage 2). (Default: <output_dir>/01-Assembled_data)
  -paralogs_dir <DIR>   The directory containing all paralog sequences generated in stage 2 or by users themselves. (Default: None)
                        It's advisable to set this parameter as '<output_dir>/02-All_paralogs/03-Filtered_paralogs'.
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Paralog handling control ===
  -PH <1-7|a|b|all>     Paralog handling methods to execute: (one or more of them can be chosen)
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)
  
  === Sequences and alignments filtering control ===
  -seqs_min_length <INT>
                        Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
                        HRS and RLWP sequences shorter than this value will be removed.
  -aln_min_length <INT> 
                        Minimum sequence bp length for filtering HRS and RLWP final alignments. (Default: 4)
  -aln_min_sample <INT>
                        Minimum sample number for final alignments. (Default: 0)
                        Final alignments (aligned and trimmed) with sample number below this threshold will be removed.

  === Gene tree builder control ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)

  === Alignments trimming tool control ===
  -trim_tool <1/2>      Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1) 

  === Nucleotide ambiguity character replacement ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_paragone <INT>    ParaGone threads. (Default: 1)
  -nt_phylopypruner <INT>              
                        PhyloPyPruner threads. (Default: 1)
  -nt_mafft <INT>       MAFFT threads. (Default: 1)
  -nt_amas <INT>        AMAS.py threads. (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads. (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads. (Default: 1)
  -nt_fasttree <INT>    FastTree threads. (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of sequence alignment, alignment trimming, and gene tree inference jobs to run concurrently. (Default: 1)
                        "all" means running all samples concurrently. (be cautious when setting this option)

  === Heatmap control ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)

Arguments for integrated tools :
  === PhyloPyPruner ===
  -pp_min_taxa <INT>    Minimum taxa per cluster. (Default: 4)
  -pp_min_support <0-1> Minimum support value. (Default: 0=auto)
  -pp_trim_lb <INT>     Trim long branches. (Default: 5)

  === ParaGone ===  
  -paragone_pool <INT>  Parallel alignment tasks. (Default: 1, same as the option '-process')
  -treeshrink_q_value <0-1>        
                        TreeShrink quantile threshold (Default: 0.05)
  -paragone_cutoff_value <FLOAT>       
                        Branch length cutoff (Default: 0.3)
  -paragone_minimum_taxa <INT>         
                        Minimum taxa per alignment (Default: 4)
  -paragone_min_tips <INT>             
                        Minimum tips per tree (Default: 4)
  
  === HybPiper ===
  -hybpiper_skip_chimeric_genes <FALSE|TRUE>
                        Whether to skip recovering sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
  -hybpiper_retrieved_seqs_type <dna|intron|supercontig>
                        The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3. (Default: dna, i.e., extract coding sequences)

  === MAFFT ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)

Command examples:
  # Run HybSuite stage3 without alignments filtering
  $ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5
  
  # Run HybSuite stage3 with alignments filtering
  $ hybsuite stage3 -eas_dir ./01-Assembled_data -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs -t ./Angiosperms353 -PH 124567b -output_dir ./ -nt 5 -process 5 -aln_min_length 100 -aln_min_sample 0.1

Parameters for running hybsuite stage4

Stage 4 Manual
--------------------------------------------------------------------------------
Usage: hybsuite stage4 ...

Mandatory arguments: -input_list -aln_dir -output_dir

Essential arguments: -PH -sp_tree -prefix -run_phyparts -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types used in stage 1&2. (Default: None)
  -aln_dir <DIR>        The directory containing the orthogroup alignments generated in stage 3. (Default: <output_dir>/06-Final_alignments)
                        It's advisable to keep this parameter as '<output_dir>/06-Final_alignments'.
  -PH <1-7|a|b|all>     Choose alignments generated via paralog handling methods as input:
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1 (one or more of them can be chosen)
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)

Arguments for outputs:
  -output_dir <DIR>     Output directory for all pipeline results (better to be consistent across all stages). (Default: None)
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Species tree builder control ===
  -sp_tree <1-5|all>    Species tree inference method:
                        1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL
  
  === Steps control ===
  -run_coalescent_step <INT> 
                        Control which coalescent analysis steps to run:
                        1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
                        (Default: 1234)
  -run_concatenated_step <INT> 
                        Control which concatenated analysis steps to run:
                        1: Construct concatenated alignment, 2: Infer species tree
                        (Default: 12)
  
  === Gene tree builder control ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)
  
  === Gene trees collapse threshold ===
  -collapse_threshold <VALUE>
                        Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
                        Nodes with support values ≤ this threshold will be collapsed into polytomies.

  === Nucleotide ambiguity character replacement ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Threads control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_amas <INT>        AMAS.py threads (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads (Default: 1)
  -nt_fasttree <INT>    FastTree threads (Default: 1)
  -nt_raxml_ng <INT>    RAxML-NG threads (Default: 1)
  -nt_raxml <INT>       RAxML threads (Default: 1)
  -nt_astral4 <INT>     ASTRAL-IV threads (Default: 1)
  -nt_wastral <INT>     wASTRAL threads (Default: 1)
  -nt_astral_pro <INT>  ASTRAL-Pro3 threads (Default: 1)

  === Parallel control ===
  -process <INT|all>    Number of gene tree inference tasks in coalescent analysis to run concurrently. (Default: 1)
                        "all" means running all gene trees concurrently. (use this option with caution)

Arguments for integrated tools:
  === IQ-TREE (concatenated analysis) ===
  -iqtree_bb <INT>      IQ-TREE bootstrap replicates (Default: 1000)
  -iqtree_alrt <INT>    SH-aLRT replicates (Default: 1000)
  -iqtree_run_option <str>      
                        IQ-TREE run mode [standard|undo] (Default: undo)
  -iqtree_partition <TRUE/FALSE>       
                        Whether to use partition models in IQ-TREE (Default: TRUE)
  -iqtree_constraint_tree <Treefile>           
                        The path to the constraint tree for running IQ-TREE (Default: none)

  === ModelTest-NG ===
  -run_modeltest_ng <TRUE/FALSE>       
                        Whether to run ModelTest-NG (Default: TRUE)

  === RAxML ===
  -raxml_m <str>        RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
  -raxml_bb <INT>       RAxML bootstrap replicates (Default: 1000)
  -raxml_constraint_tree <Treefile>              
                        The path to the constraint tree for running RAxML (Default: no constraint tree)

  === RAxML-NG ===
  -rng_bs_trees <INT>   RAxML-NG bootstrap replicates (Default: 1000)
  -rng_force <TRUE/FALSE>              
                        Ignore thread warnings (Default: FALSE)
  -rng_constraint_tree <Treefile>                
                        The path to the constraint tree for running RAxML-NG (Default: no constraint tree)

  === ASTRAL-IV ===
  -astral4_root <STRING>
                        Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
                        (Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)  
  -astral_r <INT>       ASTRAL-IV rounds of search. (Default: 4)
  -astral_s <INT>       ASTRAL-IV rounds of subsampling. (Default: 4)

  === wASTRAL ===
  -wastral_mode <1-4>   wASTRAL mode [1|2|3|4] (Default: 1)
                        1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
  -wastral_r <INT>      wASTRAL rounds of search. (Default: 4)
  -wastral_s <INT>      wASTRAL rounds of subsampling. (Default: 4)

  === ASTRAL-Pro ===
  -astral_pro_r <INT>   ASTRAL-Pro rounds of search. (Default: 4)
  -astral_pro_s <INT>   ASTRAL-Pro rounds of subsampling. (Default: 4)

  === MAFFT (only for paralogs inclusion method -> ASTRAL-Pro) ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl (only for paralogs inclusion method -> ASTRAL-Pro) ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner (only for paralogs inclusion method -> ASTRAL-Pro) ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)

  === PhyPartsPieCharts & modified_phypartspiecharts ===
  -run_phyparts <TRUE|FALSE>
                        Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
                        Note: Requires successful completion of previous coalescent analysis.
  -phypartspiecharts_tree_type <cladogram/circle>
                        The tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
  -phypartspiecharts_num_mode <num>
                        Control what numbers to show on branches (specify 0-2 digits) (Default: 12)
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
                        9: Original node support values from the input tree

Command examples:
  # Run HybSuite stage4 with IQ-TREE
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 1
  
  # Run HybSuite stage4 with ASTRAL-IV
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 4

  # Run HybSuite stage4 with ASTRAL-IV and PhyParts
  $ hybsuite stage4 -aln_dir ./06-Final_alignments -t ./Angiosperms353 -PH 1234567a -output_dir ./ -nt 5 -process 5 -sp_tree 4 -run_phyparts TRUE

Parameters for running hybsuite full_pipeline

HybSuite full pipeline Manual
--------------------------------------------------------------------------------
Usage: hybsuite full_pipeline ...

Mandatory arguments: -input_list -input_data (required when including user-provided data) -t -output_dir

Essential arguments: -PH -sp_tree -seqs_min_length -aln_min_sample -prefix -nt -process

Arguments for inputs:
  -input_list <FILE>    The file listing input sample names and corresponding data types. (Default: None)
  -input_data <DIR>     The directory containing all input data (required when the inputs include your own data / pre-assembled data). (Default: None).
  -t <FILE>             Target file for data assembly. (follows the format required in HybPiper)

Arguments for outputs:
  -output_dir <DIR>     The output directory for all pipeline results. (Default: None)
  -NGS_dir <DIR>        The output directory containing raw and cleaned reads files (see GitHub documentation).
                        Note: If cleaned reads already exist, the read-trimming steps will be skipped.
  -eas_dir <DIR>        The output directory containing HybPiper assembly sequences. (Default: <output_dir>/01-Assembled_data)
                        Note: If assembled data already exist in this directory, the corresponding assembly steps will be skipped.
  -prefix <STRING>      Prefix for output files. (Default: HybSuite)

General arguments:
  === Stages running control ===
  -skip_stage <1|2|3|12|123>
                        Specify pipeline stages to skip during execution. (Default: None, running all stages)
                        Note: Particularly useful for re-running specific HybSuite pipeline stages.
                        (e.g., '-skip_stage 1' for skipping stage 1)
  -run_to_stage <1|2|3> Specify pipeline stages to run up to (Default: None, running all stages)
                        (e.g., '-run_to_stage 3' for stopping before stage 4)

  === Public raw reads downloading control (Stage 1) ===
  -rm_sra <TRUE/FALSE>  Whether to remove SRA files after conversion. (Default: TRUE)
  -download_format <fastq|fastq_gz>
                        Downloaded data format. (Default: fastq_gz)

  === Putative paralogs filtering control (Stage 2) ===
  -seqs_min_length <INT>         
                        Minimum sequence length for filtered paralogs. (Default: 0)
                        Putative paralogs shorter than this value will be filtered.             
  -seqs_mean_length_ratio <0-1>
                        Minimum sequence length ratio relative to the mean length per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the mean length will be filtered.
  -seqs_max_length_ratio <0-1>
                        Minimum length ratio relative to the longest sequence per locus for putative paralogs. (Default: 0)
                        Putative paralogs shorter than this fraction of the maximum length will be filtered.
  -seqs_min_sample_coverage <0-1>           
                        Minimum sample coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2 and HRS/RLWP sequences in stage 3, loci with sample coverage below this value will be filtered.
  -seqs_min_locus_coverage <0-1>            
                        Minimum locus coverage for putative paralogs. (Default: 0)
                        For all putative paralogs in stage 2, taxa (samples) with locus coverage below this value will be filtered.

  === Heatmap control (Stage 2&3) ===
  -heatmap_color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient. (Default: black)

  === Paralog handling control (Stage 3) ===
  -PH <1-7|a|b|all>     Paralog handling methods to execute: (one or more of them can be chosen)
                        1: HRS, 2: RLWP, 3: LS, 4: MI, 5: MO, 6: RT, 7: 1to1
                        a: PhyloPyPruner, b: ParaGone (Default: 1a)
  
  === Sequences and alignments filtering control (Stage 3) ===
  -seqs_min_length <INT>
                        Minimum sequence bp length for filtering HRS and RLWP sequences. (Default: 0)
                        HRS and RLWP sequences shorter than this value will be removed.
  -aln_min_length <INT> 
                        Minimum alignment length (bp) for HRS and RLWP final alignments. (Default: 4)
  -aln_min_sample <INT>
                        Minimum sample number for final alignments. (Default: 5)
                        Final alignments (aligned and trimmed) with sample number below this threshold will be removed.

  === Alignments trimming tool control (Stage 3) ===
  -trim_tool <1/2>      Choose the software to trim/clean alignments. (1: trimAl; 2: HMMCleaner) (Default: 1)
  
  === Gene trees builder control (Stage 3&4) ===
  -gene_tree <1/2>      Choose the software to construct paralogs gene trees. (1: IQ-TREE; 2: FastTree) (Default: 1) 
  -gene_tree_bb <INT>   Choose the bootstrap value for paralogs gene trees inference. (Default: 1000)

  === Species tree builder control (Stage 4) ===
  -sp_tree <1-6|all>    Species tree inference method: (Default: 1)
                        1: IQ-TREE, 2: RAxML, 3: RAxML-NG, 4: ASTRAL-IV, 5: wASTRAL, 6: ASTRAL-Pro
  
  === Steps control in stage 4 ===
  -run_coalescent_step  <INT> 
                        Control which coalescent analysis steps to run:
                        1: Construct single gene trees, 2: Combine and collapse gene trees, 3: Infer species tree, 4: Reroot gene trees, 5: PhyParts concordance analysis
                        (Default: 1234)
  -run_concatenated_step <INT> 
                        Control which concatenated analysis steps to run:
                        1: Construct concatenated alignment, 2: Infer species tree
                        (Default: 12)
  
  === Nucleotide ambiguity character replacement (Stage 3&4) ===
  -replace_n <TRUE|FALSE>
                        Replace ambiguous characters ('n', 'N', '?') with gaps ('-') in alignment files. (Default: FALSE)
                        Note: Recommended for phylogenetic software compatibility (e.g., IQ-TREE, trimAl).

  === Gene trees collapse threshold ===
  -collapse_threshold <VALUE>
                        Specify the minimum support value threshold for internal nodes in gene trees. (Default: 0)
                        Nodes with support values ≤ this threshold will be collapsed into polytomies.
  
  === Threads Control ===
  -nt <INT|AUTO>        Global thread setting. (Default: 1)
  -nt_fasterq_dump <INT>               
                        fasterq-dump threads. (Default: 1)
  -nt_pigz <INT>        pigz compression threads. (Default: 1)
  -nt_trimmomatic <INT> Trimmomatic threads. (Default: 1)
  -nt_hybpiper <INT>    HybPiper threads (Default: 1)
  -nt_paragone <INT>    ParaGone threads. (Default: 1)
  -nt_phylopypruner <INT>              
                        PhyloPyPruner threads. (Default: 1)
  -nt_mafft <INT>       MAFFT threads. (Default: 1)
  -nt_amas <INT>        AMAS.py threads. (Default: 1)
  -nt_modeltest_ng <INT>               
                        ModelTest-NG threads. (Default: 1)
  -nt_iqtree <INT>      IQ-TREE threads. (Default: 1)
  -nt_fasttree <INT>    FastTree threads. (Default: 1)
  -nt_raxml_ng <INT>    RAxML-NG threads (Default: 1)
  -nt_raxml <INT>       RAxML threads (Default: 1)
  -nt_astral4 <INT>     ASTRAL-IV threads (Default: 1)
  -nt_wastral <INT>     wASTRAL threads (Default: 1)
  -nt_astral_pro <INT>  ASTRAL-Pro3 threads (Default: 1)

  === Parallel Control ===
  -process <INT|all>    Number of subprocesses to run concurrently. (Default: 1)
                        "all" means running all subprocesses concurrently. (use this option with caution)
                        The related steps are:
                        Stage 1: public data downloading and raw reads trimming;
                        Stage 2: data assembly ('hybpiper assemble');
                        Stage 3: multiple sequence alignment, alignment trimming, and gene tree inference;
                        Stage 4: gene tree inference in coalescent analysis.
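As a rough sizing guide, the peak core usage of a parallel step is the product of -process and the relevant per-tool thread option. The sketch below uses hypothetical values (4 concurrent MAFFT jobs, 2 threads each) purely to illustrate the arithmetic; it is not part of HybSuite:

```shell
# Hypothetical sizing for the Stage 3 alignment step:
PROCESS=4      # value passed to -process
NT_MAFFT=2     # value passed to -nt_mafft
PEAK=$((PROCESS * NT_MAFFT))
echo "peak cores used while aligning: $PEAK"
# Keep PEAK at or below the machine's core count, e.g.:
#   hybsuite full_pipeline ... -process 4 -nt_mafft 2
```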

  === Logfile Control ===
  -log_mode <simple|cmd|full>
                        The output mode of hybsuite logfile. (Default: cmd)

Arguments for integrated tools:
  === SRAToolkit (Stage 1) ===
  -sra_maxsize <NUM>    The maximum size of SRA files to download. (Default: 20GB)

  === Trimmomatic (Stage 1) ===
  -trimmomatic_leading_quality <3-40>
                        Leading base quality cutoff. (Default: 3)
  -trimmomatic_trailing_quality <3-40> 
                        Trailing base quality cutoff. (Default: 3)
  -trimmomatic_min_length <36-100>
                        Minimum read length. (Default: 36)
  -trimmomatic_sliding_window_s <4-10> 
                        Sliding window size. (Default: 4)
  -trimmomatic_sliding_window_q <15-30>
                        Window average quality. (Default: 15)

  === HybPiper (Stage 2 & 3) ===
  -hybpiper_mapping_tool <blast|diamond>     
                        The tool used for mapping reads to targets in HybPiper (only for protein targets) (Default: blast)
  -hybpiper_check_chimeric_contigs	<FALSE|TRUE>
                        Check whether a stitched contig is a potential chimera of contigs from multiple paralogs when running "hybpiper assemble". (Default: FALSE)
  -hybpiper_cov_cutoff <INT>
                        Specify the value of "-cov_cutoff" when running "hybpiper assemble" in Stage 2. (Default: 8)
                        Increasing this value may increase loci recovery efficiency but may also introduce errors.
  -hybpiper_skip_chimeric_genes <FALSE|TRUE>
                        Whether to skip recovering sequences for putative chimeric genes when running "hybpiper retrieve_sequences" (HRS method) in Stage 3. (Default: FALSE)
  -hybpiper_retrieved_seqs_type <dna|intron|supercontig>
                        The type of sequence to extract when running "hybpiper retrieve_sequences" in Stage 3. (Default: dna, i.e., extract coding sequences)
  
  === PhyloPyPruner (Stage 3) ===
  -pp_min_taxa <INT>    Minimum taxa per cluster. (Default: 4)
  -pp_min_support <0-1> Minimum support value. (Default: 0=auto)
  -pp_trim_lb <INT>     Trim long branches. (Default: 5)

  === ParaGone (Stage 3) ===  
  -paragone_pool <INT>  Parallel alignment tasks. (Default: 1, same as the option '-process')
  -treeshrink_q_value <0-1>        
                        TreeShrink quantile threshold (Default: 0.05)
  -paragone_cutoff_value <FLOAT>       
                        Branch length cutoff (Default: 0.3)
  -paragone_minimum_taxa <INT>         
                        Minimum taxa per alignment (Default: 4)
  -paragone_min_tips <INT>             
                        Minimum tips per tree (Default: 4)
  
  === MAFFT (Stage 3) ===  
  -mafft_algorithm <str>               
                        MAFFT algorithm [auto|linsi] (Default: auto)
  -mafft_adjustdirection <TRUE/FALSE>  
                        Whether to adjust sequence directions (Default: TRUE)
  -mafft_maxiterate <INT>              
                        Maximum number of iterations for MAFFT (Default: auto)
                        Specifies the maximum number of iterations MAFFT will perform during multiple sequence alignment. Higher iteration counts may improve alignment accuracy but will increase computation time.
  -mafft_pair <str>                    
                        Pairing strategy for MAFFT (Default: auto)
                        Specifies the pairing strategy used by MAFFT during multiple sequence alignment. Options include auto, localpair, globalpair, etc. Choosing the appropriate strategy can affect the alignment results and efficiency.
  
  === trimAl (Stage 3) ===
  -trimal_mode <str>                   
                        trimAl mode [automated1|strict|strictplus|gappyout|nogaps|noallgaps] (Default: automated1)
  -trimal_gapthreshold <0-1>           
                        Gap threshold (Default: 0.12)
  -trimal_simthreshold <0-1>           
                        Similarity threshold (Default: auto)
  -trimal_cons <0-100>                 
                        Consensus threshold (Default: auto)
  -trimal_block <INT>                  
                        Minimum block size (Default: auto)
  -trimal_w <INT>                      
                        Window size (Default: auto)
  -trimal_gw <INT>                     
                        Gap window size (Default: auto)
  -trimal_sw <INT>                     
                        Similarity window size (Default: auto)
  -trimal_resoverlap <0-1>             
                        Minimum overlap of a position with other positions in the column. (Default: auto)
  -trimal_seqoverlap <0-100>           
                        Minimum percentage of sequences without gaps in a column. (Default: auto)
  
  === HMMCleaner (Stage 3) ===
  -hmmcleaner_cost <NUM1_NUM2_NUM3_NUM4>
                        Cost parameters that define the low-similarity segments detected by HmmCleaner. (Default: -0.15_-0.08_0.15_0.45)
                        Each value can be changed, but the values must remain in increasing order. (NUM1 < NUM2 < 0 < NUM3 < NUM4)
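The ordering constraint can be checked mechanically before launching a long run. The snippet below is a hypothetical stand-alone validation of a -hmmcleaner_cost string (shown with the default values); it is not part of HybSuite:

```shell
# Validate that a -hmmcleaner_cost string satisfies NUM1 < NUM2 < 0 < NUM3 < NUM4.
COST="-0.15_-0.08_0.15_0.45"   # the HybSuite default
printf '%s\n' "$COST" |
  awk -F'_' '{ if ($1 < $2 && $2 < 0 && 0 < $3 && $3 < $4) print "valid"; else print "invalid" }'
```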
  
  === IQ-TREE (Stage 4) ===
  -iqtree_bb <INT>      IQ-TREE bootstrap replicates (Default: 1000)
  -iqtree_alrt <INT>    SH-aLRT replicates (Default: 1000)
  -iqtree_run_option <str>      
                        IQ-TREE run mode [standard|undo] (Default: undo)
  -iqtree_partition <TRUE/FALSE>       
                        Whether to use partition models in IQ-TREE (Default: TRUE)
  -iqtree_constraint_tree <Treefile>           
                        The path to the constraint tree for running IQ-TREE (Default: none)

  === ModelTest-NG (Stage 4) ===
  -run_modeltest_ng <TRUE/FALSE>       
                        Whether to run ModelTest-NG (Default: TRUE)

  === RAxML (Stage 4) ===
  -raxml_m <str>        RAxML model [GTRGAMMA|PROTGAMMA] (Default: GTRGAMMA)
  -raxml_bb <INT>       RAxML bootstrap replicates (Default: 1000)
  -raxml_constraint_tree <Treefile>              
                        The path to the constraint tree for running RAxML (Default: no constraint tree)

  === RAxML-NG (Stage 4) ===
  -rng_bs_trees <INT>   RAxML-NG bootstrap replicates (Default: 1000)
  -rng_force <TRUE/FALSE>              
                        Ignore thread warnings (Default: FALSE)
  -rng_constraint_tree <Treefile>                
                        The path to the constraint tree for running RAxML-NG (Default: no constraint tree)

  === ASTRAL-IV (Stage 4) ===
  -astral4_root <STRING>
                        Outermost (most distant) outgroup taxon name for ASTRAL-IV branch length calculation. (Default: none)
                        (Strongly recommended for accurate branch length estimation. Specify only the single outermost outgroup.)
  -astral_r <INT>       ASTRAL-IV rounds of search. (Default: 4)
  -astral_s <INT>       ASTRAL-IV rounds of subsampling. (Default: 4)

  === wASTRAL (Stage 4) ===
  -wastral_mode <1-4>   wASTRAL mode [1|2|3|4] (Default: 1)
                        1: hybrid weighting, 2: support only, 3: length only, 4: unweighted
  -wastral_r <INT>      wASTRAL rounds of search. (Default: 4)
  -wastral_s <INT>      wASTRAL rounds of subsampling. (Default: 4)

  === ASTRAL-Pro (Stage 4) ===
  -astral_pro_r <INT>   ASTRAL-Pro rounds of search. (Default: 4)
  -astral_pro_s <INT>   ASTRAL-Pro rounds of subsampling. (Default: 4)

  === PhyPartsPieCharts & modified_phypartspiecharts (Stage 4) ===
  -run_phyparts <TRUE|FALSE>
                        Enable/disable PhyParts concordance analysis and modified pie chart visualization. (Default: TRUE)
                        Note: Requires successful completion of previous coalescent analysis.
  -phypartspiecharts_tree_type <cladogram/circle>
                        The tree display type used when running modified_phypartspiecharts.py (Default: cladogram)
  -phypartspiecharts_num_mode <num>
                        Control what numbers to show on branches (specify 0-2 digits) (Default: 12)
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes ((red+green)/(blue+red+green))
                        9: Original node support values from the input tree
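The proportion modes (4-8) are simple ratios over the per-branch gene counts. The snippet below works through modes 4 and 7 with made-up counts (120 supporting, 30 conflicting, 50 no-signal) to make the denominators explicit; the numbers are purely illustrative:

```shell
# Hypothetical per-branch gene counts (not real PhyParts output):
BLUE=120        # genes supporting the species tree
RED_GREEN=30    # genes conflicting with the species tree
GRAY=50         # genes with no signal
TOTAL=$((BLUE + RED_GREEN + GRAY))
# Mode 4: proportion of supporting genes = blue / total
awk -v b="$BLUE" -v t="$TOTAL" 'BEGIN{printf "mode 4: %.2f\n", b/t}'
# Mode 7: supporting over all signal genes = blue / (blue + red + green)
awk -v b="$BLUE" -v s="$((BLUE + RED_GREEN))" 'BEGIN{printf "mode 7: %.2f\n", b/s}'
```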

Command examples:
  === Run the full pipeline with all paralog-handling methods and all species tree inference approaches ===
  hybsuite full_pipeline \
  -input_list ./Input_list.txt \
  -input_data ./Input_data \
  -t Angiosperms353.fasta \
  -PH 1234567 \
  -sp_tree 12345 \
  -output_dir ./ \
  -nt 5 -process 5
  
  === Run the full pipeline with only tree-based orthology inference methods (MO/MI/RT/1to1) in ParaGone and ASTRAL-IV ===
  hybsuite full_pipeline \
  -input_list ./Input_list.txt \
  -input_data ./Input_data \
  -t Angiosperms353.fasta \
  -PH 4567b \
  -sp_tree 4 \
  -output_dir ./ \
  -nt 5 -process 5

5 - Tutorial

This page helps you prepare input files, configure parameters, and run HybSuite.


1. Prepare input files

(1) The sample list file

This file should list sample names along with their corresponding sequence-type identifiers (separated by \t). The HybSuite pipeline supports three input sequence types:

Type1: public raw reads from NCBI SRA

To download NGS raw reads from NCBI for phylogenetic analysis, you should:

Format the sample list file as:

  • Column 1: Sample names
  • Column 2: Corresponding accession numbers

(Tab-delimited, one sample per line)

Taxon1	SRR...
Taxon2	SRR...
...

Type2: user-provided raw reads

To prepare your own existing NGS raw reads as input, you should:

1. Format the sample list file as:

  • Column 1: Sample names
  • Column 2: The character A

(Tab-delimited, one sample per line)

Taxon3	A
Taxon4	A
...

2. Place the raw data files (paired-end/single-end, in FASTQ or FASTQ.GZ format) with corresponding names in the -input_data directory (see naming rules here).


Type3: pre-assembled sequences

To prepare pre-assembled sequences as input, you should:

1. Format the sample list file as:

  • Column 1: Sample names
  • Column 2: The character B

(Tab-delimited, one sample per line)

Taxon5	B
Taxon6	B
...

2. Place the corresponding pre-assembled sequence files in the -input_data directory (see naming rules here).


Combine sequence types together

To include multiple sequence types shown above as pipeline inputs, simply combine all entries in the sample list file:

Taxon1	SRR...
Taxon2	SRR...
Taxon3	A
Taxon4	A
Taxon5	B
Taxon6	B

Outgroup Specification

This step is not necessary if you only run Stage 1 and Stage 2, but it is required for Stages 3-4 (orthology inference & species tree inference):

To specify outgroups in the sample list file, mark each outgroup sample with the word Outgroup in column 3 (tab-separated from column 2). Example (outgroup = Taxon5):

Taxon1	SRR...
Taxon2	SRR...
Taxon3	A
Taxon4	A
Taxon5	B	Outgroup
Taxon6	B
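Shell one-liners can help build and sanity-check such a list; the sketch below is illustrative only (the file name, accessions, and taxa are placeholders), not part of HybSuite:

```shell
# Illustrative only: build a tab-delimited sample list like the one above
printf 'Taxon1\tSRR000001\nTaxon3\tA\nTaxon5\tB\tOutgroup\n' > Input_list.txt

# Sanity check: every row must hold 2 or 3 tab-separated fields
awk -F'\t' 'NF < 2 || NF > 3 {print "bad row " NR ": " $0; bad=1} END {exit bad}' Input_list.txt \
  && echo "sample list OK"
# prints: sample list OK
```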

(2) The directory containing sequence files

Required if:

Your sample list includes existing raw reads or pre-assembled sequences.

1. Naming rules for existing raw reads files in the input directory:

If <Taxon> is listed in the sample list file, its raw data file should be named as follows:

  • Paired-end data: <Taxon>_1.<suffix> + <Taxon>_2.<suffix>
  • Single-end data: <Taxon>.<suffix>

2. Naming rules for pre-assembled sequences in the input directory:

If <Taxon> is listed in the sample list file, its pre-assembled sequence file should be named as follows:

  • <Taxon>.fasta
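As an illustration of these rules, a mixed -input_data directory could be laid out like this (the taxon names and the .fq.gz suffix are placeholder assumptions):

```shell
# Illustrative only: an -input_data layout matching the naming rules above
mkdir -p Input_data
touch Input_data/Taxon3_1.fq.gz Input_data/Taxon3_2.fq.gz   # paired-end raw reads
touch Input_data/Taxon4.fq.gz                               # single-end raw reads
touch Input_data/Taxon5.fasta                               # pre-assembled sequences
ls Input_data
```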

(3) The target sequence file

  • The required format for the target sequences file is nearly identical to that used in HybPiper.
  • The only difference is that in HybSuite, the gene name for a target sequence must be placed immediately after the final hyphen (-) in the header line (see the example shown below).
  • For example, see the Reference.fasta file in the example dataset.
  • For more details, refer to HybPiper’s documentation to edit your target file.
>Elaeagnus-pungens-4471
AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
...
>Hippophae-rhamnoides-4527
GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
...
>Hippophae-salicifolia-4691
CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
...
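Because the locus name sits after the final hyphen, it can be extracted from each header with a simple awk one-liner (illustrative only, not part of HybSuite):

```shell
# The locus name is the token after the final hyphen in each header line
printf '>Elaeagnus-pungens-4471\nAATG\n>Hippophae-rhamnoides-4527\nGAAG\n' \
  | awk -F'-' '/^>/ {print $NF}'
# prints:
# 4471
# 4527
```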

2. Construct command

(1) Basic command pattern

Conda version:

hybsuite <subcommand> [options] ...

Local version:

bash <path to HybSuite.sh> <subcommand> [options] ...

(2) Available subcommands

HybSuite provides modular subcommands for flexible workflow execution:

Run individual stages:

hybsuite stage1 [options]...    # run Stage 1: NGS dataset construction
hybsuite stage2 [options]...    # run Stage 2: Data assembly and filtering
hybsuite stage3 [options]...    # run Stage 3: Paralog handling
hybsuite stage4 [options]...    # run Stage 4: Species tree inference

Run the full pipeline:

hybsuite full_pipeline [options]...    # Execute stages 1-4 in one go

Retrieve results:

After completing the full pipeline from stage 1 to 4, retrieve key output files:

hybsuite retrieve_results -i <hybsuite_comprehensive_output_dir> -o <results_dir>

This subcommand collects all final trees and summary statistics from the HybSuite comprehensive output directory.


3. Configure your parameters

This section guides you through running each stage sequentially and how to configure related parameters.

Assuming you have prepared all required input files (the sample list file, the input sequence directory, and the target sequence file) in the formats described above, you can move forward and run each stage sequentially or run the full pipeline in one go.


HybSuite checking

Before running each stage or the full pipeline, HybSuite automatically checks all dependencies, configured parameters, and sample information. If any invalid parameters or incorrectly formatted input files are detected, the program notifies you and exits.

After the checks, the program prompts you to confirm whether to proceed with running HybSuite (y) or not (n).

To skip checking, specify -check as FALSE when running the pipeline.


Run hybsuite stage1

Purpose: run HybSuite Stage 1: "NGS dataset construction": download public raw reads, integrate user-provided data, and perform adapter trimming.

Step 1: Configure mandatory parameters

The following parameters must be specified when running hybsuite stage1 (failing to do so will cause HybSuite to exit during execution):

(1) Configure input file parameters:

  • -input_list <FILE>
    Specify the sample list file (as described in the section (1) The sample list file)
  • -input_data <DIR>
    Specify the directory containing all user-provided raw reads and pre-assembled sequences (not needed if the sample list includes no user-provided raw reads or pre-assembled sequences).

(2) Configure output file parameters:

  • -output_dir <DIR>
    Specify the directory for storing HybSuite’s comprehensive output files. Using the same directory across all stages is recommended so that all outputs end up in one folder.
Step 2: Configure essential parameters

The following parameters are essential for running this stage; check whether to configure them depending on your analysis.

  • -nt <NUM>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of samples processed in parallel (Default: 1). This applies to the “raw read downloading” and “adapter trimming” steps in Stage 1.
  • -sra_maxsize <?GB>
    Specify the maximum SRA file size to download (Default: 20GB). If a sample’s raw read files on NCBI exceed this size, downloading is skipped for that sample.
  • -NGS_dir <DIR>
    Specify the output directory for raw and clean read files (Default: <output_dir>/NGS_dataset, where <output_dir> is set by -output_dir)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage1 here for more customizable settings.

Example command:

# basic common pattern:
hybsuite stage1 -input_list <FILE> -input_data <DIR> -output_dir <DIR> -nt <NUM|AUTO> -process <NUM>

# command for our example dataset (Angiosperms353) (8 threads; 5 samples in parallel)
cd <Path_to_"HybSuite-master/example_datasets/Angiosperms353/">
hybsuite stage1 -input_list ./Input_list.txt -input_data ./Input_data -output_dir ./Output -nt 8 -process 5
Step 4: Check output files

After running stage 1, you can check the output files for your analysis.
See the output files for hybsuite stage1 here for more details.


Run hybsuite stage2

Purpose: run HybSuite Stage 2: "Data assembly and paralog retrieval": assemble reads using HybPiper, retrieve paralog sequences, and filter them by length and by sample or locus coverage.

Step 1: Configure mandatory parameters

(1) Input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stage 1; see here).
  • -NGS_dir <DIR>
    NGS dataset output by Stage 1.
    (default: <output_dir>/NGS_dataset)
  • -input_data <DIR>
    The same parameter as specified in Stage 1.
    This option is required only when pre-assembled sequences are provided as input.
  • -t <FILE>
    Target sequence file in HybPiper format.

For example, if you already have clean paired-end data for <taxon1> and clean single-end data for <taxon2>, place them with the file names shown below so that HybSuite skips processing these two samples:

<NGS_dir>/
β”œβ”€β”€01-Downloaded_raw_data
β”œβ”€β”€02-Downloaded_clean_data
└──03-My_clean_data
    β”œβ”€β”€<taxon1>_1_clean.paired.fq.gz
    β”œβ”€β”€<taxon1>_2_clean.paired.fq.gz
    └──<taxon2>_clean.single.fq.gz

(2) Output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.

  • -eas_dir <DIR>
    Specify the output directory for storing assembled results (one sample per subdirectory) generated by hybpiper assemble (see the HybPiper manual).
    (default: <output_dir>/01-Assembled_data)

For example, if you have directories of assembled sequences produced by hybpiper assemble for <taxon1> and <taxon2>:

  • Create a directory, place the two directories named <taxon1> and <taxon2> in it, and specify -eas_dir as the path to this new directory (shown below; <eas_dir> is the directory specified by -eas_dir).
<eas_dir>/
β”œβ”€β”€<taxon1>
└──<taxon2>
  • Then, include the names of <taxon1> and <taxon2> in the sample list file (specified by -input_list).
Step 2: Configure essential parameters

The following parameters are optional but essential for this stage; check whether to configure them depending on your analysis.

(1) Thread and parallel:

  • -nt <NUM|AUTO>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of samples processed in parallel (Default: 1). This applies to the “data assembly” step in Stage 2.
  • -eas_dir <DIR>
    Specify existing assembled data to skip redundant assembly.

(2) Paralog sequence filtering:

  • -seqs_min_sample_coverage <NUM:0-1>
    Specify the minimum sample coverage of recovered loci. Loci with sample coverage below this threshold are removed.
    (default: 0, recommended value: 0.1)
  • -seqs_min_locus_coverage <NUM:0-1>
    Specify the minimum locus coverage of samples. Samples with locus coverage below this threshold are removed.
    (default: 0)
  • -seqs_min_length <NUM>
    Specify the minimum sequence length for filtering paralog sequences. Sequences shorter than this threshold are removed.
    (default: 0, recommended value: 100)
  • -seqs_mean_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the mean length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)
  • -seqs_max_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the maximum length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)
  • -seqs_min_length_ratio <NUM:0-1>
    Specify the minimum length ratio relative to the minimum length of target sequences. Sequences with a length ratio below this threshold are removed. (default: 0)

(3) Parameters related to hybpiper assemble:

  • -hybpiper_mapping_tool <blast|diamond>
    Specify the read mapping tool used in data assembly via hybpiper assemble.
    (default: blast)
  • -hybpiper_mapping_tool <TRUE|FALSE>
    Specify whether to check chimeric contigs.
    (default: TRUE)
  • -hybpiper_cov_cutoff <INT>
    Coverage cutoff for SPAdes when running “hybpiper assemble” in Stage 2.
    Increasing this value may improve locus recovery efficiency but can also introduce errors.
    (Default: 8)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage2 here for more customizable settings.

Example command

# Recommended command mode
hybsuite stage2 \
  -input_list <FILE> \
  -NGS_dir <DIR> \
  -t <FILE> \
  -output_dir <DIR> \
  -seqs_min_length <NUM> \
  -seqs_min_sample_coverage <NUM> \
  -nt <NUM> -process <NUM>

# Command for the example dataset (Angiosperms353)
hybsuite stage2 \
  -input_list ./Input_list.txt \
  -NGS_dir ./NGS_dataset \
  -t ./Target_file_Angiosperm353.fasta \
  -output_dir ./Output \
  -seqs_min_length 100 \
  -seqs_min_sample_coverage 0.1 \
  -nt 8 -process 5
Step 4: Check output files

After running stage 2, you can check the output files for your analysis.
See the output files for hybsuite stage2 here for more details.


Run hybsuite stage3

Purpose: run HybSuite Stage 3: "Paralog handling": optionally apply seven paralog-handling methods to infer orthology groups and generate final alignments for Stage 4 (species tree inference).

Step 1: Configure mandatory parameters:

(1) Input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stage 1; see here).
  • -eas_dir <DIR>
    Directory containing assembled sequences (one sample per subdirectory) generated by hybpiper assemble (see the HybPiper manual) in Stage 2 or provided by users. (default: <output_dir>/01-Assembled_data)
  • -paralogs_dir <DIR>
    Directory containing all paralog sequences generated in Stage 2 or provided by users. If Stage 2 has been executed, set this parameter to <output_dir>/02-All_paralogs/03-Filtered_paralogs, where <output_dir> refers to the comprehensive output directory specified by -output_dir.
    (default: none)
  • -input_data <DIR>
    The same parameter as specified in Stage 1.
    This option is required only when pre-assembled sequences are provided as input and the “HRS” or “RLWP” methods are selected.
  • -t <FILE>
    Target sequence file in HybPiper format.

(2) Output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
  • -prefix <STRING>
    Output file prefix. (default: HybSuite)
Step 2: Configure essential parameters:

The following parameters are optional but essential for this stage; check whether to configure them based on your analysis.

(1) Thread and parallel:

  • -nt <NUM|AUTO>
    Number of threads to use for HybSuite. All integrated tools will use this value (default: 1).
  • -process <NUM>
    Specify the number of parallel processes for this stage (Default: 1).

(2) Paralog-handling methods applied in this stage:

  • -PH <1-7|a|b|all>
    Paralog handling methods (1=HRS, 2=RLWP, 3=LS, 4=MI, 5=MO, 6=RT, 7=1to1, a=PhyloPyPruner, b=ParaGone)
    (One or several methods can be specified; default: 1a)

(3) Sequence filtering:

  • -seqs_min_length <INT>
    Minimum HRS/RLWP sequence length (Default: 0)
    Only the HRS and RLWP methods filter sequences in this stage; the other paralog-handling methods filter sequences in Stage 2.
  • -aln_min_length <INT>
    Minimum alignment length (Default: 4)
  • -aln_min_sample <INT>
    Minimum sample number per alignment (Default: 0)

(4) Gene tree construction:

  • -gene_tree <1/2>
    Gene tree builder: 1=IQ-TREE, 2=FastTree (Default: 1)
  • -gene_tree_bb <INT>
    Bootstrap value (Default: 1000)
  • -trim_tool <1/2>
    Trimming tool: 1=trimAl, 2=HMMCleaner (Default: 1)
Step 3: Configure other parameters

See the full parameters for running hybsuite stage3 here for more customizable settings.

Example command:

# Recommended command
# (choose -PH based on your data size and time budget; trying all methods,
# including "4567b" with ParaGone, is still suggested)
hybsuite stage3 \
  -input_list <FILE> \
  -eas_dir ./01-Assembled_data \
  -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
  -t <FILE> \
  -output_dir <DIR> \
  -PH 1234567a \
  -nt 8 -process 5

# Command for the example dataset (Angiosperms353)
hybsuite stage3 \
  -input_list ./Input_list.txt \
  -eas_dir ./01-Assembled_data \
  -paralogs_dir ./02-All_paralogs/03-Filtered_paralogs \
  -t Target_file_Angiosperm353.fasta \
  -output_dir ./Output \
  -PH 1234567a \
  -mafft_algorithm linsi \
  -nt 8 -process 5
Step 4: Check output files

After running stage 3, you can check the output files for your analysis.
See the output files for hybsuite stage3 here for more details.


Run hybsuite stage4

Purpose: run HybSuite Stage 4: "Species tree inference": infer species trees using concatenation-based and/or coalescent-based approaches.

Step 1: Configure mandatory parameters:

(1) Configure input file parameters

  • -input_list <FILE>
    Sample list file (same as in Stages 1-3; see here).
  • -aln_dir <DIR>
    Path to the 06-Final_alignments directory containing the final alignments generated in Stage 3.

(2) Configure output file parameters

  • -output_dir <DIR>
    Directory to store output files. Using the same directory across stages is recommended for convenience, though different directories are allowed.
Step 2: Configure essential parameters:

The following parameters are optional but essential for this stage; check whether to configure them based on your analysis.

  • -PH <1-7|a|b|all>
    Select alignments from one or more specific paralog-handling methods. Keep this consistent with the setting used in Stage 3. Default: 1.
  • -sp_tree <1-6|all>
    Species tree inference methods. Default: 1.
    1=IQ-TREE, 2=RAxML, 3=RAxML-NG, 4=ASTRAL-IV, 5=wASTRAL, 6=ASTRAL-Pro
  • -nt <INT|AUTO>
    Thread setting.

Gene tree construction (for coalescent methods):

  • -gene_tree <1/2>
    Gene tree builder. Default: 1.
  • -gene_tree_bb <INT>
    Bootstrap value. Default: 1000.
  • -collapse_threshold <VALUE>
    Collapse weakly supported branches. Default: 0.

Concatenation-based methods:

  • -run_modeltest_ng <TRUE/FALSE>
    Run ModelTest-NG. Default: TRUE.
  • -iqtree_bb <INT>
    IQ-TREE bootstrap replicates. Default: 1000.
  • -iqtree_partition <TRUE/FALSE>
    Use partition models. Default: TRUE.

Coalescent-based methods:

  • -astral_r <INT>
    ASTRAL-IV search rounds. Default: 4.
  • -wastral_mode <1-4>
    wASTRAL weighting mode. Default: 1.
  • -run_phyparts <TRUE/FALSE>
    Run PhyParts analysis. Default: TRUE.

Example command:

# Concatenation-based method (IQ-TREE)
hybsuite stage4 \
  -input_list <FILE> \
  -aln_dir <output_dir>/06-Final_alignments \
  -output_dir <DIR> \
  -PH 1234567a \
  -sp_tree 1 \
  -nt 8

# Coalescent-based method (ASTRAL-IV)
hybsuite stage4 \
  -input_list sample_list.txt \
  -aln_dir <output_dir>/06-Final_alignments \
  -output_dir <DIR> \
  -PH 1a \
  -sp_tree 4 \
  -run_phyparts TRUE \
  -nt 8 -process 5

Run hybsuite full_pipeline

Purpose: run stages 1-4 sequentially in a single command.

Step 1: Configure mandatory parameters:
  • -input_list <FILE>
    Sample list file.
  • -t <FILE>
    Target file.
  • -output_dir <DIR>
    Output directory.

-input_data is required when including user-provided data.

Step 2: Configure essential parameters:

Methods control:

  • -PH <1-7|a|b|all>
    Paralog handling methods. Default: 1a.
  • -sp_tree <1-6|all>
    Species tree inference methods. Default: 1.

Threading:

  • -nt <INT|AUTO>
    Global thread setting. Default: 1.
  • -process <INT|all>
    Parallel processing. Default: 1.

Workflow control:

  • -skip_stage <1|12|123>
    Skip completed stages. Default: none.
  • -run_to_stage <1|2|3|4>
    Stop at specific stage. Default: 4.

Logging control:

  • -log_mode <simple|cmd|full>
    Logging verbosity. Default: cmd.
    • simple: Only log key information
    • cmd: Log key information + command history
    • full: Log detailed information + command history

Example command:

# Complete pipeline with all paralog-handling methods
hybsuite full_pipeline \
  -input_list sample_list.txt \
  -input_data Input_data \
  -t Angiosperms353.fasta \
  -output_dir ./ \
  -PH 1234567ab \
  -sp_tree 12345 \
  -seqs_min_length 100 \
  -aln_min_sample 4 \
  -nt AUTO -process 10

# Pipeline with existing NGS data, skip stage 1
hybsuite full_pipeline \
  -input_list sample_list.txt \
  -NGS_dir ./NGS_dataset \
  -t Angiosperms353.fasta \
  -output_dir ./ \
  -skip_stage 1 \
  -PH 1a \
  -sp_tree 14 \
  -nt 8 -process 5
Step 3: Configure other parameters

See the full parameters for running hybsuite full_pipeline here for more customizable settings.


4. Tips for rerunning the pipeline

If the initial results of HybSuite are not satisfactory, you can rerun the pipeline with modified parameters. The following strategies can help improve results or reduce runtime. These methods can be used individually or combined.

(1) Remove or add samples

You can remove or add samples by editing the sample_list.txt file and rerunning HybSuite.

If the same -output_dir is used, HybSuite automatically detects completed samples and skips the following steps:

  • Public data downloading (stage 1)
  • Raw reads trimming (stage 1)
  • Data assembly (stage 2)

This allows you to update the dataset without repeating previously completed computations.


(2) Reuse existing intermediate data

HybSuite allows reuse of intermediate results from previous runs.

Use the following options:

  • -NGS_dir
    Use the NGS_dataset directory generated by a previous run to skip data downloading and adapter trimming.

  • -eas_dir
    Use the 01-Assembled_data directory generated by a previous run to skip the data assembly step.

Reusing intermediate data can significantly reduce runtime when rerunning analyses.


(3) Adjust sequence filtering thresholds

You can improve dataset quality by adjusting filtering thresholds when rerunning the pipeline.

Common parameters include:

  • -seqs_min_length
    Minimum sequence length retained in stage 2.
  • -seqs_min_sample_coverage
    Minimum proportion of samples containing a sequence.
  • -aln_min_sample
    Minimum number of samples required for an alignment in stage 3.

Increasing these thresholds can improve alignment quality and downstream analyses.

(4) Steps control in concatenation-based and coalescent-based analysis

HybSuite allows selective execution of steps in both concatenation-based and coalescent-based analyses.

Concatenation-based analysis

Use -run_concatenated_step to specify which steps to run.

For example, if a concatenated alignment has already been generated in a previous run, you can skip the concatenation step and directly infer the species tree by setting:

-run_concatenated_step 2

Coalescent-based analysis

Use -run_coalescent_step to control the execution of coalescent-based analysis.

For example, if gene trees are already available from a previous run, you can skip step 1 and directly infer the species tree by setting:

-run_coalescent_step 234

6 - Installation

This page tells you how to install HybSuite step by step.


1. Install HybSuite via conda

(1) Prerequisites

  • Conda installation is required. If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.

(2) Step-by-step installation

To avoid dependency conflicts, creating a new conda environment for HybSuite is recommended:

conda create -n hybsuite

Then activate the newly created conda environment and install hybsuite directly from the yuxuanliu channel:

conda activate hybsuite
conda install yuxuanliu::hybsuite

Before installing hybsuite, you can edit your ~/.condarc file as follows to avoid channel-related installation issues:

channels:
  - conda-forge
  - bioconda
  - yuxuanliu
  - defaults

(3) Verification

After installation, you can check the help menu of HybSuite to confirm successful installation by running:

hybsuite -h

2. Install HybSuite manually

(1) Prerequisites

  • Conda installation is required. If you don’t already have conda installed, see here for instructions on installing Anaconda or Miniconda.

(2) Package installation

Clone the GitHub repository directly:

git clone https://github.com/Yuxuanliu-HZAU/HybSuite.git

(3) Verification

After installation, you can check the help menu of HybSuite to confirm successful installation by running:

bash <absolute or relative path to HybSuite.sh> -h

(4) Dependencies installation

The most convenient way to install all dependencies for HybSuite is to run the script HybSuite-master/Install_all_dependencies.sh.
Before running this script, activate your target conda environment.

conda activate <conda_environment_name>
bash HybSuite-master/Install_all_dependencies.sh

Method 2: Install dependencies manually

If some dependencies fail to install when running Install_all_dependencies.sh, it is advisable to install them manually. Follow these steps:

conda create -n <conda_environment_name>
conda activate <conda_environment_name>
conda install conda-forge::mamba -y
mamba install python=3.9.15 -y
mamba install bioconda::hybpiper -y
mamba install bioconda::paragone -y
mamba install bioconda::amas -y
mamba install bioconda::sra-tools -y
mamba install conda-forge::pigz -y
conda install conda-forge::plotly -y
mamba install bioconda::newick_utils -y
mamba install bioconda::mafft -y
mamba install bioconda::trimal -y
mamba install bioconda::iqtree -y
mamba install bioconda::raxml -y
mamba install bioconda::raxml-ng -y
mamba install bioconda::aster -y
mamba install r
pip install ete3
pip install PyQt5
pip install phylopypruner
pip install phykit
R
install.packages("phytools")
install.packages("ape")

3. Dependencies

7 - Output files


Output File Naming Conventions

Placeholder       Represents
<PH>              Any of the 7 orthology inference methods: HRS, RLWP, LS, MI, MO, RT, 1to1
<taxon>           Taxon name (e.g., <taxon1>, <taxon2>, etc.) from your sample list file
<prefix>          User-specified output prefix (via the -prefix option)
<locus_name>      Target sequence locus (e.g., <locus_name1>, <locus_name2>, etc.)

Each output file is explained in detail below.


Stage1 output

<NGS_dataset> (specified by -NGS_dir)

A directory containing next-generation raw sequencing data downloaded from public databases, existing raw reads provided by the user, and clean data produced by Trimmomatic-0.39.

<NGS_dataset>/
β”œβ”€β”€ 01-Downloaded_raw_data/
β”œβ”€β”€ 02-Downloaded_clean_data/
└── 03-My_clean_data/

<NGS_dataset> -> 01-Downloaded_raw_data

A directory containing next-generation raw sequencing data downloaded from public databases.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    β”œβ”€β”€ 01-Raw-reads_sra/
    └── 02-Raw-reads_fastq_gz/

<NGS_dataset> -> 01-Downloaded_raw_data -> 01-Raw-reads_sra

A directory containing raw sequencing data downloaded from NCBI in .sra format.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 01-Raw-reads_sra/
        β”œβ”€β”€ <taxon>.sra
        ...
  • <taxon>.sra: File with raw sequencing data in SRA format.

By default, all *.sra files in this directory will be removed after converting them into fastq format to save space, unless you specify the option -rm_sra as FALSE to keep them.

<NGS_dataset> -> 01-Downloaded_raw_data -> 02-Raw-reads_fastq_gz

A directory containing raw sequencing data in .fastq or .fastq.gz format.

<NGS_dataset>/
└── 01-Downloaded_raw_data/
    └── 02-Raw-reads_fastq_gz/
        β”œβ”€β”€ <taxon>.fastq.gz or <taxon>.fastq
        ...
  • <taxon>.fastq.gz or <taxon>.fastq:

If -download_format is set to fastq, pigz is not used to compress the original .fastq files, so this folder contains <taxon>.fastq.
If -download_format is set to fastq.gz (the default), pigz compresses the original .fastq files, so this folder contains <taxon>.fastq.gz.

<NGS_dataset> -> 02-Downloaded_clean_data

A directory containing sequencing data cleaned from downloaded public raw reads.

<NGS_dataset>/
└── 02-Downloaded_clean_data/
    β”œβ”€β”€ <taxon>_1_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_1_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_clean.single.fq.gz
    ...    
  • <taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
    Files with compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
  • <taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
    Files with compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
  • <taxon>_clean.single.fq.gz
    File with compressed cleaned sequencing data (single-end type) in fq.gz format (these files will be used for downstream analysis).

<NGS_dataset> -> 03-My_clean_data

A directory containing user-provided cleaned sequencing data or sequencing data cleaned from user-provided raw data.

<NGS_dataset>/
└── 03-My_clean_data/
    β”œβ”€β”€ <taxon>_1_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.paired.fq.gz
    β”œβ”€β”€ <taxon>_1_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_2_clean.unpaired.fq.gz
    β”œβ”€β”€ <taxon>_clean.single.fq.gz
    ...   
  • <taxon>_1_clean.paired.fq.gz & <taxon>_2_clean.paired.fq.gz
    Files with user-provided compressed cleaned and paired sequencing data (paired-end type) in fq.gz format (these files will be used for downstream analysis).
  • <taxon>_1_clean.unpaired.fq.gz & <taxon>_2_clean.unpaired.fq.gz
    Files with user-provided compressed cleaned and unpaired sequencing data (paired-end type) in fq.gz format.
  • <taxon>_clean.single.fq.gz
    File with user-provided compressed cleaned sequencing data (single-end type) in fq.gz format (these files will be used for downstream analysis).

Stage2 output

01-Assembled_data

A directory containing assembled sequence data produced by hybpiper assemble command in HybPiper.

01-Assembled_data/
β”œβ”€β”€ Assembled_data_namelist.txt    
β”œβ”€β”€ Old_assembled_data_namelist_<current_time>.log
β”œβ”€β”€ <taxon>/
    ...
  • Assembled_data_namelist.txt A file containing sample names used as input to run the hybpiper assemble command.
  • Old_assembled_data_namelist_<current_time>.log A file containing previous sample names used as input to run the hybpiper assemble command.
  • <taxon> More details can be found here.

02-All_paralogs

A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper, filtered paralogs, along with their paralog heatmaps and related statistical results.

02-All_paralogs/
β”œβ”€β”€ 01-Original_paralogs
β”œβ”€β”€ 02-Original_paralog_reports_and_heatmap
β”œβ”€β”€ 03-Filtered_paralogs
└── 04-Filtered_paralog_reports_and_heatmap

02-All_paralogs -> 01-Original_paralogs

A directory containing all original putative paralogs retrieved by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs/
└── 01-Original_paralogs/
    └── <locus_name>_paralogs_all.fasta
  • <locus_name>_paralogs_all.fasta: One FASTA file per locus, containing all putative paralog sequences recovered by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs -> 02-Original_paralog_reports_and_heatmap

A directory containing all original reports and heatmaps.

02-All_paralogs/
└── 02-Original_paralog_reports_and_heatmap/
    β”œβ”€β”€ Original_paralog_heatmap.png
    β”œβ”€β”€ Original_paralog_report.tsv
    β”œβ”€β”€ Original_recovered_seqs_length.tsv
    └── Original_recovery_heatmap.html
  • Original_paralog_heatmap.png
    A heatmap image file in PNG format, depicting the number of original putative paralog sequences for each locus/sample.
  • Original_paralog_report.tsv
    A TSV file recording the number of original putative paralog sequences for each locus/sample.
  • Original_recovered_seqs_length.tsv
    A TSV file recording the length of original recovered sequences for each locus/sample.
  • Original_recovery_heatmap.html
    An interactive HTML file for visualizing target locus recovery across all original paralogs (including both single-copy and multi-copy genes).

Here is an example recovery heatmap you can play with: it shows the recovery of Angiosperms353 (Johnson et al., 2019) loci from 10 Elaeagnaceae species in our example dataset.

  • The blue bars along the x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus is recovered in, respectively.
  • The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.

Now, let’s explore this interactive HTML file:

  • Choose the button “Sort by” as “Descending” to sort samples and loci on the heatmap from high to low recovery.
  • Click on the “Plus” (+) and “Minus” (-) icons in the upper right corner to zoom in and out of the heatmap.
  • Click on the “AutoScale” icon in the upper right corner to auto-scale the heatmap.
  • Click the “Camera” (πŸ“·) icon in the upper right corner to download the current heatmap view as a PNG file.

02-All_paralogs -> 03-Filtered_paralogs

A directory containing the putative paralogs that remain after filtering the original sequences retrieved by the hybpiper paralog_retriever command in HybPiper.

02-All_paralogs/
└── 03-Filtered_paralogs/
    └── <locus_name>_paralogs_all.fasta

02-All_paralogs -> 04-Filtered_paralog_reports_and_heatmap

A directory containing all filtered reports and heatmaps.

02-All_paralogs/
└── 04-Filtered_paralog_reports_and_heatmap/
    ├── Filtered_paralog_heatmap.png
    ├── Filtered_paralog_report.tsv
    ├── Filtered_recovered_seqs_length.tsv
    └── Filtered_recovery_heatmap.html
  • Filtered_paralog_heatmap.png
    A heatmap image file in PNG format, depicting the number of filtered putative paralog sequences for each locus/sample.
  • Filtered_paralog_report.tsv
    A TSV file recording the number of filtered putative paralog sequences for each locus/sample.
  • Filtered_recovered_seqs_length.tsv
    A TSV file recording the length of filtered recovered sequences for each locus/sample.
  • Filtered_recovery_heatmap.html
    An interactive HTML file for visualizing target locus recovery across all filtered paralogs (including both single-copy and multi-copy genes).
    The layout is identical to that of Original_recovery_heatmap.html, but it reflects the occupancy of the filtered sequences rather than the original ones.

Stage3 output

03-Paralog_handling

03-Paralog_handling/
├── HRS/ (optional)
├── RLWP/ (optional)
├── ParaGone/ (optional)
└── PhyloPyPruner/ (optional)
  • Different arguments for the option -PH will lead to different subdirectories in this output folder:

    • HRS/: created when -PH includes number 1 (the user applies the HRS orthology inference method)
    • RLWP/: created when -PH includes number 2 (the user applies the RLWP orthology inference method)
    • PhyloPyPruner/: created when -PH includes any of the numbers 4, 5, 6, and 7 (the MI/MO/RT/1to1 orthology inference methods) together with “a” (the default, running PhyloPyPruner), or includes the number “3” (running PhyloPyPruner directly to carry out the LS method). More details about these orthology inference methods can be found here
    • ParaGone/: created when -PH includes any of the numbers 4, 5, 6, and 7 (the MI/MO/RT/1to1 orthology inference methods) together with “b” (running ParaGone rather than PhyloPyPruner). More details about these orthology inference methods can be found here
  • For example:

    • Using -PH 12 will create HRS and RLWP directories.
    • Using -PH 1234b will create HRS, RLWP, and ParaGone directories.

03-Paralog_handling -> HRS

A directory containing original and filtered HRS sequences, including the recovery heatmap and filtering reports.

03-Paralog_handling/
└── HRS/
    ├── 01-Original_HRS_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_HRS_sequences_reports_and_heatmap
    │   ├── Original_HRS_heatmap.png
    │   └── Original_HRS_seq_lengths.tsv
    ├── 03-Filtered_HRS_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_HRS_sequences_reports_and_heatmap
        ├── Filtered_HRS_heatmap.png
        ├── Filtered_HRS_seq_lengths.tsv
        ├── Removed_HRS_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_HRS_sequences
  • <locus_name>.FNA
    Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as HRS sequences in the following).

Notes:

  • In the HybSuite pipeline, supercontigs (containing both exons and introns) are automatically retrieved for downstream analysis. HybSuite doesn’t support retrieving only introns or only exons.
  • Since downstream analysis requires DNA sequences, only DNA sequences are retrieved; protein sequences are not supported in the next stage.
02-Original_HRS_sequences_reports_and_heatmap
  • Original_HRS_heatmap.png
    A heatmap image file in PNG format, depicting the length of the original HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Original_HRS_seq_lengths.tsv
    A TSV file recording all original HRS sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
03-Filtered_HRS_sequences
04-Filtered_HRS_sequences_reports_and_heatmap
  • Filtered_HRS_heatmap.png
    A heatmap image file in PNG format, depicting the length of the filtered HRS sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Filtered_HRS_seq_lengths.tsv
    A TSV file recording all filtered HRS sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
  • Removed_HRS_seqs_with_low_length_info.tsv
    A TSV file recording the information of the HRS sequences with low bp length/length ratio that have been filtered out from the dataset.
  • Removed_samples_with_low_locus_coverage_info.tsv
    A TSV file recording the information of the samples with low locus coverage that have been filtered out from the dataset.
  • Removed_loci_with_low_sample_coverage_info.tsv
    A TSV file recording the information of the loci with low sample coverage that have been filtered out from the dataset.
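To illustrate the kind of filtering these reports document, here is a minimal sketch of a length-ratio filter. The function name and the threshold value are illustrative, not HybSuite defaults:

```python
def filter_by_length(seq_lengths, mean_target_lengths, min_ratio=0.3):
    """Split sequences into kept and removed by length ratio.

    seq_lengths: {(sample, locus): bp length of the recovered sequence}
    mean_target_lengths: {locus: mean bp length in the target file}
    min_ratio: illustrative threshold on length / mean target length
    """
    kept, removed = {}, {}
    for (sample, locus), length in seq_lengths.items():
        ratio = length / mean_target_lengths[locus]
        (kept if ratio >= min_ratio else removed)[(sample, locus)] = ratio
    return kept, removed

seqs = {("sp1", "4757"): 900, ("sp2", "4757"): 120}
targets = {"4757": 1000}
kept, removed = filter_by_length(seqs, targets)
print(sorted(removed))  # [('sp2', '4757')] -- ratio 0.12 < 0.3
```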

03-Paralog_handling -> RLWP

A directory containing original and filtered RLWP sequences, including the recovery heatmap and filtering reports.

03-Paralog_handling/
└── RLWP/
    ├── 01-Original_RLWP_sequences
    │   └── <locus_name>.FNA
    ├── 02-Original_RLWP_sequences_reports_and_heatmap
    │   ├── Original_RLWP_heatmap.png
    │   └── Original_RLWP_seq_lengths.tsv
    ├── 03-Filtered_RLWP_sequences
    │   └── <locus_name>.FNA
    └── 04-Filtered_RLWP_sequences_reports_and_heatmap
        ├── Filtered_RLWP_heatmap.png
        ├── Filtered_RLWP_seq_lengths.tsv
        ├── Removed_RLWP_seqs_with_low_length_info.tsv
        ├── Removed_samples_with_low_locus_coverage_info.tsv
        └── Removed_loci_with_low_sample_coverage_info.tsv
01-Original_RLWP_sequences
  • <locus_name>.FNA
    Files with retrieved sequences in FASTA format, produced by hybpiper retrieve_sequences (referred to as RLWP sequences in the following).

Notes:

  • In the HybSuite pipeline, supercontigs (containing both exons and introns) are automatically retrieved for downstream analysis. HybSuite doesn’t support retrieving only introns or only exons.
  • Since downstream analysis requires DNA sequences, only DNA sequences are retrieved; protein sequences are not supported in the next stage.
02-Original_RLWP_sequences_reports_and_heatmap
  • Original_RLWP_heatmap.png
    A heatmap image file in PNG format, depicting the length of the original RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Original_RLWP_seq_lengths.tsv
    A TSV file recording all original RLWP sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
03-Filtered_RLWP_sequences
04-Filtered_RLWP_sequences_reports_and_heatmap
  • Filtered_RLWP_heatmap.png
    A heatmap image file in PNG format, depicting the length of the filtered RLWP sequences for each gene and sample, relative to the mean length (default setting; users can customize by running plot_recovery_heatmap.py) of the sequences in the target file, produced by plot_recovery_heatmap.py in HybSuite.
  • Filtered_RLWP_seq_lengths.tsv
    A TSV file recording all filtered RLWP sequences’ bp length, length ratio relative to the maximum, and mean length of each locus’ sequences in the target file.
  • Removed_RLWP_seqs_with_low_length_info.tsv
    A TSV file recording the information of the RLWP sequences with low bp length/length ratio that have been filtered out from the dataset.
  • Removed_samples_with_low_locus_coverage_info.tsv
    A TSV file recording the information of the samples with low locus coverage that have been filtered out from the dataset.
  • Removed_loci_with_low_sample_coverage_info.tsv
    A TSV file recording the information of the loci with low sample coverage that have been filtered out from the dataset.

03-Paralog_handling -> ParaGone

03-Paralog_handling/
└── ParaGone/
    ├── 00_logs_and_reports
    ...
    ├── 28_RT_final_alignments_trimmed
    └── HybSuite_1to1_final_alignments
  • From 00_logs_and_reports to 28_RT_final_alignments_trimmed: More details about these output folders can be found on this wiki page of ParaGone.
  • If the user sets the -paragone_keep_files option in HybSuite to TRUE, the intermediate folders from 01_input_paralog_fasta to 22_RT_stripped_names are kept; if it is set to FALSE, these intermediate folders are removed.
  • HybSuite_1to1_final_alignments: A directory containing orthology group alignments produced via the 1to1 algorithm, which were retrieved from results produced by ParaGone.

03-Paralog_handling -> PhyloPyPruner

03-Paralog_handling/
└── PhyloPyPruner/
    ├── Input
    ├── Output_LS
    ├── Output_MI
    ├── Output_MO
    ├── Output_RT
    └── Output_1to1
  • Input
    A directory containing trimmed alignments of each locus and their gene trees (input files for running PhyloPyPruner).
  • <locus_name>_paralogs_all.aln.trimmed.fasta
    The trimmed alignment of locus <locus_name> from 02-All_paralogs/03-Filtered_paralogs/<locus_name>_paralogs_all.fasta, generated by MAFFT and TrimAl.
  • <locus_name>_paralogs_all.aln.trimmed.fasta.tre
    The gene tree of locus <locus_name>, constructed by FastTree.
  • Output_<PH>
    A directory containing PhyloPyPruner output files for the <PH> algorithm (<PH> includes LS, MI, MO, RT, 1to1; more details can be found here).

04-Alignments

A directory containing alignments produced by the different paralog-handling methods specified by the user. These alignments are subsequently trimmed and filtered in stage 3.

04-Alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.fasta
  • <PH>/<ortholog_group_name>.*.aln.fasta
    The alignments inferred via the <PH> paralog-handling method and aligned using MAFFT.

NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.


05-Trimmed_alignments

A directory containing trimmed alignments inferred via different <PH> paralog-handling methods.

05-Trimmed_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
  • <PH>/<ortholog_group_name>.*.aln.trimmed.fasta
    The alignments which are inferred via the <PH> paralog-handling method, aligned using MAFFT, and trimmed via TrimAl or cleaned via HMMCleaner.

NOTE:
<ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, 4757_1 and 4757_2 are inferred ortholog group names from locus 4757.


06-Final_alignments

A directory containing final <PH> orthogroup alignments ready for downstream species tree inference.

06-Final_alignments/
└── <PH>/
    └── <ortholog_group_name>.*.aln.trimmed.fasta
  • <PH>/<ortholog_group_name>.*.aln.trimmed.fasta Final <PH> orthogroup alignments for downstream species tree inference (stage4).

Stage4 output

07-Concatenated_analysis

A directory containing concatenated analysis results.

07-Concatenated_analysis/
└── <PH>/
    ├── 01-Supermatrix
    │   ├── partition.txt
    │   └── <prefix>_<PH>.fasta
    └── 02-Species_tree
        ├── IQ-TREE
        │   └── IQ-TREE*
        ├── RAxML
        │   └── RAxML*
        ├── RAxML-NG
        │   └── RAxML-NG*
        ├── <prefix>_<PH>_ModelTest_NG.txt.tree
        ├── <prefix>_<PH>_ModelTest_NG.txt.log
        ├── <prefix>_<PH>_ModelTest_NG.txt.out
        └── <prefix>_<PH>_ModelTest_NG.txt.ckp

07-Concatenated_analysis -> <PH> -> 01-Supermatrix

A directory containing the supermatrix concatenated from <PH> orthogroup alignments and the partition file.

  • <prefix>_<PH>.fasta
    The concatenated supermatrix file for orthology groups inferred by the <PH> method.
  • partition.txt
    The partition file for concatenation.
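To illustrate what these two files contain, here is a minimal sketch of supermatrix concatenation with a RAxML-style partition file. HybSuite performs this step internally; the data and the function name are illustrative:

```python
def concatenate(alignments):
    """alignments: {locus: {sample: aligned_seq}} -> (supermatrix, partitions).

    Samples missing a locus are padded with gap characters, and each locus
    contributes one "DNA, <locus> = start-end" partition line.
    """
    samples = sorted({s for aln in alignments.values() for s in aln})
    supermatrix = {s: "" for s in samples}
    partitions, start = [], 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        length = len(next(iter(aln.values())))
        for s in samples:
            supermatrix[s] += aln.get(s, "-" * length)  # pad missing samples
        partitions.append(f"DNA, {locus} = {start}-{start + length - 1}")
        start += length
    return supermatrix, partitions

aln = {"4757": {"sp1": "ATGC", "sp2": "ATGG"},
       "5064": {"sp1": "GG--TT"}}
matrix, parts = concatenate(aln)
print(matrix["sp2"])  # ATGG------
print(parts)          # ['DNA, 4757 = 1-4', 'DNA, 5064 = 5-10']
```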

07-Concatenated_analysis -> <PH> -> 02-Species_tree

IQ-TREE/

A directory containing IQ-TREE results and final rooted trees (created only when IQ-TREE is applied by setting -sp_tree 1).

  • IQ-TREE_<prefix>_<PH>.*
    IQ-TREE intermediate output files.
  • IQ-TREE_<prefix>_<PH>.treefile
    The tree file with branch lengths and bootstrap values, generated by IQ-TREE.
  • IQ-TREE_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from IQ-TREE results.
RAxML/

A directory containing RAxML results and final rooted trees (created only when RAxML is applied by setting -sp_tree 2).

  • RAxML_*.<prefix>_<PH>.*
    RAxML intermediate output files.
  • RAxML_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from RAxML results.
RAxML-NG/

A directory containing RAxML-NG results and final rooted trees (created only when RAxML-NG is applied by setting -sp_tree 3).

  • RAxML-NG_<prefix>_<PH>.raxml.*
    RAxML-NG intermediate output files.
  • RAxML-NG_<prefix>_<PH>.rr.tre
    The rerooted tree file with branch lengths and bootstrap values from RAxML-NG results.

07-Concatenated_analysis -> <PH> -> 02-Species_tree -> <prefix>_<PH>_ModelTest_NG.txt.*

The output files generated by ModelTest-NG.


08-Coalescent_analysis

A directory containing coalescent analysis results.

08-Coalescent_analysis/
├── <PH>
│   ├── 01-Gene_trees
│   ├── 02-Combined_gene_trees
│   ├── 03-Species_tree
│   ├── 04-Rerooted_gene_trees
│   └── 05-PhyParts_PieCharts
└── ASTRAL-Pro
    ├── 01-Gene_trees
    ├── 02-Combined_gene_trees
    ├── 03-Species_tree
    └── 04-Rerooted_gene_trees

08-Coalescent_analysis -> <PH>

A directory containing coalescent-based phylogenetic tree results for a specific dataset generated by the <PH> paralog-handling method.

08-Coalescent_analysis -> <PH> -> 01-Gene_trees

A directory containing gene trees inferred from final <PH> alignments.

  • <ortholog_group_name>.tre: The gene tree for locus/orthogroup <ortholog_group_name>.

NOTE: <ortholog_group_name> is the name of the ortholog group inferred by the <PH> algorithm. For example, ortholog group names 4757_1 and 4757_2 are inferred from locus 4757.

08-Coalescent_analysis -> <PH> -> 02-Combined_gene_trees

A directory containing combined gene trees generated from <PH> alignments.

  • Combined_gene_trees.tre: File containing all gene trees combined into a single file.
  • Combined_gene_trees.tre.collapsed: File containing all gene trees with low-support branches collapsed.
08-Coalescent_analysis -> <PH> -> 03-Species_tree

A directory containing species trees inferred from <PH> alignments.

ASTRAL-IV/

A directory containing the final species tree for the <PH> dataset, generated by ASTRAL-IV.

  • ASTRAL-IV_<prefix>_<PH>.log
    The log file generated by ASTRAL-IV.
  • ASTRAL-IV_<prefix>_<PH>.tre
    The species tree inferred by ASTRAL-IV from the combined gene trees.
  • ASTRAL-IV_<prefix>_<PH>.bootstrap.tre
    The species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III, following the ASTER protocol.
  • ASTRAL-IV_<prefix>_<PH>.bootstrap.rr.tre
    The rerooted species tree generated by ASTRAL-IV and bootstrapped using ASTRAL-III.
  • ASTRAL-III_LPP.log
    The ASTRAL-III log file which documents the bootstrapping process performed by ASTRAL-III.
wASTRAL/

A directory containing the final species tree for the <PH> dataset, generated by wASTRAL.

  • wASTRAL_<prefix>_<PH>.tre
    The species tree inferred by wASTRAL from the combined gene trees.
  • wASTRAL_<prefix>_<PH>.log
    The log file generated by wASTRAL.
  • wASTRAL_<prefix>_<PH>.rr.tre
    The rerooted species tree generated by wASTRAL.
08-Coalescent_analysis -> <PH> -> 04-Rerooted_gene_trees

A directory containing rerooted gene trees from the <PH> dataset.

  • <ortholog_group_name>.rr.tre
    File containing the rerooted gene tree for <ortholog_group_name> alignments in the <PH> dataset, generated using Phyx or the MAD method.
08-Coalescent_analysis -> <PH> -> 05-PhyParts_PieCharts

A directory containing phylogenetic concordance analysis results using rerooted gene trees and species trees.

ASTRAL-IV/

A directory containing ASTRAL-IV species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run ASTRAL-IV by setting -sp_tree 4).

  • ASTRAL_PhyParts.*
    Files containing the PhyParts output (more details can be found here).
  • ASTRAL_PhyPartsPieCharts_<prefix>_<PH>.svg
    Visualization of concordance and conflict between gene trees and the species tree, generated by our modified_phypartspiecharts.py script.
wASTRAL/

A directory containing wASTRAL species tree conflict assessment using rerooted gene trees from directory 04-Rerooted_gene_trees (created only when users choose to run wASTRAL by setting -sp_tree 5).

  • wASTRAL_<prefix>_<PH>.tre The species tree inferred by wASTRAL from rerooted <PH> gene trees.
  • wASTRAL_<prefix>_<PH>_sorted_rr.tre
    The final rerooted species tree, rerooted by Phyx and sorted by Newick_Utilities.

Comprehensive output

hybsuite_logs

A directory containing the comprehensive log file generated by HybSuite.

hybsuite_logs/
└── hybsuite_<current_time>.log   
  • hybsuite_<current_time>.log
    The log file produced when running the HybSuite pipeline (running the extension tools will not produce this logfile).

hybsuite_checklists

A directory containing checklist files, including species checklists and locus checklists.

hybsuite_checklists/
├── All_Spname_list.txt
├── My_Spname.txt
├── Outgroup.txt
├── Pre-assembled_Spname.txt
├── Public_Spname.txt
├── Public_Spname_SRR.txt
├── Recovered_locus_num_for_samples.tsv
├── Recovered_sample_num_for_loci.tsv
└── Ref_gene_name_list.txt
  • All_Spname_list.txt
    A file containing all sample names from your research.
  • My_Spname.txt
    A file containing all sample names for user-provided raw data in your research.
  • Outgroup.txt
    A file containing all outgroup taxa specified by the user.
  • Pre-assembled_Spname.txt
    A file containing the names of all pre-assembled samples specified by the user.
  • Public_Spname.txt
    A file containing all sample names whose Next Generation Sequencing (NGS) raw data was downloaded from NCBI.
  • Public_Spname_SRR.txt
    A file containing all Sequence Read Archive (SRA) IDs used to download NGS raw data from NCBI. These SRA IDs correspond to the sample names listed in Public_Spname.txt.
  • Recovered_locus_num_for_samples.tsv
    A file recording the number of loci recovered by HybPiper for each sample.
  • Recovered_sample_num_for_loci.tsv
    A file recording the number of samples in which each locus was recovered by HybPiper.
  • Ref_gene_name_list.txt
    A file containing the names of all genes in the target sequences (specified by the -t option).

hybsuite_reports

A directory containing comprehensive statistical summaries of the results generated by the pipeline.

hybsuite_reports/
├── Alignments_stats
│   ├── <PH>-01_Alignments_stats_AMAS.tsv
│   ├── <PH>-02_Trimmed_alignments_stats_AMAS.tsv
│   ├── <PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
│   ├── <PH>-04_Removed_alignments_with_length_less_than_4.txt
│   ├── <PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
│   ├── <PH>-06_Final_alignments_list.txt
│   └── <PH>-07_Final_alignments_stats_AMAS.tsv
└── Supermatrix_stats
    └── <PH>-Supermatrix_stats_AMAS.tsv

hybsuite_reports -> Alignments_stats

  • <PH>-01_Alignments_stats_AMAS.tsv
    Summary table of orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).

  • <PH>-02_Trimmed_alignments_stats_AMAS.tsv
    Summary table of trimmed orthogroup alignments inferred via the <PH> paralog-handling method (generated by AMAS.py).

  • <PH>-03_Removed_alignments_without_parsimony_informative_sites.txt
    List of alignments without parsimony-informative sites. These alignments are excluded from downstream species tree inference.

  • <PH>-04_Removed_alignments_with_length_less_than_4.txt
    List of alignments shorter than 4 bp. These alignments are excluded from downstream species tree inference.

  • <PH>-05_Removed_alignments_with_sample_number_less_than_5.txt
    List of alignments with fewer than 5 samples. These alignments are excluded from downstream species tree inference.

  • <PH>-06_Final_alignments_list.txt
    List of final <PH> alignments selected for downstream species tree inference.

  • <PH>-07_Final_alignments_stats_AMAS.tsv
    Summary table of final <PH> alignments for downstream species tree inference (generated by AMAS.py).

Filtering process: alignments without parsimony-informative sites, with very short length, or with too few samples are removed.
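These criteria can be sketched as follows; the thresholds follow the file names above, and the parsimony-informative test uses the standard definition (at least two states each present in at least two sequences):

```python
def is_informative_site(column):
    """A site is parsimony-informative if at least two states
    each occur in at least two sequences (gaps/missing ignored)."""
    counts = {}
    for base in column:
        if base not in "-?":
            counts[base] = counts.get(base, 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2

def keep_alignment(seqs, min_len=4, min_samples=5):
    """seqs: {sample: aligned_seq}; True if the alignment passes all filters."""
    length = len(next(iter(seqs.values())))
    if length < min_len or len(seqs) < min_samples:
        return False
    # require at least one parsimony-informative site
    return any(is_informative_site([s[i] for s in seqs.values()])
               for i in range(length))

aln = {"sp1": "AAAA", "sp2": "AAAA", "sp3": "AATA",
       "sp4": "AATA", "sp5": "AACA"}
print(keep_alignment(aln))  # True (site 3 is parsimony-informative)
```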

hybsuite_reports -> Supermatrix_stats

  • <PH>-Supermatrix_stats_AMAS.tsv
    Summary table of final <PH> supermatrix for downstream concatenation-based species tree inference (generated by AMAS.py).