Extension tools

Apart from the main pipeline, we also offer some extention tools for results visualization and statistic analysis. This page tells you how to use them!


1. plot_paralog_heatmap.py

(1) Overview

Paralogs are homologous genes that arise from gene duplication within the same species. plot_paralog_heatmap.py is a Python script for analyzing and visualizing paralog distribution patterns across samples and loci. As a part of the HybSuite toolkit, it processes unaligned FASTA file for each locus to:

  • Count paralogous sequences for each sample at each locus and generate a TSV format data table recording the counts.
  • Generate heatmaps to visualize paralog distribution patterns, with auto-adjusted dimensions based on sample and locus counts.
  • Support multi-threading for improved efficiency.

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install pandas seaborn matplotlib numpy
  • Key differences from HybPiper’s paralog heatmap plotting function paralog_retriever:

    • (1) Input format

      • hybpiper paralog_retriever: Requires the sample folders generated by hybpiper assemble.
      • This script: Accepts a folder containing FASTA files, making it applicable to a wider range of datasets (e.g., those including pre-assembled sequences).
    • (2) Visualization features

      • Customizable heatmap colors.
      • Option to display the paralog count for each sample–locus cell directly on the heatmap.
      • Generates higher-quality, more publication-ready figures.

(3) Input file requirements

This script processes the following input files:

  • (1) Input Directory (required, specified by -i/--input_dir):
    A directory containing multiple FASTA format files, each FASTA file represents a locus and contains sequences from multiple samples. Files should be named <locus_name>.fasta, <locus_name>_paralogs.fasta or <locus_name>_paralogs_all.fasta.

For example:

>Sample1
ATGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sample2
ATGCTAGCTATCGATCGATCGATCGATCGA
>Sample3
ATGCTCGATCGATCGATCGATCGATCGATC

NOTE:
If a sample has only one single sequence in one locus, then this sequence is the orthology one. If a single sample have multiple sequences in one locus, then these seuqences are putative paralogs.

The paralog sequence FASTA files retrieved by hybpiper paralog_retriever can directly be the input files for this script. For example, the 5942_paralogs_all.fasta in our dataset:

>Elaeagnus_angustifolia.main NODE_1_length_1285_cov_267.250828,Elaeagnus-pungens-Elaeagnus_pungens,0,201,98.51,(1),312,915
ACCTTCCTTGACCTCAAGACCGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATCATTACAGACCTGCAGAACAAACGCCGTGGCACCGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGCGGAAAGAAGCCCTACCCACAAAAGAAAACGGGTAGGGCTCGACAGGGGTCCAAGAGAACTCCACTCCGGCCAGGTGGAGGAGTCGTCTTTGGGCCTAAGCCCAGAGATTGGAGCATCAAGATCAATAGAAAGGAGAAAAGGTTGGCGATTTCGACAGCAATGTCTAGTGCAGCTGCGAATACGATCGTGGTGGAGGATTTTTGGGACAATATGGATAAACCCAGGACGAAGGATTTTATAGCTGCTATGAAGAGGTGGGGTTTAAATCCACCGGGAGAGAAAGCTATGTTTATGATGGACGAAATTTCGGATAACGTGAGGCTTTCAAGTAGAAATATTCCGAAAGTGAAGGTTTTGACCCCGAGGACTTTGAATTTGTTTGATATTTTAAATGCGGATAAGTTGGTGCTTACCCCTGCTGCTGTGGATTACTTGAATGGACGTTATGGTGTTAATTATGAGGGTGAGAGT
>Elaeagnus_angustifolia.0 NODE_2_length_1266_cov_276.023549,Elaeagnus-pungens-Elaeagnus_pungens,2,201,89.95,(-1),301,898
CTTGATCTCAAAACAGCACCACCCGAAACTGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCGTGGGACTGCCTCAACCCTAACCCGTGGTGAGGTTAGAGGTGGTGGGAAAAAACCTTACCCACAAAAGAAAACGGGTCGGGCCCGACAAGGGTCCAAGAGAACTCCACTCCGTCCCGGTGGAGGTGTCGTTTTTGGTCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTGGTGGAGGATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGAGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCGGAAGAGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCTAATTATGAGGGGGAGAAT
>Elaeagnus_macrophylla.main NODE_2_length_1269_cov_183.823826,Elaeagnus-pungens-Elaeagnus_pungens,0,201,88.06,(1),297,900
ACATACCTCGATCTCAAAACAGCACCACCCGAAACAGCTCGAGCCGTCGTTCACCGAGCCATAATCACAGACCTCCAAAACAAACGCCCTGGGACTGCCTCAACCCTTACCCGCGGTGAGGTTAGAGGTGGTGGAAAGAAACCTTACCCACAAAAGAAAACGGGTCGCGCTCGACAAGGGTCAAAAAGAACTCCACTCCGTCCAGGTGGAGGTGTCGTTTTTGGGCCTAAGCCCAGAGATTGGACCATCAAGATCAATAGGAAGGAAAAGAGGTTGGCAATTTCGACAGCAATGGTTAGTGCTGCTACGAATACGATTGTAGTGGAGCATTTTGGGGACAAGTTTGAGAAACCCAAGACGAAGGGGTTCATAGAGGCAATGAAGAGGTGGGGTTTGGACCCACCTGAAGTGAAAGCTATGTTTTTGATGGAGGAGATATCTGATAATGTGAGGCTTTCGAGTAGAAATGTACCAAAAGTGAAGGTTTTGACACCAAGGACTTTGAATTTGTTTGATATTTTGAATGCTGATAAGTTGATTCTTTCCCCTGCTACTGTGGATTACTTGAATGCTCGATATGGGGCAAATTATGAGGGTGAGAAT
...
  • (2) Species List File: (Optional, specified by -species_list)
    If you want to analyze only specific species, you can provide a species list file:
Sample1
Sample2
Sample3

(4) Basic usage

python plot_paralog_heatmap.py \
    -i <input_dir> \
    -opr <counts.tsv> \
    -oph <heatmap.png> \
    [options] ...
  • Required parameters in basic usage
    • -i/--input_dir
      Directory containing FASTA files with paralogous sequences (formatted as <locus_name>_paralogs.fasta)
    • At least one output option must be specified:
      • -opr, --output_paralog_report
        Generate a TSV file containing paralog counts
      • -oph, --output_paralog_heatmap
        Generate a heatmap visualization (format determined by file extension)

(5) Full parameters:

General options:
  -t THREADS, --threads THREADS
                        Number of threads to use for processing (default: 1)
  --species_list SPECIES_LIST
                        File containing list of species to include in the analysis (one species per line)
  --output_species_list OUTPUT_SPECIES_LIST
                        Output file to save the list of processed species

Heatmap customization options:  
  --dpi DPI             DPI (dots per inch) for output image (default: 300)
  --fig_length FIG_LENGTH
                        Figure length in inches (default: auto-calculated based on number of loci)
  --fig_height FIG_HEIGHT
                        Figure height in inches (default: auto-calculated based on number of species)
  --sample_font SAMPLE_FONT
                        Font size for sample labels in points (default: 10)
  --gene_font GENE_FONT
                        Font size for gene labels in points (default: 10)
  --hide_xlabels        Hide x-axis labels (locus names)
  --hide_ylabels        Hide y-axis labels (sample names)
  --no_grid             Do not show grid lines in heatmap
  --color {black,blue,red,green,purple,orange,yellow,brown,pink}
                        Color scheme for heatmap gradient (default: black)
  --show_values         Show numerical values in heatmap cells (only for values >= 2)
  --grid_color GRID_COLOR
                        Color for grid lines in heatmap (default: grey)
  --add_markers         Add visual markers in cells (dots for 1s, diagonal lines for 0s)

(6) Output examples

  • TSV Report (paralog_counts.tsv):
    (use -opr, --output_paralog_report to specify the output filename and directory.)
Species   gene1   gene2   gene3
Sample1   2       1       0
Sample2   1       1       1
Sample3   0       2       1
  • Heatmap Visualization
    The heatmap uses color intensity to represent the number of recovered sequences:
    • White: No sequences (0)
    • Light color: Single sequence (1), representing single copy orthologs.
    • Darker color: Multiple sequences (≥2), representing putative paralogs.
  • For example, the following figure is the default output of our test dataset Arabidopsis100.
    You can find it in <output_dir>/02-All_paralogs/04-Filtered_paralog_reports_and_heatmap/Filtered_paralog_heatmap.png after running HybSuite by following our guide. In the HybSuite stage 2 pipeline, this script was applied to generate the heatmaps for original paralogs and filtered paralogs and by default, --show_values is used to display the specific numbers of recovered sequences in each locus of each sample.

test_dataset-paralog_heatmap_default

  • When running this script manually, the recovered sequence counts won’t display if you don’t use the --show_values option.:

test_dataset-paralog_heatmap_no_value

  • To clearly show the type of sequence in each locus of each sample, it is advisable to use --add_markers plus --show_value to add markers and numbers to the figure:
    • X: No sequences (0)
    • ·: Single sequence (1), representing single copy orthologs.
    • <number>: Multiple sequences (≥2), representing putative paralogs.
python plot_paralog_heatmap.py ... -add_markers --show_value

test_dataset-paralog_heatmap_markers

  • Besides, you can also use --color to change a new color theme:
python plot_paralog_heatmap.py ... --color red 

test_dataset-paralog_heatmap_red_color

python plot_paralog_heatmap.py ... --color blue 

test_dataset-paralog_heatmap_blue_color

NOTE: Our script only provides nine color themes, they are black(default), red, blue, purple, green, orange, yellow, brown and pink.

(7) Use cases

  1. Paralog Distribution Analysis: Identify which species and genes tend to have more paralogs
  2. Data Quality Assessment: Evaluate completeness of sequencing and assembly
  3. Evolutionary Analysis: Study gene duplication events across different species
  4. Data Visualization: Generate high-quality visualizations for papers or reports

(8) Tips and tricks

  • For large datasets, increase the thread count (-t parameter) to speed up processing
  • If sample names are long, use a smaller sample font size (–sample_font)
  • This script can adjust the image dimensions automatically (which might be best for visualization in many cases). You can also use --fig_length and --fig_height to manually adjust your image.
  • Use –show_values to display specific paralog counts directly on the heatmap (for counts ≥2)

2. plot_recovery_heatmap_v2.py

(1) Overview

plot_recovery_heatmap_v2.py visualizes sequence recovery across samples and loci. It generates heatmaps showing the percentage of sequence length recovered for each gene in each sample, relative to reference sequences.

It highlights:

  • Well-recovered loci across samples
  • Samples with poor overall recovery
  • Recovery patterns indicating systematic biases

Key features:

  • Calculates sequence lengths from FASTA files
  • Generates comprehensive sequence length tables in TSV format
  • Creates customizable heatmaps with multiple color schemes
  • Supports comparison against mean or maximum reference lengths
  • Offers extensive visualization options including value display and grid customization
  • Provides multi-threading support for processing large datasets
  • Support for interactive Plotly HTML output.

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas seaborn matplotlib numpy plotly

The script requires:

  • Python 3.6 or higher
  • Biopython (for sequence parsing)
  • NumPy and Pandas (for data manipulation)
  • Matplotlib and Seaborn (for visualization) The script automatically checks for required packages and will provide clear error messages if any are missing.

(3) Input file requirements

Required input:

  1. Directory of FASTA files - Each file should contain sequences for a single locus across multiple samples
  • Supported file extensions: .fna, .fasta, .fa.
  • Each sequence header should start with the species/sample name (e.g., >species_name rest_of_header).
  1. Target sequence file - A single FASTA file containing reference sequences for all target loci
  • Each sequence ID should include the locus name at the end, separated by a hyphen (e.g., >ref-locusnameA).
  • The script automatically detects whether references are nucleotide or protein sequences.

(4) Optional input:

  • Species list file - A simple text file with one species name per line (-s <FILE>)
  • If not provided, species names will be automatically extracted from the FASTA files

(5) Basic usage

The basic command requires only the input directory and reference file:

python plot_recovery_heatmap_v2.py -i /path/to/fasta_files -r /path/to/reference.fasta \
-output_heatmap /path/to/recovery_heatmap(without_suffix)

This will:

  1. Calculate sequence lengths for each sample and locus.
  2. Generate a seq_lengths.tsv file in the current directory. (use --output_seq_lengths to change the output path).
  3. Create a heatmap image named recovery_heatmap.png in the directory /path/to/.

(6) Example

If you have finished running our pipeline, open the directory 02-All_paralogs/03-Filtered_paralogs, which is one of the output directories of our example dataset Angiosperms353. Then you will find there are many FASTA files in it:

4471_paralogs_all.fasta
4527_paralogs_all.fasta
4691_paralogs_all.fasta
4724_paralogs_all.fasta
...
  • Locus name 4471, 4527, 4691, 4724
  • Filename suffix: _paralogs_all
  • Suffix: .fasta

In that case:

  • use the option -i/--input_dir to specify the path to this directory;
  • use the option -r to specify the path to the Angiosperms353 target file (locus names in target file must be corresponded with that in your input directory);
  • use the option --filename_suffix to specify the filename suffix _paralogs_all to enable the script extract the locus name from the filename.

Run:

cd /path/to/02-paralogs/03-Filtered_paralogs/
python plot_recovery_heatmap_v2.py -i . -r /path/to/Target_file_Angiosperms353.fasta --filename_suffix "_paralogs_all" -output_heatmap ./recovery_heatmap.html -gw 0

Then you can obtain an interactive heatmap HTML file ./recovery_heatmap.html:

Loading chart...
  • The blue bars along with x- and y-axes indicate how many loci are recovered in each sample and how many samples each locus are recovered in, respectively.
  • The color intensity of each cell indicates the proportion of gene length recovered for a given sample (y-axis) at a specific target locus (x-axis). When multiple sequences are recovered for a locus within a sample (putative paralogs), only the longest sequence is retained for visualization in the heatmap.

Now, let’s play with this interactive html file for fun and better effect!

  • Choose the button “Sort by” as “Descending” to sort samples and loci on the heatmap from high to low recovery.
  • Click on the “Plus” (+) and “Minus” (-) icons in the upper right corner to zoom in and out of the heatmap.
  • Click on the “AutoScale” icon in the upper right corner to auto-scale the heatmap.
  • Click the “Camera” (📷) icon in the upper right corner to download the current heatmap view as a PNG file.

(7) Full parameters

usage: plot_recovery_heatmap_v2.py [-h] -i INPUT_DIR -r REFERENCE [-s SPECIES_LIST] [--filename_suffix FILENAME_SUFFIX]
                                   [--output_species_list OUTPUT_SPECIES_LIST] [--output_heatmap OUTPUT_HEATMAP]
                                   [--output_seq_lengths OUTPUT_SEQ_LENGTHS] [-t THREADS]
                                   [--color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}] [--title TITLE]
                                   [--use_max] [--xlabel XLABEL] [--ylabel YLABEL] [-gw GRID_WIDTH]

plot_recovery_heatmap_v2.py - A visualization tool in HybSuite

This script is a component of the HybSuite toolkit, designed for visualizing sequence recovery 
rates across different taxa and loci. It generates heatmaps that display the percentage of 
sequence length recovered for each gene in each taxon, relative to either the average or 
maximum length of reference sequences.

Key features:
1. Calculates sequence lengths and generates a seq_lengths.tsv file
2. Calculates the percentage length recovered relative to reference sequences
3. Generates customizable heatmaps showing recovery rates
4. Supports both average and maximum reference length comparisons
5. Offers flexible visualization options

Both the seq_lengths.tsv file and heatmap generation are optional outputs.
Part of HybSuite

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIR, --input_dir INPUT_DIR
                        Directory containing FASTA files for each locus
  -r REFERENCE, --ref REFERENCE
                        Target sequence file (FASTA format)
  -s SPECIES_LIST, --species_list SPECIES_LIST
                        File containing list of species names (one per line). If not provided, species names will be extracted from FASTA files
  --filename_suffix FILENAME_SUFFIX
                        Suffix(es) to remove from input FASTA filenames to get locus names. Multiple suffixes can be separated by commas. Example: "_paralogs_all". If not specified, the input filenames will be recognized as loci names.
  --output_species_list OUTPUT_SPECIES_LIST, -osp OUTPUT_SPECIES_LIST
                        Output file for extracted species list (when species_list is not provided)
  --output_heatmap OUTPUT_HEATMAP, -oh OUTPUT_HEATMAP
                        Output path and filename for the heatmap (default: recovery_heatmap.html). Should end with .html extension.
  --output_seq_lengths OUTPUT_SEQ_LENGTHS, -osl OUTPUT_SEQ_LENGTHS
                        Output file for sequence lengths (TSV format). If not provided, sequence lengths will be written to seq_lengths.tsv in current directory
  -t THREADS, --threads THREADS
                        Number of threads to use (default: 1)
  --color {viridis,magma,inferno,plasma,cividis,turbo,purple,blue,green,black}
                        Color scheme for the HTML heatmap (default: blue). Available options: viridis, magma, inferno, plasma, cividis, turbo, purple, blue, green, black
  --title TITLE         Main title of the heatmap (default: "Percentage length recovery for each gene")
  --use_max             Use maximum length instead of average length from reference sequences
  --xlabel XLABEL       X-axis label (default: "Locus")
  --ylabel YLABEL       Y-axis label (default: "Sample")
  -gw GRID_WIDTH, --grid_width GRID_WIDTH
                        The value of grid width of the heatmap, recommended to set as "0" when the locus number is huge (default: 0.5)

3. RLWP.py

(1) Overview

RLWP.py (Remove Loci With Paralogs) is a Python script within the HybSuite toolkit designed to filter out genetic loci with excessive paralog occurrences. Paralogs are gene copies that arise from gene duplication events and can complicate phylogenetic analyses. This tool identifies and removes loci that exceed a user-defined threshold of paralog presence across samples, helping to improve the quality of downstream analyses by maintaining only single-copy orthologous markers.
Key features:

  • Filters loci based on paralog occurrence statistics
  • Supports multi-threading for improved performance
  • Provides detailed logging and reporting
  • Offers in-place filtering or non-destructive output to a separate directory
  • Works with various sequence file formats (suffix can be FASTA, FNA, fasta, fa)

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:
conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

RLWP.py requires two main types of input:

  1. A directory containing sequence files: A directory containing nucleotide sequence files in FASTA format (.fa, .fasta, .fna, .fas or their uppercase variants). Each file should represent one locus.
  2. Paralog statistics file: A tab-separated values (TSV) file containing paralog counts per sample for each locus.
  • 0: No any sequence is recovered in this loci of the sample.
  • 1: The only recovered sequence in this loci of the sample is a single-copy orthology sequence.
  • more than 1: Putative paralogs exist in this loci of the sample.

This file should have:

  • Sample IDs in the first column
  • Locus names as column headers
  • Values representing the number of paralogs found for each sample-locus combination

NOTE: This file can be generated by running plot_paralog_heatmap.py (with the option -oph, see here)

Example paralog statistics file format:

Sample  Locus1  Locus2  Locus3
sample1 1       2       1
sample2 1       1       3
sample3 2       1       1

(4) Basic usage

  • Remove loci where >2 samples show putative paralogs:
python RLWP.py -i input_directory -p paralog_statistics.tsv -s 2 -or deletion_report.tsv

Required parameters:

  • -i, --input_dir: Directory containing sequence files
  • -p, --paralog_heatma: Path to paralog statistics file (TSV format)
  • -s, --samples_threshold: Minimum number of samples with paralogs to trigger locus removal
  • -or, --output_report: Path for saving the deletion report

Optional parameters:

  • -o, --output_dir: Optional directory to output filtered files (preserves originals)
  • -t, --threads: Number of threads to use for parallel processing (default: 1)

(5) Output examples

Tips and tricks

  • Choosing the Right Threshold: Start with a conservative threshold (e.g., 5% of your total samples) and adjust based on your dataset characteristics.
  • Non-destructive Workflow: Use the -o option to create a filtered copy of your data without modifying original files:
python RLWP.py -i <input_directory> -p paralog_reports.tsv -s 3 -o filtered_data
  • Start with a conservative threshold: (e.g., 5% of your total samples) and adjust based on your dataset characteristics.

  • Performance Optimization: For large datasets, increase thread count to speed up processing


4. filter_seqs_by_length.py

(1) Overview

  • filter_seqs_by_length.py is a Python script within the HybSuite package that filters DNA sequences based on length criteria. It allows filtering sequences using absolute minimum length or relative length compared to reference sequences.
  • It is particularly useful for removing short, potentially truncated sequences before downstream analyses, helping to ensure high-quality datasets for phylogenomic analysis.
  • It processes multiple FASTA files in parallel, reference file can be DNA or protein sequences.
  • It provides detailed logging and reporting of filtered sequences, making it easy to track what was removed and why.

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

filter_seqs_by_length.py requires two types of input:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas (case-insensitive)
  • Each file is assumed to contain sequences from a single locus
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName)
  1. Reference Sequences
  • The format of reference sequences is as the same as the requirement of HybSuite main program (here to check).

(4) Basic usage

  • Filter sequences by absolute minimum length:
python filter_seqs_by_length.py -i input_directory --min_length 300
  • Filter using reference sequences to according to length ratio (compared to mean length or maximum length of each locus in reference file):
python filter_seqs_by_length.py -i input_directory -r reference.fasta --mean_length_ratio 0.7
  • Combine multiple criteria to filter:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --min_length 200 --mean_length_ratio 0.6 --max_length_ratio 0.5
  • Save output to a different directory rather than modifing original files:
python filter_seqs_by_length.py -i input_directory --output_dir filtered_sequences
  • Generate report of removed sequences:
python filter_seqs_by_length.py -i input_directory -r reference.fasta \
    --mean_length_ratio 0.7 --output_report removed_seqs.tsv

(5) Output examples

Filtered FASTA Files

Filtered sequences are written either:

  1. To the original files (overwriting them)
  2. To a new directory if --output_dir is specified

Removed Sequences Report

When using --output_report, a TSV file is created with details of removed sequences:

FileSequence_IDLengthMean_Length_RatioMax_Length_Ratio
gene1.fastaSample1_gene11250.4350.391
gene2.fastaSample3_gene2780.2130.185

(6) Use cases

Cleaning Assembled Data

  • Remove truncated sequences resulting from poor assembled and captured sequences:
python filter_seqs_by_length.py -i captured_exons -r reference.fasta \
    --mean_length_ratio 0.3 --output_report removed_sequences.tsv

Tips and Tricks

  • Locus Identification: Ensure filenames match locus IDs in reference sequences (part after last hyphen).
  • Preserving Originals: Always use --output_dir when testing filtering parameters to avoid overwriting original files.
  • Speed Optimization: Set -t to match available CPU cores for maximum performance.
  • Multiple Filters: Combining --min_length with ratio-based filters creates more stringent filtering.
  • Protein Sequences: The script automatically detects the type of reference file (DNA/protein) and adjusts length calculations appropriately.

5. filter_seqs_by_sample_and_locus_coverage.py

(1) Overview

filter_sequences_by_sample_and_locus_coverage.py is a Python script designed to remove samples with low loci coverage and loci with low sample coverage in phylogenomic dataset based on user-defined threshold.

This tool can:

  • Filter samples and loci based on minimum coverage thresholds
  • Generate reports of removed samples and loci
  • Process files in parallel to improve performance

(2) Dependencies

  • Key dependencies:

    • BioPython: For sequence handling
    • Pandas: For reporting and data manipulation
    • Python 3.6+: For pathlib and f-string support
  • If you’ve already installed all HybSuite dependencies in <conda1_env>, activate it to run this script:

conda activate <conda1_env>
  • Otherwise, manually install the dependencies first:
pip install biopython pandas

(3) Input file requirements

The tool processes FASTA files in a directory with the following requirements:

  1. A directory containing sequence files in FASTA format:
  • Supported extensions: .fa, .fasta, .fna, .fas, FNA (case-insensitive).
  • Each file is assumed to contain sequences from a single locus.
  • Filename determines locus ID (e.g., the locus name of GeneName.fasta is GeneName).

(4) Basic usage

python filter_seqs_by_sample_and_locus_coverage.py -i input_directory --min_sample_coverage 0.5 --min_locus_coverage 0.7

Required parameters

  • -i, --input: Directory containing FASTA files.
  • --min_sample_coverage: Minimum sample coverage ratio (0-1) for each locus (default: 0.0)
  • `–min_locus_coverage: Minimum locus coverage ratio (0-1) for each sample (default: 0.0)

Optional parameters

  • -o, --output_dir: Directory for filtered sequences (if not specified, original files are modified)
  • -t, --threads: Number of threads to use (default: 1)
  • --removed_samples_info: Output TSV file for removed samples coverage information
  • --removed_loci_info: Output TSV file for removed loci coverage information

(5) Output examples

  • Example of removed_samples.tsv (specified by --removed_samples_info):
Sample  Locus_Coverage
Species1    0.45
Species2    0.32
  • Example of removed_loci.tsv (specified by --removed_loci_info):
Locus   Sample_Coverage
Locus1  0.38
Locus2  0.42

(6) Use cases

  • The following running codes remove loci that appear in less than 60% of samples and samples that have less than 50% of loci.
python filter_seqs_by_sample_and_locus_coverage.py \
-i assembled_loci/ -o filtered_loci/ \
--min_sample_coverage 0.6 --min_locus_coverage 0.5 \
--removed_samples_info removed_samples.tsv --removed_loci_info removed_loci.tsv

(7) Tips and tricks

  • Choosing Coverage Thresholds: Start with lower thresholds (e.g., 0.3-0.5) and gradually increase until you achieve the desired balance between data completeness and taxon/locus sampling.
  • Preserving Original Files: Always use the -o option to output to a new directory when experimenting with different thresholds.
  • Removing Problematic Samples Only: Set --min_locus_coverage without setting --min_sample_coverage to filter out only low-coverage samples while keeping all loci.
  • Tracking Removed Data: Always use the --removed_samples_info and --removed_loci_info options to keep records of what was filtered out for documentation and troubleshooting.
  • Performance Optimization: Use the -t option with a value close to your CPU core count for faster processing of large datasets.

6. modified_phypartspiecharts.py

(1) Overview

(2) Basic usage

The basic usage of modified_phypartspiecharts.py is nearly as the same as that of phypartspiecharts.py. The only difference is that users have to use --output to specify the path and filename of the output visiualization results when running modified_phypartspiecharts.py, rather than --svg_name in phypartspiecharts.py.

python modified_phypartspiecharts.py \
species_tree phyparts_prefix num_genes ...
  • Required Parameters in basic usage
    • species_tree: Path to species tree file (Newick format)
    • phyparts_root: Prefix of PhyParts output files
    • num_genes: Total number of gene trees

(3) Extended functionality

Compared to the original version, modified_phypartspiecharts.py offers the following extended functionality:

a. Running Efficiency Control

  • Multithreading Support: Use -nt/--threads <NUM> for multithreaded processing, significantly improving speed for large datasets

b. Output Files Control

  • Support for SVG and PDF format outputs: use --output <output_file> and specify your outputfile with the extension .pdf or .svg.

  • Additional Statistical Output: use --stat parameter to export detailed node statistics to TSV files.
    The detailed node statistics table generated by --stat parameter contains the following columns:

    • Node: Node ID
    • Support(blue): Number of genes supporting the species tree
    • TopConflict(green): Number of genes with main conflict
    • OtherConflict(red): Number of genes with other conflict
    • NoSignal(gray): Number of genes with no signal
    • Support/Total_Ratio: Ratio of supporting to conflicting genes

    The file also includes the average ratios for internal nodes including Support/Total Ratio, Conflict/Total Ratio, NoSignal/Total Ratio, Support/Signal_Ratio and Conflict/Signal_Ratio.

    For example:

Node	Support(blue)	TopConflict(green)	OtherConflict(red)	NoSignal(gray)	Support/Total_Ratio	Conflict/Total_Ratio	NoSignal/Total_Ratio	Support/Signal_Ratio	Conflict/Signal_Ratio
0	85	0	0	157	0.3512	0.0000	0.6488	1.0000	0.0000
1	202	7	31	2	0.8347	0.1570	0.0083	0.8417	0.1583
2	135	41	59	7	0.5579	0.4132	0.0289	0.5745	0.4255
3	205	6	20	11	0.8471	0.1074	0.0455	0.8874	0.1126
4	123	22	91	6	0.5083	0.4669	0.0248	0.5212	0.4788
5	164	33	36	9	0.6777	0.2851	0.0372	0.7039	0.2961
6	190	18	29	5	0.7851	0.1942	0.0207	0.8017	0.1983
7	91	35	112	4	0.3760	0.6074	0.0165	0.3824	0.6176
8	129	18	86	9	0.5331	0.4298	0.0372	0.5536	0.4464

Average ratios for internal nodes only:
Support/Total Ratio: 0.6079
Conflict/Total Ratio: 0.2957
NoSignal/Total Ratio: 0.0964
Support/Signal Ratio: 0.6963
Conflict/Signal Ratio: 0.3037

c. Extended Visualization Functionality

  • Support for controlling whether taxonomic names use italic font: use --no_italic
  • Support for flexible number displayed on branches: use --show_num_mode <NUM>
  • Support for controlling tree branch width display: use --line_width <NUM>
  • Support for controlling pie chart size: use --pie_size <NUM>
  • Support for controlling leaf node label size: use --tip_size <NUM>
  • Support for controlling node number label size: use --number_size <NUM>
  • Support for circular, cladogram, and phylogram display types: use --tree_type <circular|cladogram>

Github_wiki_page-modified_phypartspiecharts

(4) Full Options

options:
  -h, --help            show this help message and exit
  --taxon_subst TAXON_SUBST
                        Comma-delimited file to translate tip names.
  --output OUTPUT       Output filename with extension (.svg or .pdf)
  --output_node_tree    Generate an additional tree file with '_nodes' suffix showing:
                        - All node identifiers in the tree
                        - No pie charts
                        - No numerical annotations
  --no_ladderize        Don't ladderize tree
  --to_csv              Export data to CSV
  --tree_type {circle,cladogram,phylo}
                        Tree visualization type (cladogram or circle, default: cladogram)
  --line_width VT_LINE_WIDTH
                        Width of tree branches (default: 0)
  --no_italic           Display species names in normal font style (default: italic)
  --tip_size TIP_SIZE_FACTOR
                        Scale factor for tip label font size (default: 1.0)
  --number_size NUMBER_SIZE_FACTOR
                        Scale factor for gene tree count font size (default: 1.0)
  --show_num_mode SHOW_NUM_MODE
                        Control what numbers to show on branches (specify 0-2 digits):
                        0: Hide all numbers
                        1: Number of genes supporting species tree (blue)
                        2: Number of genes conflicting with species tree (red+green)
                        3: Number of genes with no signal (gray)
                        4: Proportion of supporting genes (blue/total)
                        5: Proportion of conflicting genes ((red+green)/total)
                        6: Proportion of no signal genes (gray/total)
                        7: Ratio of supporting to all signal genes (blue/(blue+red+green))
                        8: Ratio of conflicting to all signal genes (red+green/(blue+red+green))
                        9: Original node support values from the input tree
                        Example: --show_num_mode 0  (hide all numbers)
                                --show_num_mode 1  (show only support number)
                                --show_num_mode 12 (default, show support and conflict numbers)
                                --show_num_mode 47 (show support number and support/conflict ratio)
                                --show_num_mode 9  (show original node support values)
  --pie_size PIE_SIZE_FACTOR
                        Scale factor for pie chart size (default: 1.0)
  --stat STAT_OUTPUT    Output file path for node statistics (TSV format)
  -nt THREADS, --threads THREADS
                        Number of threads to use (default: 1)

Citation

If using this tool, please cite:


7. Fasta_formatter.py

(1) Overview

Fasta_formatter.py is a Python script for reformatting FASTA sequences into either interleaved (60 characters per line) or single-line format. It supports multi-threading for faster processing of large files.

(2) Dependencies

  • If you’ve already installed all HybSuite dependencies in <conda_env>, activate it to run this script:
conda activate <conda_env>
  • Otherwise, manually install the dependencies first:
pip install pathlib concurrent.futures

(3) Basic usage

python Fasta_formatter.py \
    -i <input_fasta> \
    -o <output_fasta> \
    --inter|--single \
    [-nt <threads>]

Required parameters:

  • -i/--input: Input FASTA file
  • -o/--output: Output file path
  • --inter: Output in interleaved format (60 characters per line)
  • --single: Output in single-line format

Optional parameters:

  • -nt/--threads: Number of threads (default: 1)

(4) Example

Convert a FASTA file to interleaved format with 4 threads:

python Fasta_formatter.py -i sequences.fasta -o sequences_formatted.fasta --inter -nt 4

Convert to single-line format:

python Fasta_formatter.py -i sequences.fasta -o sequences_singleline.fasta --single

(5) Use cases

  • Data preprocessing: Prepare sequences for downstream analysis tools that require specific FASTA formatting
  • File standardization: Convert between different FASTA formats for compatibility
  • Large file processing: Use multi-threading to speed up formatting of big datasets

8. rename_assembled_data.py

(1) Overview

rename_assembled_data.py is a python script in the HybSuite package designed to handle batch renaming operations for assembled data directories produced in the HybSuite stage 2 and their contents. It provides a comprehensive solution for renaming directories, files, and file contents while maintaining data integrity and consistency.

Key features:

  • Recursively renames directory structures, file names, and file contents
  • Handles potential naming conflicts safely
  • Supports both single directory and batch renaming operations

(2) Basic usage

Single directory renaming

To rename a single directory and all its contents:

python rename_assembled_data.py -i /path/to/directory -n new_name

Parameters:

  • -i, --input: Path to the directory you want to rename
  • -n, --new_name: The new name to replace the old name

Example:

python rename_assembled_data.py -i ./sample_001 -n sample_002

Batch renaming

For batch renaming multiple directories, create a tab-delimited file containing old and new name pairs:

python rename_assembled_data.py --rename_list path/to/rename_list.txt -p /path/to/parent_directory

Parameters:

  • --rename_list: Path to a tab-delimited file containing old_name and new_name pairs
  • -p, --parent_dir: Path to the parent directory containing all the folders to be processed

The rename list file should be formatted as follows (tab-delimited):

old_name1   new_name1
old_name2   new_name2
old_name3   new_name3

Example:

python rename_assembled_data.py --rename_list rename_pairs.txt -p ./assembled_data

The script will:

  1. Process each directory listed in the rename file
  2. Rename all matching files and directories within each target directory
  3. Replace matching content within files
  4. Provide a summary of successful and failed operations

Note: The script includes safety checks and will skip operations that might cause conflicts or data loss.


Last modified March 5, 2026: Update plotly.html (84cc3e0)