Example dataset

    This page provides detailed instructions on how to run the example dataset included with HybSuite.


    1. Download the example dataset

    If you have downloaded the HybSuite source package, a directory named example_dataset is already included. In this case, no additional download is required.

    Alternatively, you can clone the repository to your server with:

    git clone https://github.com/Yuxuanliu-HZAU/HybSuite
    cd HybSuite/example_dataset
    

    2. Configure inputs

    The example_dataset directory contains two folders, Angiosperms353 and Arabidopsis100, which hold all the inputs needed to run the HybSuite pipeline on the two example datasets from our analysis.

    Example dataset 1: Angiosperms353

    Angiosperms353/
    ├── Input_list.txt
    ├── Target_file_Angiosperms353.fasta
    └── Input_sequences/
        ├── Elaeagnus_pungens.fasta
        └── Hippophae_rhamnoides.fasta
    

    Input_list.txt

    This file lists taxon names and their corresponding sequence sources (given in the second column, separated by a tab):

    Elaeagnus_angustifolia	SRR12569928
    Elaeagnus_bambusetorum	SRR27547630
    Elaeagnus_henryi	SRR15533155
    Elaeagnus_macrophylla	SRR23618743
    Elaeagnus_mollis	SRR30566771
    Hippophae_neurocarpa	SRR17549374
    Hippophae_salicifolia	ERR7621632
    Hippophae_tibetana	SRR17549370
    Shepherdia_argentea	ERR7621633
    Barbeya_oleoides	SRR16214280	Outgroup
    Elaeagnus_oldhamii	A
    Elaeagnus_pungens	B
    Hippophae_rhamnoides	B
    
    • Identifiers prefixed with SRR or ERR: public raw NGS data for the sample named in the first column, to be downloaded by the HybSuite pipeline.
    • Identifier A: user-provided raw NGS data for the sample named in the first column, to be supplied to the HybSuite pipeline.
    • Identifier B: user-provided pre-assembled sequences for the sample named in the first column, to be supplied to the HybSuite pipeline.
    • Identifier Outgroup: specifies the outgroup taxon (given in the third column).
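
    As a rough illustration of the identifier rules above, the following shell sketch labels each entry of a tab-separated list by its source type. The function name classify_input_list and the output labels (public, raw_reads, assembled, unknown) are invented for this example and are not part of HybSuite; the accession pattern assumes SRR or ERR followed by digits, as in the list shown here.

```shell
# classify_input_list FILE
# Print "<taxon><TAB><label>" for every entry of a tab-separated input list.
# Labels are illustrative only; HybSuite itself does not emit them.
classify_input_list() {
    awk -F'\t' '
        NF < 2 { next }    # skip blank lines and lines without a source column
        {
            src = $2
            if (src ~ /^(SRR|ERR)[0-9]+$/) kind = "public"      # public accession
            else if (src == "A")           kind = "raw_reads"   # user-provided reads
            else if (src == "B")           kind = "assembled"   # pre-assembled sequences
            else                           kind = "unknown"
            flag = ($3 == "Outgroup") ? "\toutgroup" : ""       # optional third column
            print $1 "\t" kind flag
        }
    ' "$1"
}
```

    Calling classify_input_list ./Angiosperms353/Input_list.txt before a run gives a quick view of which taxa will be downloaded and which must already be present in Input_sequences.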

    Input_sequences

    This directory should contain either user-provided raw reads, pre-assembled sequences, or both, according to the information provided in Input_list.txt.

    • type1: user-provided raw reads
      In our analysis, only Elaeagnus oldhamii uses user-provided raw reads, which must be downloaded here before running the HybSuite pipeline. After downloading the raw data, convert them to FASTQ.GZ format and move them to this directory. The two paired-end files should be named:
    Elaeagnus_oldhamii_1.fastq.gz
    Elaeagnus_oldhamii_2.fastq.gz
    
    • type2: pre-assembled sequences
      Two taxa with pre-assembled sequences are provided: Elaeagnus_pungens and Hippophae_rhamnoides (the taxa marked with identifier B in Input_list.txt). Their FASTA files follow the <taxon>.fasta naming pattern: Elaeagnus_pungens.fasta and Hippophae_rhamnoides.fasta, respectively.
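
    The naming rules for both input types can be sketched in shell. The two helpers below are hypothetical and not part of HybSuite: stage_paired_reads assumes the downloaded reads are already uncompressed FASTQ files (converting from other formats is not covered here), and check_input_sequences only tests that the expected files exist and are non-empty.

```shell
# stage_paired_reads TAXON R1_FASTQ R2_FASTQ DEST_DIR
# Compress two uncompressed FASTQ files into DEST_DIR using the
# <taxon>_1.fastq.gz / <taxon>_2.fastq.gz names described above.
stage_paired_reads() {
    local taxon="$1" r1="$2" r2="$3" dest="$4"
    mkdir -p "$dest" || return 1
    gzip -c "$r1" > "$dest/${taxon}_1.fastq.gz" || return 1
    gzip -c "$r2" > "$dest/${taxon}_2.fastq.gz" || return 1
}

# check_input_sequences LIST DIR
# For every taxon marked A or B in LIST, report files missing from DIR.
# Returns 0 when everything is present, 1 otherwise.
check_input_sequences() {
    local list="$1" dir="$2" missing=0 tab taxon src rest f
    tab=$(printf '\t')
    while IFS=$tab read -r taxon src rest || [ -n "$taxon" ]; do
        case $src in
            A) set -- "${taxon}_1.fastq.gz" "${taxon}_2.fastq.gz" ;;  # raw reads
            B) set -- "${taxon}.fasta" ;;                             # pre-assembled
            *) continue ;;                                            # e.g. SRR/ERR
        esac
        for f in "$@"; do
            if [ ! -s "$dir/$f" ]; then
                echo "missing: $dir/$f" >&2
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

    For the Angiosperms353 example, check_input_sequences ./Angiosperms353/Input_list.txt ./Angiosperms353/Input_sequences would be expected to report the two Elaeagnus_oldhamii files until they have been staged.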

    Target_file_Angiosperms353.fasta

    This file is the target sequence file for Angiosperms353.
    The gene name for each sequence must be placed immediately after the final hyphen (-) in its header line:

    >Elaeagnus-pungens-4471
    AATGTCATCCAGGATAAATATCGGTTGGAAGCTGCAAATACTGACTGGATGAACAAGTAC
    AAAGGCTCTAGTAAGCTTCTATTGCATCCAAGGAACACTGAGGAGGTTTCACAGATACTC
    ...
    >Hippophae-rhamnoides-4527
    GAAGAGAGGGTTGTAGTATTAGTGATTGGTGGAGGAGGAAGAGAACATGCTCTTTGCTAT
    GCAATGAATCGATCACCATCCTGCGATGCAGTCTTTTGTGCTCCTGGCAATGCTGGGATT
    ...
    >Hippophae-salicifolia-4691
    CAGAGACTGCCTCCATTGTCAACTGATCCCAACAGATGCGAGCGTGCATTTGTTGGAAAC
    ACGATAGGTCAAGCAAATGGTGTGTACGACAAGCCAATCGATCTCCGATTCTGTGATTAC
    ...
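
    To see how the final-hyphen rule plays out on a real file, the sketch below lists each gene name found in the FASTA headers together with the number of sequences carrying it. list_target_genes is a hypothetical helper written for this page, not a HybSuite command, and it assumes gene names contain no whitespace.

```shell
# list_target_genes FASTA
# Print "<gene><TAB><count>" for each gene name, taken as the text after
# the final hyphen of every header line.
list_target_genes() {
    grep '^>' "$1" |                  # keep header lines only
        sed 's/.*-//' |               # drop everything up to the last hyphen
        sort | uniq -c |              # count sequences per gene name
        awk '{ print $2 "\t" $1 }'    # reorder to gene<TAB>count
}
```

    A header without any hyphen passes through sed unchanged, so it appears in the output still starting with ">", which makes malformed headers easy to spot.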
    

    Example dataset 2: Arabidopsis100

    Arabidopsis100/
    ├── Input_list.txt
    └── Target_file_Arabidopsis100.fasta
    

    Input_list.txt

    This file lists taxon names and their corresponding sequence sources (given in the second column, separated by a tab):

    Elaeagnus angustifolia	SRR26705271
    Elaeagnus bambusetorum	SRR26757993
    Elaeagnus henryi	SRR26705270
    Elaeagnus macrophylla	SRR26753865
    Elaeagnus mollis	SRR26758012
    Elaeagnus oldhamii	SRR26705501
    Elaeagnus pungens	SRR26705285
    Hippophae neurocarpa	SRR26705287
    Hippophae rhamnoides	SRR26756417
    Hippophae salicifolia	SRR26705274
    Hippophae tibetana	SRR26704952
    Shepherdia argentea	SRR26756705
    Barbeya_oleoides	SRR26756183	Outgroup
    

    Target_file_Arabidopsis_thaliana100.fasta

    This file is the target sequence file for Arabidopsis100.
    The gene name for each sequence must be placed immediately after the final hyphen (-) in its header line:

    >Locus-1
    MAFRRVLTTVILFCYLLISSQSIEFKNSQKPHKIQGPIKTIVVVVMENRSFDHILGWLKSTRPEIDGLTGKESNPLNVSDPNSKKIFVSDDAVFVDMDPGHSFQAIREQIFGSNDTSGDPKMNGFAQQSESMEPGMAKNVMSGFKPEVLPVYTELANEFGVFDRWFASVPTSTQPNRFYVHSATSHGCSSNVKKDLVKGFPQKTIFDSLDENGLSFGIYYQNIPATFFFKSLRRLKHLVKFHSYALKFKLDAKLGKLPNYSVVEQRYFDIDLFPANDDHPSHDVAAGQRFVKEVYETLRSSPQWKEMALLITYDEHGGFYDHVPTPVKGVPNPDGIIGPDPFYFGFDRLGVRVPTFLISPWIEKGTVIHEPEGPTPHSQFEHSSIPATVKKLFNLKSHFLTKRDAWAGTFEKYFRIRDSPRQDCPEKLPEVKLSLRPWGAKEDSKLSEFQVELIQLASQLVGDHLLNSYPDIGKNMTVSEGNKYAEDAVQKFLEAGMAALEAGADENTIVTMRPSLTTRTSPSEGTNKYIGSY*
    >Locus-2
    MSDQQLETEINFWGETSEEDYFNLKGIIGSKSFFTSPRGLNLFTRSWLPSSSSPPRGLIFMVHGYGNDVSWTFQSTPIFLAQMGFACFALDIEGHGRSDGVRAYVPSVDLVVDDIISFFNSIKQNPKFQGLPRFLFGESMGGAICLLIQFADPLGFDGAVLVAPMCKISDKVRPKWPVDQFLIMISRFLPTWAIVPTEDLLEKSIKVEEKKPIAKRNPMRYNEKPRLGTVMELLRVTDYLGKKLKDVSIPFIIVHGSADAVTDPEVSRELYEHAKSKDKTLKIYDGMMHSMLFGEPDDNIEIVRKDIVSWLNDRCGGDKTKTQV*
    >Locus-3
    MSSRENPSGICKSIPKLISSFVDTFVDYSVSGIFLPQDPSSQNEILQTRFEKPERLVAIGDLHGDLEKSREAFKIAGLIDSSDRWTGGSTMVVQVGDVLDRGGEELKILYFLEKLKREAERAGGKILTMNGNHEIMNIEGDFRYVTKKGLEEFQIWADWYCLGNKMKTLCSGLDKPKDPYEGIPMSFPRMRADCFEGIRARIAALRPDGPIAKRFLTKNQTVAVVGDSVFVHGGLLAEHIEYGLERINEEVRGWINGFKGGRYAPAYCRGGNSVVWLRKFSEEMAHKCDCAALEHALSTIPGVKRMIMGHTIQDAGINGVCNDKAIRIDVGMSKGCADGLPEVLEIRRDSGVRIVTSNPLYKENLYSHVAPDSKTGLGLLVPVPKQVEVKA*
    

    3. Run the pipeline

    First, change your working directory to the downloaded example_dataset directory:

    cd <the path to the directory of "example_dataset">
    

    Next, create output directories (or specify an existing directory when running HybSuite):

    mkdir -p ./Angiosperms353/Output ./Arabidopsis100/Output
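
    Before launching a run, it can help to confirm that the command and the input paths are reachable from the working directory. The following is a minimal sketch; preflight is a hypothetical helper, not a HybSuite subcommand, and it checks only that a command is on PATH and that the listed paths exist.

```shell
# preflight COMMAND PATH...
# Return 0 if COMMAND is on PATH and every PATH exists, 1 otherwise.
preflight() {
    local cmd="$1" rc=0 path
    shift
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "command not found: $cmd" >&2
        rc=1
    fi
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

    For the first dataset this could be called as preflight hybsuite ./Angiosperms353/Input_list.txt ./Angiosperms353/Input_sequences ./Angiosperms353/Target_file_Angiosperms353.fasta ./Angiosperms353/Output.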
    

    After setting the right working directory, run the following commands for the two example datasets:

    Angiosperms353

    hybsuite full_pipeline \
    -input_list ./Angiosperms353/Input_list.txt \
    -input_data ./Angiosperms353/Input_sequences \
    -output_dir ./Angiosperms353/Output \
    -nt 5 \
    -process 5 \
    -t ./Angiosperms353/Target_file_Angiosperms353.fasta \
    -seqs_min_length 100 \
    -seqs_min_sample_coverage 0.1 \
    -PH 1234567 \
    -sp_tree 14
    

    Arabidopsis100

    hybsuite full_pipeline \
    -input_list ./Arabidopsis100/Input_list.txt \
    -output_dir ./Arabidopsis100/Output \
    -nt 5 \
    -process 5 \
    -t ./Arabidopsis100/Target_file_Arabidopsis_thaliana100.fasta \
    -seqs_min_length 100 \
    -seqs_min_sample_coverage 0.1 \
    -PH 1234567 \
    -sp_tree 14