Bioinformatic analysis#

1. Preparing data for processing#

1.1 Unzipping data#

The sequencing data for this project came as a single zipped file (HKeDNAworkshop2023.zip), which we moved to the sequenceData/2-raw subdirectory. To check if our sequence data is in the appropriate subfolder, let us use the ls -ltr command again.

ls -ltr sequenceData/2-raw

Warning

The sequencing data has the extension .zip. This is not the same as the extension .gz, a gzipped file format frequently provided by sequencing services during file transfer. The following set of commands dealing with the .zip extension is specific to this data type, which resulted from the files being altered after subsetting from the original data. These steps will likely be unnecessary for your own project if you received your data directly from a sequencing service.

Since the data comes zipped, we need to use the unzip command to get access to the sequencing files. We can use the -d parameter to tell unzip where we would like to place the unzipped files. In our tutorial, we will place the files in a subdirectory of sequenceData/2-raw/ called sequenceData/2-raw/unzipped/. Note that we do not need to create this subdirectory first. The unzip command will generate it automatically if it does not yet exist.

unzip sequenceData/2-raw/HKeDNAworkshop2023.zip -d sequenceData/2-raw/unzipped/

Unzipping the HKeDNAworkshop2023.zip file has generated a bunch of files for all of the samples in our experiment, each represented with a forward (_1.fastq) and reverse (_2.fastq) fastq file.

1.1.1 Counting files (pipe |)#

Now that we are getting more familiar with the Terminal and bash commands, let's introduce something called a pipe, represented by the | symbol. A pipe allows you to use the output of one command as the input of a second command; hence the terminology "pipe", as | acts as a connection between two different commands. To show you the power of piping commands together, let's work through the exercise of counting the number of samples in our data set.

STEP 1: To count the number of samples, we can use a step-by-step approach. First, we use a command you already know, ls -1, to list all the files in the directory. The -1 parameter prints the files in a single column. Let's look at the output of this command before we continue.

ls -1 sequenceData/2-raw/unzipped/

STEP 2: Once we have a list, we can use a counting command such as wc, which stands for word count. However, we want to count lines, not words, so we need to add the -l parameter. When we pipe these two commands together, the Terminal window will output the number of files in our folder. Let's try!

ls -1 sequenceData/2-raw/unzipped/ | wc -l

STEP 3: Remember that we wanted to know the number of samples for which we have data files and that each sample is represented by one forward (_1.fastq) and one reverse (_2.fastq) file. So, for the last step, we need to divide the number of files by 2. The easiest way to accomplish this is to pipe the number of files in our directory to an awk command (awk is a separate text-processing language available in the Terminal) that takes the number and divides it by 2. In this last command, we specify that we would like to print the first field ($1), which in our case is the number of files, divided by two (/2).

ls -1 sequenceData/2-raw/unzipped | wc -l | awk '{print $1/2}'  

By using one line of code and two pipes, we have shown that we have 27 samples in our data set. In our tutorial data set, these 27 samples comprise 3 negative controls plus 2 size fractions * 3 replicates * 4 sites (3 + 2*3*4 = 27).

Tip

AWK, sed, and grep are very powerful languages for processing text. Throughout this tutorial, we will be using snippets of code from these three languages. Unfortunately, the syntax of these languages is quite complex in my opinion. I recommend investigating them in your own time if bioinformatics is of interest to you. However, to keep this tutorial beginner friendly, we will not expand on the syntax further.

1.2 Fastq and Fasta file structure#

When unzipping the HKeDNAworkshop2023.zip file, we have revealed our sequence data files. Those sequence data files have the extension .fastq. Sequence data is most frequently represented as either .fastq or .fasta files. Both are simple text files that are structured in a particular way.

For example, within .fastq files each sequence record is represented by 4 lines. The first line contains the header information and starts with the symbol @. The second line contains the actual sequence. The third line can contain metadata, but is usually restricted to the + symbol. The fourth line provides the quality of each base call and should, therefore, have the same length as the second line containing the sequence. We can inspect the file structure of one of the sequence files we just unzipped using the head or tail command, which print the first or last N lines of a document, respectively. By using the -n parameter, we can specify the number of lines to be printed.

head -n 4 sequenceData/2-raw/unzipped/COI_100_HK37_1.fastq
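The output will resemble the record below. Note that this is a made-up example for illustration only; the actual header, sequence, and quality string in your file will differ.

@M02181:45:000000000-ABCDE:1:1101:15977:1491 1:N:0:1
GGTACAGGATGAACAGTTTACCCTCC
+
CCCCCGGGGGGGGGGGGGGGGGGGGG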

.fasta files, on the other hand, are simple text files structured in a slightly different way, whereby each sequence record is represented by 2 lines. The first line contains the header information and starts with the symbol >. The second line contains the actual sequence. The third and fourth lines present in .fastq files are missing from .fasta files, as this file format does not incorporate the quality of a sequence. We can inspect the file structure of a .fasta file by using the head or tail command on the reference database.

tail -n 2 sequenceData/7-refdb/newDBcrabsCOIsintax.fasta

Exercise 2

Knowing the file structure of the .fastq and .fasta sequence files, how would you calculate the number of sequences contained in the files sequenceData/2-raw/unzipped/COI_100_HK37_1.fastq and sequenceData/7-refdb/leray_COI_sintax.fasta?
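One possible approach (a hint, not the only solution): divide the total line count of the .fastq file by 4, and count the header lines of the .fasta file.

wc -l sequenceData/2-raw/unzipped/COI_100_HK37_1.fastq | awk '{print $1/4}'
grep -c '^>' sequenceData/7-refdb/leray_COI_sintax.fasta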

1.3 Changing file names#

When we inspect the file names of our .fastq files by using the ls command, we can see that they contain underscore (_) symbols.

_images/filenamelist3.png

Fig. 4 : A list of the fastq file names#

While underscores will not cause downstream incompatibility issues, some symbols in file names can cause problems with certain programs further down the line. One frequently observed symbol that affects one of the programs we will use during our bioinformatic pipeline is the dash (-). While not pertinent for our tutorial file names, it is important to know how to remove or change such symbols in file names in the Terminal without having to do this manually. As an exercise, let's replace all underscores (_) with dashes (-) and back, to show you how the code would work if you need it in the future.

Luckily, there is a simple perl one-liner to batch replace characters in file names using the rename command. We tell rename that we want to substitute a pattern by providing the s operator. The /_/-/ part of the expression tells rename which pattern to substitute, i.e., replace _ with -. Finally, we tell rename that we want to replace all instances using the g (global) modifier. Without it, rename would only substitute the first instance. The * at the end of the code tells the computer to go over all the files in the unzipped/ directory.

cd sequenceData/2-raw/
rename 's/_/-/g' unzipped/*
_images/filenamelist4.png

Fig. 5 : A list of fastq file names whereby the underscores are replaced by dashes#

rename 's/-/_/g' unzipped/*
cd ../../
_images/filenamelist5.png

Fig. 6 : A list of fastq file names whereby the dashes are replaced back by underscores#
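Note that the perl-based rename command is not available on every system, and a different rename utility (from util-linux) with another syntax exists as well. If the command above fails on your system, a pure bash fallback using a for loop and parameter expansion could look like the sketch below (check the result with ls before relying on it; the reverse direction works analogously).

cd sequenceData/2-raw/
for f in unzipped/*_*
do
  mv "$f" "${f//_/-}"
done
cd ../../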

1.4 Quality control of raw data#

Up to this point, we have unzipped our sequence data, checked the file structure, and batch renamed our files to exclude problematic symbols in the file names. Before processing the contents of each file, it is best to check the quality of the raw sequence data to ensure no problems were faced during sequencing. The programs we will be using to check the quality of all of our files are FASTQC and MULTIQC. First, we will generate reports for all of the files using FastQC. We can specify the output folder using the -o parameter and the number of threads or cores using the -t parameter.

Warning

Make sure to change the value of the -t parameter to the number of cores available on your system!

To run FASTQC on the supercomputer, we can modify the template script sequenceData/1-scripts/example-script.sh. We can open text documents with CLI text editors, such as nano.

nano sequenceData/1-scripts/example-script.sh

When opening the script, we can add the following line of code to execute the FASTQC command.

fastqc sequenceData/2-raw/unzipped/* -o sequenceData/3-fastqc/ -t 8

We can exit out of the editor by pressing ctrl + x, followed by y. We will change the name of the script to fastqc_raw.sh and press y to save under a different name. After completing these steps, you should be back in the normal Terminal window. To check if we created this new script, let's use the ls -ltr command again.

ls -ltr sequenceData/1-scripts

To run the command on the supercomputer, we can use the sbatch command.

sbatch sequenceData/1-scripts/fastqc_raw.sh

Once this command is executed, the job will be placed in the queue to run on the cluster. We can use the squeue command to check if our code is running or if the job has finished. It is best to filter for your own jobs using the -u (user) parameter, followed by your user name.

squeue -u ednaw01

Warning

Make sure to change the -u parameter to your user name!

Once finished, FastQC will generate a .html report for every single file in the subdirectory sequenceData/2-raw/unzipped/. Let’s open one report to see what it contains. The .html reports are stored in the subdirectory sequenceData/3-fastqc/. From the FastQC .html reports, we are particularly interested in the summary tab, as well as the per base sequence quality and sequence length distribution figures.

Since we are working on the supercomputer, we need to download the .html report to our own computer before we can open it. To do this, let's open a new Terminal window that is not connected to the supercomputer, move to the Desktop directory using the cd command, and download the file using the scp command, which stands for secure copy. While not essential in this case, you can add the -r parameter to allow the download of directories. We also need to specify the server location of the file to download and where we would like to place the file on our system.

cd Desktop
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/sequenceData/3-fastqc/COI_100_HK37_1_fastqc.html ./
_images/fastqcraw.png

Fig. 7 : A FastQC report of the raw sequencing data.#

The quality of file COI_100_HK37_1.fastq is looking very good and the number of sequences reported for this sample (160,732 reads) is a good sequencing depth as well!

Opening each report for the 54 files separately, however, is quite tedious and makes it difficult to compare differences between files. Luckily, multiQC was developed to collate these reports into a single .html report. We point multiQC to the directory containing the FastQC reports (note the . appended to the path), and the -o parameter lets us set the output directory.

Again, to run this on the supercomputer, we need to modify the initial template script and add in the multiqc code.

nano sequenceData/1-scripts/example-script.sh
multiqc sequenceData/3-fastqc/. -o sequenceData/3-fastqc/

Press ctrl + x and y to save the file and change the name to multiqc_raw.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/multiqc_raw.sh
_images/multiqcterminal2.png

Fig. 8 : Terminal output for multiQC#

The multiQC program will combine all 54 FastQC reports into a single .html document. Let's download and open this file to see what our raw sequence data looks like.

scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/sequenceData/3-fastqc/multiqc_report.html ./
_images/multiqcreport2.png

Fig. 9 : The .html multiQC report#

The sequence quality, average read length, and number of sequences are all very comparable between the different files representing the samples. This is an excellent starting point! Interestingly, note the difference in length, read depth, and quality for the control samples. As those are our negative controls, these differences aren't worrisome. If, on the other hand, actual samples looked like this, you might want to consider resequencing them to achieve a similar sequencing depth.

2. Merging forward and reverse reads#

2.1 A single sample example#

Once we have assessed the quality of the raw sequence data and observed no issues, we can move ahead with merging the forward and reverse reads of each sample. As a reminder, we have 54 .fastq files, indicating that we have 27 samples in our experiment, with one forward and one reverse file for each sample.

For this experiment and particular library, we have amplified a ~313 bp fragment of the COI gene (excluding primer-binding regions). With the sequencing run specifications of a MiSeq 2x250 bp paired-end V2 sequencing kit, the forward and reverse reads partially overlap in the middle of the amplicon region (2 x 250 bp - 313 bp = 187 bp of overlap). This overlap is essential for successful merging. Today, we will be using the --fastq_mergepairs command in VSEARCH to merge the reads of the forward and reverse .fastq sequence files. We can specify the reverse sequence file through the --reverse parameter and the output file through the --fastqout parameter. To check the options available within the program, you can always read the documentation and pull up the help vignette by executing vsearch --help in the Terminal window. Before merging reads on all samples immediately, it is best to play around with the parameters on a single sample.

nano sequenceData/1-scripts/example-script.sh
vsearch --fastq_mergepairs sequenceData/2-raw/unzipped/COI_100_HK37_1.fastq --reverse sequenceData/2-raw/unzipped/COI_100_HK37_2.fastq --fastqout sequenceData/2-raw/COI_100_HK37merged.fastq

Press ctrl + x and y to save the file and change the name to merge_single.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/merge_single.sh

Using the default settings, we managed to merge 90.9% of reads. For this tutorial, we'll take that as sufficient. However, for your own project, you might want to explore the various options of the --fastq_mergepairs command to try and increase the percentage of merged reads (see the sketch below). From the vsearch --fastq_mergepairs output printed to the Terminal window, we can see that VSEARCH reports the number of read pairs in the raw data files, the number of reads successfully merged, and the number of reads unable to be merged. Additionally, information on why reads could not be merged and statistics about merged reads are provided.
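For example, the following variation on the single-sample command relaxes the maximum number of mismatches allowed in the overlap region and requires a minimum overlap length. The parameter values shown here are illustrative only, not recommendations; check vsearch --help for the full list of options and their defaults.

vsearch --fastq_mergepairs sequenceData/2-raw/unzipped/COI_100_HK37_1.fastq --reverse sequenceData/2-raw/unzipped/COI_100_HK37_2.fastq --fastq_maxdiffs 15 --fastq_minovlen 50 --fastqout sequenceData/2-raw/COI_100_HK37merged.fastq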

2.2 Batch merging reads#

Since we need to merge sequences for 27 samples, we will introduce a new coding concept, the for loop. A for loop allows us to move through a list and execute a command on each item. While we will not go into too much detail for this tutorial, the basic syntax of for loops is as follows:

for file in list
do
  execute command
done

Basically, the first line tells the program to go through a list and, for each item in the list, do something (line 2), which in this case is execute command (line 3). Once the program has executed the command on all items within the list, we close the loop by specifying done (line 4). So, with a for loop, we can tell the computer to merge reads for our 27 samples automatically, saving us from having to execute the merging code 27 times ourselves. A minimal example is shown below.
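As a minimal, harmless example, the following loop simply prints the name of every forward sequencing file; no files are modified.

for R1 in sequenceData/2-raw/unzipped/*_1.fastq
do
  echo "found forward file: ${R1}"
done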

Within our for loop to merge reads for all samples, we will generate a list of all forward sequencing files (*_1.fastq). Before writing the code to merge reads in our for loop, we will first print which sample is being merged, as the VSEARCH output does not specify this (echo "Merging reads for: ${R1/_1.fastq/}"). To print to the Terminal, we can use the echo command. With the last part of the echo command (${R1/_1.fastq/}), we specify that, within our forward sequencing file name $R1, the substring _1.fastq should be substituted with nothing, leaving only the sample name. After the echo command, we execute the vsearch --fastq_mergepairs command as we did above, but we need to use the same substitution syntax to tell the program what our forward, reverse, and output file names are. Remember that those names change as the program iterates over the list, so we cannot hard code them into the command.

rm sequenceData/2-raw/COI_100_HK37merged.fastq
nano sequenceData/1-scripts/example-script.sh
cd sequenceData/2-raw/unzipped/
for R1 in *_1.fastq
do

  echo "\n\nMerging reads for: ${R1/_1.fastq/}"
  vsearch --fastq_mergepairs ${R1} --reverse ${R1/_1.fastq/_2.fastq} --fastqout ../${R1/1.fastq/merged.fastq}

done

Press ctrl + x and y to save the file and change the name to merge_batch.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/merge_batch.sh

When executing this for loop, VSEARCH will attempt to merge all reads from the forward and reverse sequencing file for each sample separately. The output files containing the merged reads will be placed in the subdirectory sequenceData/2-raw/ and will have the extension _merged.fastq rather than _1.fastq or _2.fastq. Within the Terminal window, all the statistics are reported for each sample separately. A quick glance shows us that we managed to merge roughly 90% of reads for each sample. However, due to the number of samples, it is difficult to get an overview of the read merging success rate and compare different samples.

2.3 Summarizing the output#

To get a better overview, we can use the following python script to generate a bar plot of the raw and merged read statistics. The python code takes two arguments: the first is the location where the merged sequence files are stored, the second is the location of the raw sequence files. Since this is python code, we cannot directly copy-paste the code into the Terminal as we did before. Rather, we have to create our first script. To do this, we will use a text editor that is implemented in the Terminal called nano. Running nano will open a new text window where we can copy-paste our python code. For this tutorial, it is not essential to understand python, though it is an extremely powerful and useful coding language for bioinformaticians! To provide some context, I'll give a brief explanation during the workshop when we're executing this code.

nano sequenceData/1-scripts/rawMergedStatistics.py
#! /usr/bin/env python3

## import modules
import os
import sys
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## user arguments
mergedPath = sys.argv[1]
rawPath = sys.argv[2]

## first, create a sample name list
mergedFileList = os.listdir(mergedPath)
sampleNameList = []
for mergedFile in mergedFileList:
    sampleName = mergedFile.split('_merged.fastq')[0]
    if sampleName.startswith('COI_') or sampleName.startswith('NTC_'):
        sampleNameList.append(sampleName)

## count number of raw and merged sequences for each sample in sampleNameList
rawSeqCount = {}
mergedSeqCount = {}
for sample in sampleNameList:
    with open(f'{mergedPath}{sample}_merged.fastq', 'r') as mergedFile:
        x = len(mergedFile.readlines()) / 4
        mergedSeqCount[sample] = int(x)
    with open(f'{rawPath}{sample}_1.fastq', 'r') as rawFile:
        y = len(rawFile.readlines()) / 4
        rawSeqCount[sample] = int(y)

## create a dataframe from the dictionaries
df = pd.DataFrame({'Sample': list(rawSeqCount.keys()), 'Raw': list(rawSeqCount.values()), 'Merged': list(mergedSeqCount.values())})

## sort the dataframe by raw reads in descending order
df = df.sort_values(by='Raw', ascending=False)

## calculate the percentage of merged/raw and format it with 2 decimal places and the '%' symbol
df['Percentage'] = (df['Merged'] / df['Raw'] * 100).round(2).astype(str) + '%'

## create a horizontal bar plot using seaborn
plt.figure(figsize=(20, 8))  # Adjust the figure size as needed

## use seaborn's barplot
ax = sns.barplot(x='Raw', y='Sample', data=df, label='Raw', color='#BBC6C8')
sns.barplot(x='Merged', y='Sample', data=df, label='Merged', color='#469597')

## add labels and title
plt.xlabel('Number of sequences')
plt.ylabel('Samples')
plt.title('Horizontal bar graph of raw and merged reads (Sorted by Total in Reverse)')

## add a legend
plt.legend()

## add the percentage label next to each bar
for i, v in enumerate(df['Percentage']):
    ax.text(df['Raw'].values[i] + 50, i, v, va='center', fontsize=10, color='black')

## save the plot
plt.tight_layout()
plt.savefig('raw_and_merged_bargraph.png', dpi = 300)

Once we have copy-pasted the code, we can press ctrl + x to exit out of the editor, followed by y and return to save the file. After doing so, we're back in the normal Terminal window. Before we can run or execute our first script, we need to make it executable.

Important

Remember the permissions for each file that we discussed before? By running the ls -ltr command, we can see that for the script we have just created, we only have read (r) and write (w) access, but no execute (x) permission. Execute permission is essential in this case, as the python script itself won't be submitted through the sbatch command, but invoked directly from within our job script.

ls -ltr sequenceData/1-scripts/

To change the permissions or mode of a file, we can use the chmod command, which stands for change mode. Since we want to make our file executable, we can specify the parameter +x.

chmod +x sequenceData/1-scripts/rawMergedStatistics.py

Rerunning the ls -ltr command shows that we have changed the permissions and that we can execute our script.

ls -ltr sequenceData/1-scripts/

To execute the script, we prefix it with ./, followed by our two user parameters specifying where the (parameter 1) merged and (parameter 2) raw files are located.

nano sequenceData/1-scripts/example-script.sh
./sequenceData/1-scripts/rawMergedStatistics.py sequenceData/2-raw/ sequenceData/2-raw/unzipped/

Press ctrl + x and y to save the file and change the name to check_merging.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/check_merging.sh
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/raw_and_merged_bargraph.png ./
_images/raw_and_merged_bargraph.png

Fig. 10 : Raw and merged read count#

The figure we generated using the python script shows a horizontal bar graph with the merged reads in green and the raw sequence counts in grey. We can see that the percentage of merged reads is similar across all samples, roughly 90%. We can also see some differences in read depth between the samples, which is to be expected, as pooling samples equimolarly is hard to do in the lab, something we covered during the first week of this eDNA workshop. Also, note the big difference between the samples and the negative controls, something we observed earlier in the MULTIQC report as well.

Amplicon length exceeding cycle number

To merge forward and reverse reads, it is essential to have at least a partial overlap between the forward and reverse reads within the amplicon region. Even better is a full overlap, as it increases the base call quality scores. Sometimes, however, the amplicon is too long to achieve partial overlap with paired-end reads on an Illumina platform, since Illumina sequencing technology is restricted in read length. While such a scenario should be avoided at all costs, if you find yourself in this situation, you can concatenate the paired-end reads, rather than merge them, using the following line of code.

vsearch --fastq_join input_R1.fastq --reverse input_R2.fastq --fastqout output_joined.fastq

Keep in mind third-generation sequencing technologies, which we discussed during the first week of this workshop, if your experiment would benefit from amplicons longer than Illumina sequencing technology can cover.

3. Removing primer sequences#

3.1 A single sample example#

At this point, our merged reads still contain the primer sequences. Since these regions are artefacts from the PCR amplification and not biological, they need to be removed from the reads before continuing with the bioinformatic pipeline. For this library and experiment, we have used the mlCOIintF/jgHCO2198 primer set (Leray et al., 2013). The forward primer corresponds to 5’-GGWACWGGWTGAACWGTWTAYCCYCC-3’ and the reverse primer sequence is 5’-TAIACYTCIGGRTGICCRAARAAYCA-3’. Before batch processing every sample, let’s test our code on a single sample again. For primer or adapter removal, we can use the program cutadapt. To specify the primers to be removed, we can use the -a parameter. Since we’re removing both the forward and reverse primer, we can link them together using .... Remember to use the reverse complement of the reverse primer, as this is the orientation in which the reverse primer appears in our sequence data after merging (see the sketch below for a way to generate it). The minimum and maximum length of the amplicon can be specified with the -m and -M parameters, respectively. To only keep reads for which both primers were found and removed, we need to specify the --discard-untrimmed option. The --no-indels and -e 2 parameters tell the program not to include insertions and deletions in the search and to allow a maximum of 2 errors in the primer sequence. We can specify the --revcomp parameter to search for the primer sequences in both directions. Finally, we can use the --cores=0 parameter to automatically detect the number of available cores.
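If you don’t want to reverse complement a degenerate primer by hand, a quick bash sketch using rev and tr can do it. The character mapping below covers the IUPAC ambiguity codes and, as an assumption, treats inosine (I) as pairing with A, C, and T, so its complement becomes H; double-check the output against your own primer before using it.

echo "TAIACYTCIGGRTGICCRAARAAYCA" | rev | tr "ACGTRYSWKMBDHVNI" "TGCAYRSWMKVHDBNH"

The output, TGRTTYTTYGGHCAYCCHGARGTHTA, matches the reverse complement used in the cutadapt command below.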

nano sequenceData/1-scripts/example-script.sh
cutadapt sequenceData/2-raw/COI_100_HK37_merged.fastq -a GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTYGGHCAYCCHGARGTHTA -m 300 -M 330 --discard-untrimmed -o sequenceData/4-demux/COI_100_HK37_trimmed.fastq --no-indels -e 2 --revcomp --cores=0

Press ctrl + x and y to save the file and change the name to adapter_removal_single.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/adapter_removal_single.sh

The cutadapt output provides information on how many reads were analysed, how many reads were trimmed on the plus and minus strand, plus a detailed overview of where the adapters were cut in the sequences. For our sample sequenceData/2-raw/COI_100_HK37_merged.fastq, 146,040 reads were processed, 145,999 reads were found to contain both primers, and 4 reads contained the primers in the reverse complement orientation. As the options we specified managed to remove the primer sequences from nearly all reads, we will use these settings in our for loop to batch process all samples.

3.2 Batch trimming#

Similar to the merging of reads, we need to process 27 samples. Hence, we will write another for loop to accomplish this goal, rather than executing the code to trim primer sequences 27 times ourselves. Because the cutadapt output that is written to the Terminal window is very elaborate, we will write it to a file instead. In the next section (Summarizing cutadapt output), we’ll parse the output to produce some graphs and get a better overview of how cutadapt performed across the 27 samples. First, we will remove the file we just generated and move to the sequenceData/2-raw/ directory. As we’ve introduced the for loop syntax before and explained the cutadapt code snippet, I won’t be going into too much detail on each aspect.

rm sequenceData/4-demux/COI_100_HK37_trimmed.fastq
nano sequenceData/1-scripts/example-script.sh
cd sequenceData/2-raw/
for fq in *merged.fastq
do

  echo "trimming primer seqs for: ${fq/_merged.fastq/}"
  cutadapt ${fq} -a GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTYGGHCAYCCHGARGTHTA -m 300 -M 330 --discard-untrimmed -o ../4-demux/${fq/merged.fastq/trimmed.fastq} --no-indels -e 2 --revcomp --cores=0 >> ../0-metadata/cutadapt_primer_trimming.txt

done

Press ctrl + x and y to save the file and change the name to adapter_removal_batch.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/adapter_removal_batch.sh

When executing this for loop, cutadapt will attempt to remove the primer sequences from all reads in both directions (parameter --revcomp). The output files containing the reads where the primers were successfully removed are placed in the subdirectory sequenceData/4-demux/ and will have the extension _trimmed.fastq rather than _merged.fastq. As mentioned above, the cutadapt output was written to sequenceData/0-metadata/cutadapt_primer_trimming.txt, rather than the Terminal window. We can open this text file using a text editor, such as Sublime Text. However, due to the number of samples, it is difficult to get an overview of the success rate and compare the different samples, a similar issue we encountered when merging forward and reverse reads.

_images/cutadaptstatistics2.png

Fig. 11 : Output of the cutadapt code, providing information about adapter removal success rate.#

3.3 Summarizing cutadapt output#

To get a better overview of the read statistics after primer trimming and the success rate of cutadapt to locate and trim the primer-binding regions, we can alter the python code producing the bar graph to incorporate the sequence counts after primer trimming. This new version of the code takes in three arguments, the first is the location where the merged sequence files are stored, the second is the location of the raw sequence files, and the third is the location of the trimmed sequence files. Rather than trying to alter the code in the already-existing script sequenceData/1-scripts/rawMergedStatistics.py, we will generate a new script where we can simply cut and paste the code below. Generating, saving, and making the script executable follows the same process as discussed previously.

nano sequenceData/1-scripts/rawMergedTrimmedStatistics.py
#! /usr/bin/env python3

## import modules
import os
import sys
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## user arguments
mergedPath = sys.argv[1]
rawPath = sys.argv[2]
trimmedPath = sys.argv[3]

## first, create a sample name list
mergedFileList = os.listdir(mergedPath)
sampleNameList = []
for mergedFile in mergedFileList:
    sampleName = mergedFile.split('_merged.fastq')[0]
    if sampleName.startswith('COI_') or sampleName.startswith('NTC_'):
        sampleNameList.append(sampleName)

## count number of raw and merged sequences for each sample in sampleNameList
rawSeqCount = {}
mergedSeqCount = {}
trimmedSeqCount = {}
for sample in sampleNameList:
    with open(f'{mergedPath}{sample}_merged.fastq', 'r') as mergedFile:
        x = len(mergedFile.readlines()) / 4
        mergedSeqCount[sample] = int(x)
    with open(f'{rawPath}{sample}_1.fastq', 'r') as rawFile:
        y = len(rawFile.readlines()) / 4
        rawSeqCount[sample] = int(y)
    with open(f'{trimmedPath}{sample}_trimmed.fastq', 'r') as trimmedFile:
        z = len(trimmedFile.readlines()) / 4
        trimmedSeqCount[sample] = int(z)

## create a dataframe from the dictionaries
df = pd.DataFrame({'Sample': list(rawSeqCount.keys()), 'Raw': list(rawSeqCount.values()), 'Merged': list(mergedSeqCount.values()), 'Trimmed': list(trimmedSeqCount.values())})

## sort the dataframe by raw reads in descending order
df = df.sort_values(by='Raw', ascending=False)

## calculate the percentage of merged/raw and format it with 2 decimal places and the '%' symbol
df['Percentage'] = (df['Trimmed'] / df['Raw'] * 100).round(2).astype(str) + '%'

## create a horizontal bar plot using seaborn
plt.figure(figsize=(20, 8))  # Adjust the figure size as needed

## use seaborn's barplot
ax = sns.barplot(x='Raw', y='Sample', data=df, label='Raw', color='#BBC6C8')
sns.barplot(x='Merged', y='Sample', data=df, label='Merged', color='#469597')
sns.barplot(x='Trimmed', y='Sample', data=df, label='Trimmed', color='#DDBEAA')

## add labels and title
plt.xlabel('Number of sequences')
plt.ylabel('Samples')
plt.title('Horizontal bar graph of raw, merged, and trimmed reads (Sorted by Total in Reverse)')

## add a legend
plt.legend()

## add the percentage label next to each bar
for i, v in enumerate(df['Percentage']):
    ax.text(df['Raw'].values[i] + 50, i, v, va='center', fontsize=10, color='black')

## save the plot
plt.tight_layout()
plt.savefig('raw_and_merged_and_trimmed_bargraph.png', dpi = 300)

Press ctrl + x to exit out of the editor, followed by y and return.

chmod +x sequenceData/1-scripts/rawMergedTrimmedStatistics.py
nano sequenceData/1-scripts/example-script.sh
./sequenceData/1-scripts/rawMergedTrimmedStatistics.py sequenceData/2-raw/ sequenceData/2-raw/unzipped/ sequenceData/4-demux/

Press ctrl + x and y to save the file and change the name to check_adapter_trimming.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/check_adapter_trimming.sh
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/raw_and_merged_and_trimmed_bargraph.png ./
_images/raw_and_merged_and_trimmed_bargraph.png

Fig. 12 : Read count of raw, merged, and trimmed files.#

The figure shows a similar success rate for cutadapt in locating and trimming primers across all samples, with roughly 85% of raw sequences retained after merging and primer trimming. Interestingly, for the negative control samples (NTC_1, NTC_2, and NTC_3), hardly any reads are returned after primer trimming. To see what happened with the negative controls, let’s run the little python script below to print the cutadapt summary from the file we generated during the for loop. The python code takes two user arguments: (1) the file name of the cutadapt summary file to parse and (2) a list of sample names to print the summary statistics for, separated by + symbols.

nano sequenceData/1-scripts/cutadaptParsing.py
#! /usr/bin/env python3

## import modules
import sys

## user arguments
cutadaptResults = sys.argv[1]
sampleNames = sys.argv[2]

## create sample list
sampleList = sampleNames.split('+')

## parse cutadapt file
printLine = 0
startPrint = 0
with open(cutadaptResults, 'r') as infile:
  for line in infile:
    if line.startswith('=== Adapter 1 ==='):
      startPrint = 0
      printLine = 0
    if startPrint == 1:
      print(line.rstrip('\n'))
    if printLine == 1:
      if line.startswith('=== Summary ==='):
        startPrint = 1
        print(line.rstrip('\n'))
    if line.startswith('Command line parameters:'):
      for sample in sampleList:
        if sample in line:
          print(f'{sample} summary statistics:')
          printLine = 1

Press ctrl + x to exit out of the editor, followed by y and return.

chmod +x sequenceData/1-scripts/cutadaptParsing.py
nano sequenceData/1-scripts/example-script.sh
./sequenceData/1-scripts/cutadaptParsing.py sequenceData/0-metadata/cutadapt_primer_trimming.txt NTC_1_merged.fastq+NTC_2_merged.fastq+NTC_3_merged.fastq

Press ctrl + x and y to save the file and change the name to check_neg_trimming.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/check_neg_trimming.sh

The cutadapt summary statistics for the negative control samples show that a large proportion of the reads was discarded for being too short (NTC_2: 98.1%; NTC_3: 97.0%). This probably means that some primer-dimer sequences slipped through the size selection step of the library preparation protocol and got sequenced. While primer-dimers should be removed during library preparation, Illumina instruments favour sequencing shorter fragments, so the low numbers observed in this tutorial data set are a normal occurrence and nothing to be worried about.

4. Quality filtering#

During the bioinformatic pipeline, it is critical to only retain high-quality reads to reduce the abundance and impact of spurious sequences. There is an intrinsic error rate to all polymerases used during PCR amplification, as well as to sequencing technologies. For example, the most frequently used polymerase in PCR is Taq, which lacks 3’ to 5’ exonuclease proofreading activity, resulting in relatively low replication fidelity. These errors generate a portion of sequences that vary from their biological origin. Such reads can substantially inflate metrics such as alpha diversity, especially in a denoising approach (more on this later). While it is near impossible to remove all of these sequences bioinformatically, especially PCR errors, we will attempt to remove erroneous reads by filtering on base call quality scores (the fourth line of a sequence record in a .fastq file).

Important

When processing your own metabarcoding data, I suggest you follow the same structure as we did with merging and trimming reads, i.e., check the code on one sample, play around with the parameter settings, and only batch process your samples when everything works fine. However, to save time during this tutorial, we will move straight to batch processing the samples for the quality filtering steps with the correct parameters.

4.1 Batch processing#

For quality filtering, we will be using the --fastq_filter command in VSEARCH. During this step, we will discard all sequences that do not adhere to a specific set of rules. Quality filtering parameters are not standardized, but rather specific to each library. For our tutorial data, we will filter out all sequences outside a much stricter size range compared to the settings we used during primer trimming (--fastq_minlen 310 and --fastq_maxlen 316). Additionally, we will remove sequences that have ambiguous base calls, i.e., bases denoted as N rather than A, C, G, or T (--fastq_maxns 0). Ambiguous base calls, however, are not a real issue with Illumina sequencing data. The last parameter we will use to filter our sequences is a maximum number of expected errors (--fastq_maxee 1.0). As this is the last step in our bioinformatic pipeline where quality scores are essential, we will export our output files in both .fastq and .fasta format. Additionally, we will be merging all files together after this step in the bioinformatic pipeline. Before merging all files, however, we need to change the sequence headers to record which sample each sequence belongs to, information that is currently only stored in the file name. We can relabel the sequence headers based on the file name using the --relabel parameter.

nano sequenceData/1-scripts/example-script.sh
cd sequenceData/4-demux/
for fq in *_trimmed.fastq
do

  echo "Merging reads for: ${fq/_trimmed.fastq/}"
  vsearch --fastq_filter ${fq} --fastq_maxee 1.0 --fastq_maxlen 316 --fastq_minlen 310 --fastq_maxns 0 --fastqout ../5-filter/${fq/_trimmed.fastq/_filtered.fastq} --fastaout ../5-filter/${fq/_trimmed.fastq/_filtered.fasta} --relabel ${fq/_trimmed.fastq/}.

done

Press ctrl + x and y to save the file and change the name to quality_filter.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/quality_filter.sh

Exercise 3

How would you check whether the sequence headers of the .fastq and .fasta output files have been relabeled and contain the sample information?
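One possible approach (a hint, not the only solution): print the first header of an output file and check for the sample name, e.g.:

head -n 1 sequenceData/5-filter/COI_100_HK37_filtered.fastq
grep '^>' sequenceData/5-filter/COI_100_HK37_filtered.fasta | head -n 3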

4.2 Summary output#

While VSEARCH reports the numbers for each sample before and after quality filtering, we will update the python script that generates the bar graph again, now also including the filtered reads. This new python script takes four user arguments: the same three as before, plus the path where the filtered sequence files are stored. This will allow us to easily see how the quality filtering performed and compare differences between samples.

nano sequenceData/1-scripts/rawMergedTrimmedFilteredStatistics.py
#! /usr/bin/env python3

## import modules
import os
import sys
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## user arguments
mergedPath = sys.argv[1]
rawPath = sys.argv[2]
trimmedPath = sys.argv[3]
filteredPath = sys.argv[4]


## first, create a sample name list
mergedFileList = os.listdir(mergedPath)
sampleNameList = []
for mergedFile in mergedFileList:
    sampleName = mergedFile.split('_merged.fastq')[0]
    if sampleName.startswith('COI_') or sampleName.startswith('NTC_'):
        sampleNameList.append(sampleName)

## count number of raw and merged sequences for each sample in sampleNameList
rawSeqCount = {}
mergedSeqCount = {}
trimmedSeqCount = {}
filteredSeqCount = {}
for sample in sampleNameList:
    with open(f'{mergedPath}{sample}_merged.fastq', 'r') as mergedFile:
        x = len(mergedFile.readlines()) / 4
        mergedSeqCount[sample] = int(x)
    with open(f'{rawPath}{sample}_1.fastq', 'r') as rawFile:
        y = len(rawFile.readlines()) / 4
        rawSeqCount[sample] = int(y)
    with open(f'{trimmedPath}{sample}_trimmed.fastq', 'r') as trimmedFile:
        z = len(trimmedFile.readlines()) / 4
        trimmedSeqCount[sample] = int(z)
    with open(f'{filteredPath}{sample}_filtered.fastq', 'r') as filteredFile:
        a = len(filteredFile.readlines()) / 4
        filteredSeqCount[sample] = int(a)


## create a dataframe from the dictionaries
df = pd.DataFrame({'Sample': list(rawSeqCount.keys()), 'Raw': list(rawSeqCount.values()), 'Merged': list(mergedSeqCount.values()), 'Trimmed': list(trimmedSeqCount.values()), 'Filtered': list(filteredSeqCount.values())})

## sort the dataframe by raw reads in descending order
df = df.sort_values(by='Raw', ascending=False)

## calculate the percentage of merged/raw and format it with 2 decimal places and the '%' symbol
df['Percentage'] = (df['Filtered'] / df['Raw'] * 100).round(2).astype(str) + '%'


## create a horizontal bar plot using seaborn
plt.figure(figsize=(20, 8))  # Adjust the figure size as needed

## use seaborn's barplot
ax = sns.barplot(x='Raw', y='Sample', data=df, label='Raw', color='#BBC6C8')
sns.barplot(x='Merged', y='Sample', data=df, label='Merged', color='#469597')
sns.barplot(x='Trimmed', y='Sample', data=df, label='Trimmed', color='#DDBEAA')
sns.barplot(x='Filtered', y='Sample', data=df, label='Filtered', color='#806491')

## add labels and title
plt.xlabel('Number of sequences')
plt.ylabel('Samples')
plt.title('Horizontal bar graph of raw, merged, trimmed, and filtered reads (Sorted by Total in Reverse)')

## add a legend
plt.legend()

## add the percentage label next to each bar
for i, v in enumerate(df['Percentage']):
    ax.text(df['Raw'].values[i] + 50, i, v, va='center', fontsize=10, color='black')

## save the plot
plt.tight_layout()
plt.savefig('raw_and_merged_and_trimmed_and_filtered_bargraph.png', dpi = 300)

Press ctrl + x to exit out of the editor, followed by y and return.

chmod +x sequenceData/1-scripts/rawMergedTrimmedFilteredStatistics.py
nano sequenceData/1-scripts/example-script.sh
./sequenceData/1-scripts/rawMergedTrimmedFilteredStatistics.py sequenceData/2-raw/ sequenceData/2-raw/unzipped/ sequenceData/4-demux/ sequenceData/5-filter/

Press ctrl + x and y to save the file and change the name to check_quality_filter.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/check_quality_filter.sh
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/raw_and_merged_and_trimmed_and_filtered_bargraph.png ./
_images/raw_and_merged_and_trimmed_and_filtered_bargraph.png

Fig. 13 : Read count of raw, merged, trimmed, and filtered files.#

This bar graph shows that after quality filtering, we still keep around 80% of the raw reads, which is very good and due to the high-quality data we started with (remember the multiQC output from before!). Again, a big difference between the actual samples and the negative controls is observed.

4.3 Check filtering step#

Before continuing with the bioinformatic pipeline, it is best to check that the quality filtering step retained only high-quality sequences of ~313 bp (the estimated length of the amplicon). If quality filtering was successful, we can move on to the next steps in our bioinformatic pipeline. To check the quality and length of our filtered sequence files, we will use FASTQC and MULTIQC, same as before.
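For a quick command-line check of the length distribution, complementary to the FastQC reports generated below, you could count the read lengths directly from one of the filtered files with a small awk sketch (the second line of every 4-line record is the sequence):

awk 'NR % 4 == 2 {print length($0)}' sequenceData/5-filter/COI_100_HK37_filtered.fastq | sort -n | uniq -c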

nano sequenceData/1-scripts/example-script.sh
fastqc sequenceData/5-filter/*.fastq -o sequenceData/3-fastqc/ -t 8

Press ctrl + x and y to save the file and change the name to fastqc_quality_filter.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/fastqc_quality_filter.sh

FastQC will generate a .html report for every single .fastq file in the subdirectory sequenceData/5-filter/. Since we are working with 27 files, it will be easier to compare the reports by collating them, something we can do using multiQC.

nano sequenceData/1-scripts/example-script.sh
multiqc sequenceData/3-fastqc/*filtered* -o sequenceData/3-fastqc/

Press ctrl + x and y to save the file and change the name to multiqc_quality_filter.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/multiqc_quality_filter.sh
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/sequenceData/3-fastqc/multiqc_report.html ./

The multiQC program will combine all 27 FastQC reports into a single .html document. Let’s open this to see what our filtered sequence data looks like. Since we have written both multiQC reports (raw and filtered) to the same sequenceData/3-fastqc/ directory, make sure to open the latest report. The file name will most likely end in _1.html.

_images/multiqcreportfiltered2.png

Fig. 14 : The .html multiQC report for the quality filtered reads#

From this multiQC report, we can see that we only retained high quality reads that are of length 313 bp. Exactly what we need to move forward with our bioinformatic pipeline!

5. Dereplication#

5.1 Combining data into one file#

Once we have verified that the reads passing quality filtering are of high quality and of a length similar to the expected amplicon size, we can move forward with our bioinformatic pipeline. For the final steps of the bioinformatic pipeline, it makes more sense to combine all the files into a single file. So, let’s do this first using the cat command, which stands for concatenate. Since base call quality scores are not essential for our analysis from this point onwards, we can combine the .fasta files to reduce the file size and computational cost. To concatenate all files, we can use the * symbol. Make sure to specify the output file using >, otherwise cat will print the combined output to the Terminal window instead.

cat sequenceData/5-filter/*.fasta > sequenceData/6-quality/combined.fasta

Exercise 4

How would you determine if no sequences were lost during the cat command to merge all .fasta files?
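One possible approach (a hint, not the only solution): compare the summed number of header lines across all input files with the number of header lines in the combined file; the two counts should match.

grep -c '^>' sequenceData/5-filter/*.fasta | awk -F':' '{sum += $2} END {print sum}'
grep -c '^>' sequenceData/6-quality/combined.fasta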

5.2 Finding unique sequences#

With the cat command, we have created a single file. The next step in our bioinformatic pipeline is to dereplicate our data, i.e., find the unique sequences. Since metabarcoding data is based on a PCR amplification method, the same DNA molecule will have been copied thousands or millions of times and, therefore, sequenced multiple times. To reduce the file size and computational cost, it is convenient to work with unique sequences and keep track of how many times each unique sequence was found in our data set. This dereplication step is also the reason why we combined all samples into a single file, as the same barcode (or species, if you will) can be found across different samples.

We can dereplicate our data by using the --derep_fulllength command in VSEARCH. The --sizeout parameter keeps a tally of the number of times each unique sequence was observed across all samples and places this information in the sequence header. Since we are looking at unique sequences across samples, it wouldn’t make much sense to keep the current sequence header information, as it currently refers to which sample a sequence belongs to. We will, therefore, rename the sequence headers by using the --relabel parameter.

nano sequenceData/1-scripts/example-script.sh
vsearch --derep_fulllength sequenceData/6-quality/combined.fasta --sizeout --relabel uniq. --output sequenceData/6-quality/uniques.fasta

Press ctrl + x and y to save the file and change the name to dereplication.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/dereplication.sh

We can see from the VSEARCH output that 244,379 out of 2,230,993 sequences were unique. Each unique sequence was observed 9.1 times on average, with a median of 1 and a maximum of 68,022. As mentioned before, the --sizeout parameter has added the abundance information for each unique sequence to the header. Let’s use the head command to investigate the first couple of sequences.

head -n 10 sequenceData/6-quality/uniques.fasta

The output of the head command shows that the unique sequences are already sorted by abundance, with the most abundant unique sequence occurring 68,022 times and the second most abundant sequence occurring 59,045 times.

Exercise 5

How would you determine the number of times the 10 most and 10 least abundant unique sequences were observed?
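One possible approach (a hint, not the only solution), relying on the fact that VSEARCH outputs the unique sequences sorted by abundance and stores the counts as size= annotations in the headers:

grep '^>' sequenceData/6-quality/uniques.fasta | head -n 10
grep '^>' sequenceData/6-quality/uniques.fasta | tail -n 10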

6. Denoising#

Now that our data set is filtered and unique sequences have been retrieved, we are ready for the next step in the bioinformatic pipeline: looking for biologically meaningful or biologically correct sequences. Two approaches exist to achieve this goal: denoising and clustering. There is still an ongoing debate on the best approach to obtain these biologically meaningful sequences. For more information, these two papers are a good place to start: Brandt et al., 2021 and Antich et al., 2021. For this tutorial, we won’t discuss this topic in much detail, but this is the basic idea…

When clustering the dataset, OTUs (Operational Taxonomic Units) are generated by combining sequences that are similar to a set percentage level (traditionally 97%), with the most abundant sequence identified as the true sequence. When clustering at 97%, this means that if a sequence is more than 3% different from the generated OTU, a second OTU will be generated. The concept of OTU clustering was introduced in the 1960s and has been debated ever since. Clustering the dataset is usually used for identifying species in metabarcoding data.

Denoising, on the other hand, attempts to identify all correct biological sequences through an algorithm. In short, denoising clusters the data at a 100% threshold and tries to identify errors based on abundance differences. The retained sequences are called ZOTUs (Zero-radius Operational Taxonomic Units) or ASVs (Amplicon Sequence Variants). Denoising the dataset is usually used for identifying intraspecific variation in metabarcoding data. A schematic of both approaches can be found below.

_images/denoisingandclustering.png

Fig. 15 : A schematic representation of denoising and clustering. Copyright: Antich et al., 2021#

This difference in approach may seem small but has a very big impact on your final dataset!

When you denoise the dataset, it is expected that one species may have more than one ZOTU. This means that when the reference database is incomplete, or you plan to work with ZOTUs instead of taxonomic assignments, your diversity estimates will be highly inflated. When clustering the dataset, on the other hand, it is expected that an OTU may have more than one species assigned to it, meaning that you may lose some correct biological sequences present in your data by merging species whose barcodes are more than 97% similar. In other words, you will miss out on differentiating closely related species and intraspecific variation.

For this tutorial, we will use the denoising approach, as it is favoured in recent years. However, clustering is still a valid option. So, feel free to explore this approach for your own data set if you prefer.

For denoising, we will use the unoise3 algorithm as implemented in the --cluster_unoise command in VSEARCH. Since denoising is based on the read abundance of the unique sequences, we specify the --sizein parameter. The minimum abundance threshold for a true denoised read defaults to 8 reads, as specified by the developer of the unoise3 algorithm. However, research by Bokulich et al., 2013, identified a relative minimum abundance threshold of 0.001% of the total number of reads to be more appropriate. Hence, we will set the --minsize parameter to 0.001% of our read count, which in our case is ~22 reads (see the calculation below). As we will be merging different reads with varying abundances, we need to recalculate the new count for each denoised read using the --sizeout parameter. Relabeling the sequence headers can be done through the --relabel parameter, and the output file is specified using --centroids.
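The 22-read threshold follows from the total number of quality-filtered sequences that went into dereplication (2,230,993). A quick sanity check of the arithmetic with awk:

awk 'BEGIN {printf "%.1f\n", 2230993 * 0.001 / 100}'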

nano sequenceData/1-scripts/example-script.sh
vsearch --cluster_unoise sequenceData/6-quality/uniques.fasta --sizein --minsize 22 --sizeout --relabel denoised. --centroids sequenceData/6-quality/denoised.fasta

Press ctrl + x and y to save the file and change the name to denoising.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/denoising.sh

From the output, we can see that we have generated 1,372 denoised reads and discarded 239,520 unique sequences that did not reach the minimum abundance of 22 reads.

7. Chimera removal#

The second-to-last step in our bioinformatic pipeline is to remove chimeric sequences. Amplicon sequencing has the potential to generate chimeric reads, which can cause spurious inference of biological variation. Chimeric amplicons form when an incomplete DNA strand anneals to a different template and primes synthesis of a new template derived from two different biological sequences; in other words, chimeras are artefact sequences formed by two or more biological sequences incorrectly joined together. More information can be found on this website, and a simple illustration can be found below.

_images/chimeras.jpg

Fig. 16 : A schematic representation of chimera formation. Copyright: Genome Research#

We will use the uchime3_denovo algorithm, as implemented in the --uchime3_denovo command in VSEARCH, to remove chimeric sequences from our denoised reads. This method is also based on sequence abundance; hence, we need to provide the --sizein parameter. As the output file (parameter --nonchimeras) will contain our biologically relevant sequences, we will relabel our sequence headers using --relabel zotu..

nano sequenceData/1-scripts/example-script.sh
vsearch --uchime3_denovo sequenceData/6-quality/denoised.fasta --sizein --nonchimeras sequenceData/8-final/zotus.fasta --relabel zotu.

Press ctrl + x and y to save the file and change the name to chimeras.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/chimeras.sh

The output shows that 98% of our denoised reads were kept, with 27 sequences identified as chimeric. Taking abundance information into account, 0.1% of reads were identified as chimeras.

The sequenceData/8-final/zotus.fasta file is the first output file we have created that we need for our statistical analysis!

Important

If you use the unoise3 algorithm as implemented in USEARCH, chimera removal is built into the denoising step and will not have to be conducted separately!

8. Frequency table#

Now that we have created our list of biologically relevant sequences, ZOTUs, or “species”, we are ready to generate a frequency table, also known as a count table. This will be the last step in our bioinformatic pipeline. A frequency table is something you might have encountered before in traditional ecological surveys, whereby a table is created with site names as column headers, a species list as rows, and the number in each cell representing the number of individuals observed for a specific species at a specific site.

In metabarcoding studies, a frequency table is analogous: it tells us how many times each sequence appeared in each sample. It is the end product of all the bioinformatic processing steps we have conducted today. Now that we have identified what we believe to be true biological sequences, we will generate our frequency table by matching the merged sequences to our ZOTU sequence list. Remember that the sequences within the sequenceData/6-quality/combined.fasta file carry the information of the sample they belong to in the sequence header, which is how the --usearch_global command in VSEARCH can generate the frequency table. The --db parameter allows us to set the ZOTU sequence list (sequenceData/8-final/zotus.fasta) as the database to search against, while we can set the --strand parameter to plus, since all sequences are in the same orientation after primer trimming. Finally, we incorporate a 97% identity threshold for this function through the --id parameter. This might seem counter-intuitive, since we employed a denoising approach. However, providing some leniency on which sequences can map to a ZOTU allows us to incorporate a larger percentage of the data set. As some ZOTUs might be more than 97% similar to each other, the algorithm will sort out the best match and add the sequence to the correct ZOTU. If you’d like to be more conservative, you can set this threshold to 99%, though this is not recommended by the authors.

nano sequenceData/1-scripts/example-script.sh
vsearch --usearch_global sequenceData/6-quality/combined.fasta --db sequenceData/8-final/zotus.fasta --strand plus --id 0.97 --otutabout sequenceData/8-final/zotutable.txt

Press ctrl + x and y to save the file and change the name to frequency_table.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/frequency_table.sh

The output printed to the Terminal window shows we managed to map 2,195,816 (98.42%) sequences to the ZOTU list, which were incorporated in our frequency table. The output file is a simple text file, which we can open in Excel. Below is a screenshot where I coloured the samples (column headers) blue and the sequence list (row names) yellow.

_images/freqtable.png

Fig. 17 : The frequency table output file in Excel, where the samples are coloured blue and the sequences are coloured yellow.#

8.1 Summarise sequence counts#

As a final step, let’s update the python script that produces a bar graph again, now including the final read count incorporated for each sample. This new python script takes in five user arguments, the same four as before, plus the frequency table file name. This final bar graph will allow us to get a great overview of what we have accomplished today.

nano sequenceData/1-scripts/rawMergedTrimmedFilteredFinalStatistics.py
#! /usr/bin/env python3

## import modules
import os
import sys
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## user arguments
mergedPath = sys.argv[1]
rawPath = sys.argv[2]
trimmedPath = sys.argv[3]
filteredPath = sys.argv[4]
freqTable = sys.argv[5]


## first, create a sample name list
mergedFileList = os.listdir(mergedPath)
sampleNameList = []
for mergedFile in mergedFileList:
    sampleName = mergedFile.split('_merged.fastq')[0]
    if sampleName.startswith('COI_') or sampleName.startswith('NTC_'):
        sampleNameList.append(sampleName)

## count number of raw and merged sequences for each sample in sampleNameList
rawSeqCount = {}
mergedSeqCount = {}
trimmedSeqCount = {}
filteredSeqCount = {}
for sample in sampleNameList:
    with open(f'{mergedPath}{sample}_merged.fastq', 'r') as mergedFile:
        x = len(mergedFile.readlines()) / 4
        mergedSeqCount[sample] = int(x)
    with open(f'{rawPath}{sample}_1.fastq', 'r') as rawFile:
        y = len(rawFile.readlines()) / 4
        rawSeqCount[sample] = int(y)
    with open(f'{trimmedPath}{sample}_trimmed.fastq', 'r') as trimmedFile:
        z = len(trimmedFile.readlines()) / 4
        trimmedSeqCount[sample] = int(z)
    with open(f'{filteredPath}{sample}_filtered.fastq', 'r') as filteredFile:
        a = len(filteredFile.readlines()) / 4
        filteredSeqCount[sample] = int(a)

freqTableDF = pd.read_csv(freqTable, delimiter = '\t', index_col = 0).transpose()
freqTableDF['Final'] = freqTableDF.sum(axis = 1)
columnsToDrop = freqTableDF.columns.difference(['Final'])
freqTableDF.drop(columns = columnsToDrop, inplace = True)

## create a dataframe from the dictionaries
df = pd.DataFrame({'Sample': list(rawSeqCount.keys()), 'Raw': list(rawSeqCount.values()), 'Merged': list(mergedSeqCount.values()), 'Trimmed': list(trimmedSeqCount.values()), 'Filtered': list(filteredSeqCount.values())})

## merge both data frames
merged_df = df.merge(freqTableDF, left_on = ['Sample'], right_index = True, how = 'inner')

## sort the dataframe by raw reads in descending order
merged_df = merged_df.sort_values(by='Raw', ascending=False)

## calculate the percentage of merged/raw and format it with 2 decimal places and the '%' symbol
merged_df['Percentage'] = (merged_df['Final'] / merged_df['Raw'] * 100).round(2).astype(str) + '%'


## create a horizontal bar plot using seaborn
plt.figure(figsize=(20, 8))  # Adjust the figure size as needed

## use seaborn's barplot
ax = sns.barplot(x='Raw', y='Sample', data=merged_df, label='Raw', color='#BBC6C8')
sns.barplot(x='Merged', y='Sample', data=merged_df, label='Merged', color='#469597')
sns.barplot(x='Trimmed', y='Sample', data=merged_df, label='Trimmed', color='#DDBEAA')
sns.barplot(x='Filtered', y='Sample', data=merged_df, label='Filtered', color='#806491')
sns.barplot(x='Final', y='Sample', data=merged_df, label='Final', color='#2F70AF')

## add labels and title
plt.xlabel('Number of sequences')
plt.ylabel('Samples')
plt.title('Horizontal bar graph of raw, merged, trimmed, filtered, and final reads (Sorted by Total in Reverse)')

## add a legend
plt.legend()

## add the percentage label next to each bar
for i, v in enumerate(merged_df['Percentage']):
    ax.text(merged_df['Raw'].values[i] + 50, i, v, va='center', fontsize=10, color='black')

## save the plot
plt.tight_layout()
plt.savefig('raw_and_merged_and_trimmed_and_filtered_and_final_bargraph.png', dpi = 300)

Press ctrl + x to exit out of the editor, followed by y and return.

chmod +x sequenceData/1-scripts/rawMergedTrimmedFilteredFinalStatistics.py
nano sequenceData/1-scripts/example-script.sh
./sequenceData/1-scripts/rawMergedTrimmedFilteredFinalStatistics.py sequenceData/2-raw/ sequenceData/2-raw/unzipped/ sequenceData/4-demux/ sequenceData/5-filter/ sequenceData/8-final/zotutable.txt

Press ctrl + x and y to save the file and change the name to check_table.sh. Press return and y to save under a different name.

sbatch sequenceData/1-scripts/check_table.sh
scp -r ednaw01@hpc2021-io1.hku.hk:~/ednaw01/raw_and_merged_and_trimmed_and_filtered_and_final_bargraph.png ./
_images/raw_and_merged_and_trimmed_and_filtered_and_final_bargraph.png

Fig. 18 : Read count of raw, merged, trimmed, filtered, and final frequency table for each sample.#

This bar graph shows that for the final frequency table, we have kept around 80% of the raw reads for each sample, which is very good and due to the high-quality data we started with (remember the multiQC output from before!). Again, a big difference between the actual samples and the negative controls is observed.

That’s it for today, see you tomorrow when we will assign a taxonomic ID to our ZOTU sequence list!