Data curation#

1. Introduction#

At this point, we have created the three essential output files for eDNA metabarcoding research:

  1. asvs.fasta: a fasta file containing the list and sequences of our ASVs.

  2. count_table.txt: a tab-delimited text file containing sample IDs as columns, ASV IDs as rows, and read abundance in cells.

  3. blast_parsed.txt: a tab-delimited text file containing the MRCA, taxonomic lineage, and BLAST values of our ASVs.

Before exploring the data and conducting the statistical analysis, however, we need to perform some data curation steps to further clean the data. Within the eDNA metabarcoding research community, there is not yet a standardised approach to data curation, but steps that are frequently included are (the list is not in order of importance):

  1. Abundance filtering of specific detections to eliminate the risk of barcode hopping signals interfering with the analysis.

    1. Either involves removing singleton detections or applying an abundance threshold (usually requires positive control samples).

  2. Filtering of ASVs that have a positive detection in the negative control samples, by:

    1. Removal of all ASVs in negative controls.

    2. Applying an abundance threshold obtained from the read count in the negative (and positive) controls.

    3. A statistical test.

  3. Removal of low-confidence taxonomy assignments (if analysis is taxonomy-dependent).

  4. Discarding samples with low read count.

  5. Removal of PCR artefacts.

  6. Data transformations from raw read count to relative abundance, presence-absence, or other transformations (see the sketch below this list).
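
To give an idea of what the last point entails, the two most common transformations take only a line of code each. Below is a minimal sketch in R, assuming a count table (ASVs as rows, samples as columns) such as the one we read into R later in this section:

## sketch: transform raw read counts to relative abundance and presence-absence
# relative abundance per sample (each column sums to 1; assumes no all-zero samples)
relative_abundance <- sweep(count_table, 2, colSums(count_table), FUN = "/")
# presence-absence (1 = detected, 0 = not detected)
presence_absence <- (count_table > 0) * 1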

2. Artefact removal#

As a first data curation step, we will remove spurious artefacts using a novel algorithm developed by our team, i.e., tombRaider. Although stringent quality filtering parameters were employed during the bioinformatic processing steps, artefacts still persist and are known to inflate diversity estimates. These artefacts might arise from the incorporation of PCR errors (Taq, the most frequently used polymerase in PCR amplification, does not exhibit proofreading capability) or the presence of pseudogenes. For a more in-depth look into artefacts and artefact removal, please consult the tombRaider manuscript.

tombRaider can identify and remove artefacts using a single line of code, based on taxonomic ID, sequence similarity, the presence of stop codons, and co-occurrence patterns. For a detailed look into the workflow, please have a look at the figure below.

_images/tombRaider_workflow.png

Fig. 38 : The tombRaider workflow to remove artefacts from metabarcoding data. Copyright: Jeunen et al., 2024#

Similar to CRABS, tombRaider is a CLI program written in Python. Hence, we will use the same approach as before, where we will copy-paste the code into a new text document in RStudio and execute the code in the Terminal window.

To execute tombRaider, we need to provide the following parameters:

  1. the list of criteria we want to incorporate into the algorithm for deciding what is an artefact and what is not (--criteria);

  2. a count table input file (--frequency-input);

  3. a list of ASVs (--sequence-input);

  4. a file containing information on the taxonomic ID of each sequence (--taxonomy-input);

  5. the format of this file in case it is a BLAST file (--blast-format);

  6. a count table output file (--frequency-output);

  7. an output file name for the list of new ASVs (--sequence-output);

  8. a taxonomy output file name (--taxonomy-output);

  9. a log output file name (--log);

  10. the datatype to take into account (--occurrence-type);

  11. the ratio of the co-occurrence model (--occurrence-ratio);

  12. a parameter specifying if the data needs sorting (--sort); and

  13. the sequence similarity threshold (--similarity).

So, let’s open the taxonomy_assignment.txt file and copy-paste the code below. For this tutorial, we will only incorporate taxonomic ID and co-occurrence patterns into the algorithm’s decision tree to speed up the process. However, for your own data, I suggest also including sequence similarity. One way to speed up the algorithm when including sequence similarity is to provide a multiple sequence alignment in nexus format. I suggest looking into the AlignSeqs() function of the DECIPHER R package (something we will cover when we generate a phylogenetic tree for our dataset).

tombRaider --criteria 'taxId;coOccur' --frequency-input count_table.txt --sequence-input asvs.fasta --taxonomy-input blast_results.txt --blast-format '6 qaccver saccver staxid sscinames length pident mismatch qcovs evalue bitscore qstart qend sstart send gapopen' --frequency-output count_table_new.txt --sequence-output asvs_new.fasta --taxonomy-output blast_results_new.txt --log tombraider_log.txt --occurrence-type abundance --occurrence-ratio 'count;1' --sort 'total read count'

From tombRaider’s output, we can see that 5,585 ASVs, or 67.18% of the data, are identified as artefacts. Again, this number is likely inflated, as we did not take into account sequence similarity between ASVs, but we’ll continue for now. tombRaider will have merged the artefacts with their biologically correct parent sequences, so no reads were lost during this step.
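
As mentioned above, one way to speed up tombRaider when including sequence similarity is to supply a multiple sequence alignment in nexus format. A minimal sketch of how such an alignment could be generated, assuming the DECIPHER and ape packages are installed (the output file name is a placeholder):

## optional sketch: align the ASVs and export the alignment in nexus format
library(DECIPHER)
library(ape)
asv_seqs <- readDNAStringSet("asvs.fasta")
asv_alignment <- AlignSeqs(asv_seqs, processors = NULL)
# convert each aligned sequence to a vector of single characters for write.nexus.data()
aln_chars <- setNames(strsplit(as.character(asv_alignment), ""), names(asv_alignment))
write.nexus.data(aln_chars, file = "asvs_aligned.nex", format = "dna", interleaved = FALSE)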

Coding in R

Similar to the bioinformatic pipeline, all code for this section will be executed in RStudio unless specified otherwise!

3. Contamination removal#

The next step of data curation is to investigate what is observed in our negative control samples and remove any potential contamination in our data.

As of yet, there is no consensus in the metabarcoding research community on how to properly remove contamination from a data set. In this workshop, I will make use of the Decontam R package, introduced in the Davis et al., 2018 manuscript. This package provides a statistical test to identify contaminants based on two widely reproduced patterns: (i) contaminants appear at higher frequencies in low-concentration samples, and (ii) contaminants are often found in negative controls.

Warning

It should be noted that this approach to filtering contamination works best when DNA concentrations, measured by fluorescent intensity, are available for all samples in the metadata file. Since that is not the case for our data, we will use the prevalence method of the algorithm, rather than the more restrictive and accurate frequency method.

Other options for contamination removal, besides the Decontam R package, are (i) the microDecon R package, (ii) applying a strict filtering threshold by removing any ASV present in negative control samples, and (iii) applying an abundance threshold, whereby you calculate the highest row-wise relative abundance for a positive detection in the negative control samples and remove all positive detections that fall below this relative abundance threshold.
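
To make option (iii) more concrete, below is a minimal sketch of such an abundance-threshold filter, assuming a count table (ASVs as rows, samples as columns) like the one we read into R in the next subsection. Note that this is only an illustration, not the approach we take in this workshop:

## sketch of option (iii): abundance-threshold filtering based on negative controls
# row-wise relative abundance (reads of an ASV in a sample divided by that ASV's total reads; assumes no all-zero ASVs)
relative_abundance <- count_table / rowSums(count_table)
is_control <- grepl("Control", names(count_table), ignore.case = TRUE)
# highest relative abundance observed for any ASV in the negative control samples
threshold <- max(relative_abundance[, is_control], na.rm = TRUE)
# set all detections at or below this threshold to zero
count_table_threshold <- count_table
count_table_threshold[relative_abundance <= threshold] <- 0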

3.1 Read data into R#

To start, let’s open RStudio, create a new R script, save it as data_curation.R in the bootcamp directory, prepare the R environment, and read the data into R.

## set up the working environment
setwd(file.path(Sys.getenv("HOME"), "Desktop", "bootcamp"))
list.files()

## load necessary R packages
for (pkg in c("Biostrings", "tidyverse", "decontam", "scales", "ggplot2", "patchwork", 
              "lattice", "car", "agricolae", "ggExtra", "ampvis2", "DivE")) {
  if (!require(pkg, character.only = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
  cat(pkg, "version:", as.character(packageVersion(pkg)), "\n")
}

For reading in the four files we will be working with (metadata.txt, count_table_new.txt, asvs_new.fasta, blast_parsed.txt), we can use the following code, with one exception (note the error received for the taxonomy file!).

## read data into R
metadata <- read.table(file.path("4.results", "metadata.txt"), header = TRUE, sep = '\t', row.names = 1, check.names = FALSE, comment.char = '')
count_table <- read.table(file.path("4.results", "count_table_new.txt"), header = TRUE, sep = '\t', row.names = 1, check.names = FALSE, comment.char = '')
sequences <- readDNAStringSet(file.path("4.results", 'asvs_new.fasta'))
taxonomy <- read.table(file.path("4.results", "blast_parsed.txt"), header = TRUE, sep = '\t', row.names = 1, check.names = FALSE, comment.char = '')

For the taxonomy file, we need to use a custom function. Normally, this shouldn’t be the case. However, there seems to be a parsing issue with the output received from ALEX. Lines for which ALEX cannot generate a taxonomic lineage seem to be incorrectly formatted, as you can see in the screenshot below. The error message in the Console window also tells us about the issue.

_images/alex_output_error.png

Fig. 39 : A screenshot of the parsing error in the ALEX output file#

When encountering such an issue, standard reading functions are often not applicable and a custom function might need to be written.

Luckily, in this case, the custom function is quite “simple”. Namely, we will read the file line by line, rather than as an R dataframe. We then loop over the lines and count the number of columns in each line. If a line has the correct number of columns (11 in our case), we keep it as is; otherwise, we rework the line by adding NA values in between the ASV name and the remaining BLAST values. Next, we transform the lines into a dataframe and set the first column as row names and the first row as column names.

## set function to read in the taxonomy file due to formatting error in ALEX
process_line <- function(line) {
  elements <- unlist(strsplit(line, "\t"))
  if (length(elements) == 11) {
    # correctly formatted line: keep as is
    return(elements)
  } else {
    # malformed line: keep the ASV name, insert NA for the missing fields,
    # and keep the remaining values
    return(c(elements[1], rep(NA, 7), elements[3:5]))
  }
}

# read and format taxonomy file with ASV IDs as row names and the header fields as column names
lines <- readLines(file.path("4.results", "blast_parsed.txt"))
processed_data <- lapply(lines, process_line)
taxonomy <- as.data.frame(do.call(rbind, processed_data), stringsAsFactors = FALSE)
colnames(taxonomy) <- taxonomy[1,]
taxonomy <- taxonomy[-1,]
rownames(taxonomy) <- taxonomy[,1]
taxonomy <- taxonomy[,-1]

3.2 Updating files#

Linked dataframes

It is important to note that the 4 files we just read into R should be seen as “linked together”: if we change one file, we need to update the other files as well, because sample and sequence names need to match between the documents, otherwise errors will occur. For example, if we remove a sequence from the count table because it is deemed a contaminant, we also need to remove that sequence from the ASV sequence file and its taxonomic ID from the taxonomy table, to keep the sequence lists of these 3 files identical (the fourth file, the metadata file, does not contain information on sequences, so it does not need to be updated in this case).
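
A quick way to verify that the objects are (still) in sync after any curation step is to run a couple of sanity checks; a minimal sketch, where each line should return TRUE:

## sanity checks: sequence and sample names should match across the linked objects
all(rownames(count_table) %in% names(sequences))
all(rownames(count_table) %in% rownames(taxonomy))
all(colnames(count_table) %in% rownames(metadata))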

Looking at the dimensions of the 4 dataframes, we can see that most of them already match up. count_table has 2,729 obs. (rows as ASV IDs) and 45 variables (columns as sample IDs). sequences is a list of 2,729 elements, identical to the row count of count_table. metadata, on the other hand, has 45 obs. (rows as sample IDs), identical to the column count of count_table. taxonomy, however, has 8,315 obs. (rows as ASV IDs)! Remember that this is the original number of ASVs we observed; we still need to remove the rows that were deemed artefacts by tombRaider. So, let’s do that now.

_images/taxonomy_raw.png

Fig. 40 : Linking files before updating taxonomy#

## update the taxonomy dataframe to remove ASVs deemed artefact by tombRaider
taxonomy <- taxonomy[rownames(count_table), ]

After updating taxonomy, you should now see that the dataframe only consists of 2,729 obs., identical to count_table.

_images/taxonomy_updated.png

Fig. 41 : Linking files after updating taxonomy#

3.3 Contamination investigation#

Before we remove the contaminant signals from our data, it can be useful to print them out, to determine the seriousness of the contamination in your data. For our tutorial dataset, we have 6 negative control samples, including 3 Field controls (FieldControl_112122, FieldControl_112222, and FieldControl_112422) and 3 Filter controls (FilterControl_112122, FilterControl_112222, and FilterControl_112422). It is possible to link negative control samples to specific samples in your dataset, based on when samples and negative controls were collected, and process them separately. However, in our workshop, we will take a global approach to keep things simple and treat the 6 negative controls as controls for all eDNA samples.

To investigate contaminant signals, it is easier to create a new dataframe that contains all the potential information we might need, including:

  1. Only ASVs that have a positive signal in the negative control samples to keep the new dataframe small;

  2. A column with the total read count across all 6 negative control samples;

  3. A column with the number of times the ASV had a positive detection in the control sample;

  4. A column containing the species ID from the taxonomy assignment;

  5. A column containing the actual ASV sequence, in case we want to investigate further and BLAST the sequence manually.

## create negative control dataframe
negative_controls <- count_table %>%
  select(contains("Control")) %>%
  transmute(read_count = rowSums(.), detections = rowSums(. > 0)) %>%
  filter(read_count != 0 & detections != 0) %>%
  mutate(matching_species_IDs = taxonomy[rownames(.), "matching species IDs"],
         sequence = as.character(sequences[rownames(.)]))

Once we have this new dataframe created, we can use the summarize() function in the tidyverse R package to gather some details.

## explore negative_controls
negative_controls %>%
  summarize(
    neg_asvs = nrow(negative_controls),
    asvs_pct = round(nrow(negative_controls) / nrow(count_table) * 100, 2),
    neg_reads = sum(read_count),
    read_pct = round(sum(read_count) / sum(count_table) * 100, 2),
    single_detections = sum(detections == 1),
    max_detections = sum(detections == 6))

Based on the output, we can see that we detected 36 ASVs in negative control samples, accounting for 1.32% of total ASVs. We also observed a total of 39,737 (1.1%) reads across all negative controls. Most detections (27 ASVs), however, were only observed once across all 6 negative control samples, while only 2 ASVs were observed in all 6 negative control samples.
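
If you want to follow up on any of these detections and BLAST a sequence manually, one option is to first export the negative control ASV sequences to a small fasta file; a minimal sketch (the output file name is a placeholder):

## optional sketch: export negative control ASV sequences for manual BLAST checks
writeXStringSet(sequences[rownames(negative_controls)],
                filepath = file.path("4.results", "negative_control_asvs.fasta"))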

3.4 The Decontam R package#

Once we have explored what is in our negative controls, we are ready to identify actual contaminant signals and remove them based on the Decontam R package.

The first step in the Decontam documentation is to inspect the library size of each sample. Luckily, we already have all the necessary information stored in our metadata dataframe: metadata$nonchimera_reads contains the number of reads per sample and metadata$sampleType holds the information on whether a sample is a negative control or an actual eDNA sample.

## inspect library sizes
# calculate medians for each group
medians <- metadata %>%
  group_by(sampleType) %>%
  summarise(median_reads = median(nonchimera_reads))

# plot data
ggplot(data = metadata, aes(x = reorder(rownames(metadata), nonchimera_reads), y = nonchimera_reads, color = sampleType)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_hline(data = medians, aes(yintercept = median_reads, color = sampleType),
             linetype = "dashed", linewidth = 0.7, show.legend = FALSE) +
  geom_text(data = medians, aes(x = nrow(metadata) * 0.9, y = median_reads, 
                                label = paste("Median:", comma(median_reads)), color = sampleType),
            hjust = 0, vjust = -0.5, size = 3.5, show.legend = FALSE) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0.05, 0.15))) +
  scale_x_discrete(expand = expansion(add = c(0.5, 0.05 * nrow(metadata)))) +
  scale_color_manual(values = c("bottled water" = "#E41A1C", "seawater" = "#377EB8")) +
  labs(title = "Library Size Distribution by Sample Type",
       x = "Samples (ordered by library size)", y = "Total read count", color = "Sample Type") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8),
    panel.grid.major.x = element_blank(), legend.position = "top",
    plot.title = element_text(hjust = 0.5, face = "bold")) +
  guides(color = guide_legend(override.aes = list(size = 5)))
ggsave(filename = file.path("4.results", "library_read_count.png"), plot = last_plot(), bg = "white", dpi = 300)
_images/library_size.png

Fig. 42 : Plot depicting read count for each sample in increasing order#

The next step is to create a vector that holds the negative control sample information as a logical variable, with TRUE for negative control samples. It is important that this vector is ordered in the same way as our count table. Rather than manually creating this vector, we can make use of the grepl() function in R.

## determine contaminants using Decontam
# create logical vector
is_control <- grepl("Control", names(count_table), ignore.case = TRUE)

Once we have this vector, we can run the isContaminant() function with method = 'prevalence' to identify contaminants according to the Decontam algorithm. We will set threshold = 0.51, which allows Decontam to classify as contaminants all ASVs that are equally or more prevalent in negative control samples than in our eDNA samples. The isContaminant() function returns a dataframe with several columns, the most important being $p, which contains the probability used for classifying contaminants, and $contaminant, a logical variable that flags contaminant ASVs as TRUE based on our threshold value.

# identify contaminants
contaminants <- isContaminant(t(count_table), method = 'prevalence', neg = is_control, threshold = 0.51)
table(contaminants$contaminant)

Our output shows that Decontam identified 15 ASVs as contaminants, out of the 36 ASVs with a positive identification in negative control samples.

Interestingly, there is currently a bug in the Decontam R package, whereby ASVs present in a single negative control but absent in all eDNA samples are not identified as contaminants. To fix this issue, let’s build a new dataframe that contains (i) the number of detections in negative controls, (ii) the number of detections in our samples, and (iii) the logical variable contaminants$contaminant, which states whether Decontam identified an ASV as a contaminant or not. Next, we can set the ASVs with a prevalence of 1 in negative controls and 0 in eDNA samples to TRUE in the contaminant column. Finally, we can print out the results using the table() function.

# generate summary table and fix bug in Decontam
detection_summary <- count_table %>%
  mutate(detections_in_controls = rowSums(select(., which(is_control)) > 0),
    detections_in_samples = rowSums(select(., which(!is_control)) > 0)) %>%
  select(detections_in_controls, detections_in_samples) %>%
  mutate(contaminant = contaminants[rownames(.), "contaminant"])
detection_summary <- detection_summary %>%
  mutate(contaminant = ifelse(detections_in_controls == 1 & detections_in_samples == 0, TRUE, contaminant))
table(detection_summary$contaminant)

Once we have fixed the bug in the Decontam output, we can now see that we have identified 26 out of 36 ASVs with a positive identification in negative control samples as contaminant signals. To visualise the process of the Decontam algorithm, we can make use of our newly-created detection_summary dataframe and plot the prevalence of ASV detection between negative controls and eDNA samples.

# plot prevalence and colour by logical contaminant variable
ggplot(data = detection_summary, aes(x = detections_in_controls, y = detections_in_samples, color = contaminant)) +
  geom_abline(intercept = 0, slope = sum(is_control == FALSE) / sum(is_control == TRUE), color = "gray40", linetype = "dashed", linewidth = 0.7) +
  geom_point(size = 3, alpha = 0.7, shape = 16) +
  scale_x_continuous(limits = c(0, 6), breaks = seq(0, 6, by = 1)) +
  scale_y_continuous(limits = c(0, 39), breaks = seq(0, 40, by = 5)) +
  scale_color_manual(values = c("TRUE" = "#E41A1C", "FALSE" = "#377EB8"), name = "Contaminant Status",
                     labels = c("TRUE" = "Contaminant", "FALSE" = "Not Contaminant")) +
  labs(title = "ASV Prevalence in Samples vs Negative Controls", subtitle = "Diagonal line indicates equal prevalence in both groups",
       x = "Prevalence in Negative Controls (n detections)", y = "Prevalence in True Samples (n detections)") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linewidth = 0.2),
    legend.position = "top", plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5, color = "gray40"),
    axis.line = element_line(color = "black"), aspect.ratio = 1) +
  annotate("text", x = 4, y = 28, label = "Equal prevalence line", color = "gray40", size = 3.5, angle = 45)
ggsave(filename = file.path("4.results", "asv_prevalence.png"), plot = last_plot(), bg = "white", dpi = 300)
_images/asv_prevalence.png

Fig. 43 : The prevalence of detections of sequences in negative controls and true samples#

The final step is to subset the count_table, sequences, and taxonomy dataframes to exclude the contaminant signals from our data and explore the data lost by contamination filtering.

# remove contaminant signals and negative controls from dataframes
contaminant_asvs <- row.names(detection_summary[detection_summary$contaminant == TRUE, ])
count_table_clean <- count_table[!row.names(count_table) %in% contaminant_asvs, ]
count_table_clean <- count_table_clean[, !is_control]
taxonomy_clean <- taxonomy[!row.names(taxonomy) %in% contaminant_asvs, ]
sequences_clean <- sequences[row.names(count_table_clean)]
metadata_clean <- metadata[colnames(count_table_clean), ]

# explore the cleaned count table
sum(count_table) - sum(count_table_clean)
sum(count_table_clean) / sum(count_table) * 100
names(count_table_clean)[colSums(count_table_clean) == 0]
rownames(count_table_clean)[rowSums(count_table_clean) == 0]

The Decontam R package identified 26 ASVs as contaminant signals. After removing them from count_table, as well as removing the negative control samples, we lost 171,482 (4.7%) reads. No sample dropout was observed during contaminant filtering, i.e., no samples obtained 0 reads after removing contaminant signals.

4. Amplicon length filtering#

For amplicons with highly-conserved fragment sizes, such as the 313 bp fragment of the COI amplicon we are analysing in this dataset, we can plot the distribution of amplicon lengths and, potentially, filter out any signals that fall outside the range of the expected fragment size. Please note that this is only possible for primer sets that amplify a consistent fragment size across their target taxonomic groups. For some metabarcoding primer sets, this might not be the case. If you are unsure about the amplicon size range for your primer set, you can investigate the amplicon length of the barcodes in your custom reference database using CRABS’ --amplicon-length-figure function, which provides a line graph of the amplicon sizes per taxonomic group.

To visualise the amplicon size distribution of our ASVs, we can extract the sequence length from the sequences object and add the length information to our taxonomy dataframe.

## amplicon length filtering
# extract amplicon lengths and add to taxonomy
taxonomy_clean$sequence_length <- setNames(width(sequences_clean), names(sequences_clean))[rownames(taxonomy_clean)]

Next, we can plot the amplicon size distribution using the geom_density() function in ggplot2, as well as print the values of the number of ASVs shorter than 311 bp or longer than 315 bp.

# plot the amplicon length distribution using a density plot
ggplot(taxonomy_clean, aes(x = sequence_length)) +
  geom_density(fill = "#377EB8", alpha = 0.3, color = "#377EB8", linewidth = 0.5) +
  geom_vline(aes(xintercept = mean(sequence_length)), color = "gray40", linetype = "dashed", linewidth = 0.5) +
  geom_vline(aes(xintercept = min(sequence_length)), color = "gray40", linetype = "dashed", linewidth = 0.5) +
  geom_vline(aes(xintercept = max(sequence_length)), color = "gray40", linetype = "dashed", linewidth = 0.5) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linewidth = 0.2, color = "gray90"),
    plot.title = element_text(hjust = 0.5, face = "bold"), axis.line = element_line(color = "black")) +
  labs(title = "Distribution of Amplicon Sequence Lengths", x = "Sequence Length (bp)", y = "Density") +
  annotate("text", x = mean(taxonomy_clean$sequence_length), y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste("Mean:", round(mean(taxonomy_clean$sequence_length)), "bp"), color = "gray40") +
  annotate("text", x = min(taxonomy_clean$sequence_length), y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste("Min:", min(taxonomy_clean$sequence_length), "bp"), color = "gray40") +
  annotate("text", x = max(taxonomy_clean$sequence_length), y = Inf, vjust = 1.5, hjust = 1.1,
           label = paste("Max:", max(taxonomy_clean$sequence_length), "bp"), color = "gray40") +
  scale_x_continuous(breaks = pretty_breaks(n = 10)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
ggsave(filename = file.path("4.results", "amplicon_length_distribution.png"), plot = last_plot(), bg = "white", dpi = 300)
sum(taxonomy_clean$sequence_length < 311)
sum(taxonomy_clean$sequence_length > 315)
_images/amplicon_length_distribution.png

Fig. 44 : A density graph depicting the amplicon length distribution of our COI ASVs#

The output and graph show us that most amplicons are indeed of length 313 bp. However, we have 62 amplicons shorter than 311 bp, with the shortest amplicon length observed as 260 bp, and 50 amplicons longer than 315 bp, with the longest amplicon length observed as 376 bp. Let’s remove those short and long ASVs and update our files again before continuing with the next step.

# remove short and long ASVs, as well as update all dataframes
taxonomy_length <- taxonomy_clean %>%
  filter(sequence_length >= 311, sequence_length <= 315)
count_table_length <- count_table_clean[row.names(taxonomy_length), ]
names(count_table_length)[colSums(count_table_length) == 0]
rownames(count_table_length)[rowSums(count_table_length) == 0]
sequences_length <- sequences_clean[row.names(count_table_length)]
metadata_length <- metadata_clean
sum(count_table_clean) - sum(count_table_length)
sum(count_table_length) / sum(count_table_clean) * 100

Filtering the data based on amplicon length removed an additional 87,644 (2.5%) reads from the data, though no samples dropped out after this filtering step.

5. Abundance filtering#

5.1 Low abundant detections#

One frequently-included processing step for metabarcoding data is to remove low-abundant detections, i.e., detections that are made up of a single or a few reads. This pre-processing step is conducted to minimise the risk of cross-contamination between samples or to eliminate the effect of tag jumps. Tag jumps are sequences that have been assigned to the wrong sample, i.e., that contain the wrong tags or barcodes. For more information, I suggest reading this excellent paper by Schnell et al., 2015. Currently, there is no consensus within the metabarcoding research community on the thresholds for abundance filtering, so they are mainly user-defined. Ideally, positive controls are included in the experiment, which can provide insights into the threshold at which background noise enters the dataset. Alternative approaches are to remove singleton detections or to apply a threshold based on the total read abundance of each ASV independently. Before filtering low-abundant detections, let’s first plot the distribution of detections present in our dataset.

## abundance filtering
# pivot data to long format and count frequencies
detection_abundance <- count_table_length %>%
  pivot_longer(everything()) %>%
  filter(value > 0) %>%
  count(value) %>%
  uncount(weights = n)

# plot using geom_density
ggplot(detection_abundance, aes(x = value)) +
  geom_density(fill = "#377EB8", alpha = 0.3, color = "#377EB8", linewidth = 0.5, adjust = 1) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linewidth = 0.2, color = "gray90"),
    plot.title = element_text(hjust = 0.5, face = "bold"), axis.line = element_line(color = "black")) + 
  labs(title = "Density Distribution of Non-Zero Values", x = "Value", y = "Density") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  scale_x_log10(labels = trans_format("log10", math_format(10^.x)))
ggsave(filename = file.path("4.results", "detection_frequencies.png"), plot = last_plot(), bg = "white", dpi = 300)
sum(count_table_length == 1)
sum(count_table_length == 2)
_images/detection_frequencies.png

Fig. 45 : A density graph depicting frequency of detections based on read count#

The output shows us that singleton detections are quite rare in our data, appearing only 11 times. The most frequent read count for a detection was 3 (1,158 detections), followed by a read count of 2 (1,078 detections) and a read count of 4 (1,034 detections).

For this workshop, we will remove singleton detections from the dataset.

# remove singleton detections
count_table_length[count_table_length < 2] <- 0
min(count_table_length[count_table_length != 0], na.rm = TRUE)
names(count_table_length)[colSums(count_table_length) == 0]
rownames(count_table_length)[rowSums(count_table_length) == 0]

5.2 Low abundant ASVs#

Besides low-abundant detections, low-abundant ASVs are also frequently removed from the data. For pipelines employing USEARCH/VSEARCH during denoising, ASVs with a total abundance of fewer than 8 reads are removed by default by the algorithm. Other applications, such as SWARM and dada2, do not have this abundance threshold built in by default. Hence, let’s investigate the abundance distribution of ASVs in our data.

# reformat data to fit geom_density
asv_abundance <- count_table_length %>%
  mutate(value = rowSums(.)) %>%
  select(value)

# plot using geom_density
ggplot(asv_abundance, aes(x = value)) + 
  geom_density(fill = "#377EB8", alpha = 0.3, color = "#377EB8", linewidth = 0.5, adjust = 1) +
  geom_vline(aes(xintercept = 8), color = "gray40", linetype = "dashed", linewidth = 0.5) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linewidth = 0.2, color = "gray90"),
        plot.title = element_text(hjust = 0.5, face = "bold"), axis.line = element_line(color = "black")) + 
  labs(title = "Density Distribution of Non-Zero Values", x = "Value", y = "Density") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  scale_x_log10(labels = trans_format("log10", math_format(10^.x))) +
  annotate("text", x = 8, y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste("Traditional ASV cutoff\nat read abundance = 8"), color = "gray40")
sum(asv_abundance < 8)  
round(sum(asv_abundance < 8)  / nrow(asv_abundance) * 100, 2)
_images/asv_abundance.png

Fig. 46 : A density graph depicting ASV abundance in our COI dataset#

The output shows that 859 ASVs or 33.15% have an abundance lower than 8 reads. As it is a standard approach to remove these low-abundant signals, let’s do this as well for our tutorial data. Note that for your own data, you might want to explore the taxonomic ID of these signals more in depth before deciding to remove them.

# remove low abundant asvs, update the tables, and explore lost reads
count_table_high <- count_table_length[rowSums(count_table_length) > 7, ]
names(count_table_high)[colSums(count_table_high) == 0]
rownames(count_table_high)[rowSums(count_table_high) == 0]
metadata_high <- metadata_length
sequences_high <- sequences_length[row.names(count_table_high)]
taxonomy_high <- taxonomy_length[row.names(count_table_high), ]
sum(count_table_length) - sum(count_table_high)
sum(count_table_high) / sum(count_table_length) * 100

Removing the 859 low-abundant ASVs only resulted in the loss of 3,398 (0.1%) reads, and no samples dropped out during this filtering step.
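
As noted above, for your own data it can be worthwhile to inspect the taxonomic IDs of these low-abundant ASVs before discarding them. A minimal sketch, assuming the 'matching species IDs' column name used earlier:

## sketch: inspect the taxonomic IDs of low-abundant ASVs before removal
low_abundant_asvs <- rownames(count_table_length)[rowSums(count_table_length) < 8]
sort(table(taxonomy_length[low_abundant_asvs, "matching species IDs"]), decreasing = TRUE)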

6. Rarefaction#

The second-to-last data curation step we will introduce during this workshop is how to deal with differences in sequencing depth between samples and how to assess if sufficient sequencing depth was achieved for each sample. With the realisation that a higher read count could lead to a larger number of ASVs being detected, there has been much debate in the scientific community on how to deal with this issue. Some studies argue that it is essential to “level the playing field” and use the same number of sequences across all samples. This random subsampling of the data is called rarefaction. See Cameron et al., 2021 as an example. Other studies, on the other hand, argue vehemently against such an approach. For example, you can read McMurdie et al., 2014 for more on this topic. While this debate is still ongoing, I suggest running a couple of statistical tests and visualisations that can help determine the need to rarefy your data. Afterwards, you can make an informed decision and clearly state why you have, or have not, decided to process your data in a certain way.
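
For reference, if you do decide to rarefy, the subsampling itself is a single function call. A minimal sketch using the rrarefy() function of the vegan R package (not loaded above), rarefying every sample to the smallest library size:

## sketch: rarefy all samples to the smallest library size (not run in this workshop)
library(vegan)
count_table_rarefied <- rrarefy(t(count_table_high), sample = min(colSums(count_table_high)))
count_table_rarefied <- as.data.frame(t(count_table_rarefied))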

I suggest running 4 tests to determine if rarefying your data might be necessary:

  1. determine if there is a big difference in read count between samples;

  2. determine if there is a positive correlation between sequencing depth and number of observed ASVs;

  3. draw so-called rarefaction curves to determine if sufficient sequencing coverage has been achieved; and

  4. calculate curvature indices to provide a statistical test regarding the observed patterns in rarefaction curves.

6.1 Read distribution#

As a first test, we can visualise the read distribution across samples. For easy viewing, we can colour the bars and boxes based on the metadata file. As a statistical test, we can run a one-way ANOVA to determine whether there are significant differences in read count between sample locations. However, with a replication level of n = 3, even the non-parametric alternative might not be highly accurate.

## rarefaction
# read distribution
# add column sums of count_table_high to metadata_high
metadata_high$total_read_count <- colSums(count_table_high)[rownames(metadata_high)]

# plot total read count in decreasing order with transformed and untransformed y-axes
sample_colors <- c("Akakina" = "#EAE0DA", "Asan" = "#A0C3D2", "Ashiken" = "#E7C8C0", "Edateku" = "#B5C7A4",
  "Gamo" = "#D4BBDD", "Heda" = "#F2D7A7", "Imaizaki" = "#A2D9D1", "Kurasaki" = "#DFA7A7", "Sakibaru" = "#C5D0E6", 
  "Sani" = "#D9C5A0", "Taen" = "#B7D9C4", "Tatsugo" = "#E3B7D1", "Yadon" = "#A8C6D9")
plot_log <- ggplot(metadata_high, aes(x = reorder(row.names(metadata_high), -total_read_count), y = total_read_count, fill = sampleLocation)) +
  geom_col(width = 0.8) +  
  scale_fill_manual(values = sample_colors, name = "Sample Location") +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)), expand = expansion(mult = c(0, 0.1))) +
  labs(y = "Total Reads (log10)", x = NULL) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_blank(), axis.line = element_line(color = "black"), legend.position = "none")
plot_raw <- ggplot(metadata_high, aes(x = reorder(row.names(metadata_high), -total_read_count), y = total_read_count, fill = sampleLocation)) +
  geom_col(width = 0.8) +
  scale_fill_manual(values = sample_colors, guide = "none") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  labs(x = "Sample Location", y = "Total Reads") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), axis.line = element_line(color = "black"))
combined_plot <- (plot_log / plot_raw) + plot_layout(guides = "collect") +
  plot_annotation(title = "Sample Sequencing Depth", theme = theme(plot.title = element_text(hjust = 0.5, face = "bold"))) +
  theme(legend.position = "bottom", legend.box.margin = margin(t = 10))
ggsave(filename = file.path("4.results", "read_distribution.png"), plot = combined_plot, bg = "white", dpi = 300)
_images/read_distribution.png

Fig. 47 : The read distribution for samples with and without y-axis log transformation#

Besides a bar graph, a boxplot might make it easier to visualise differences in read count between sample locations.

# plot boxplots of read count per group
ggplot(metadata_high, aes(x = sampleLocation, y = total_read_count, fill = sampleLocation)) +
  geom_boxplot() +
  scale_fill_manual(values = sample_colors) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  labs(x = "Sample Location", y = "Total Reads") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), 
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), 
        axis.line = element_line(color = "black"), legend.position = "none")
_images/read_distribution_box.png

Fig. 48 : The read distribution for samples via a boxplot visualisation#

Based on this figure, we can clearly see that samples from Tatsugo have fewer reads than average, while samples from Asan and Imaizaki have a higher-than-average read count. Besides a figure, we can also print out some basic read count statistics to get an even clearer view.

## print basic stats of read distribution across samples
cat("Maximum value:", max(metadata_high$total_read_count, na.rm = TRUE), "in sample:", rownames(metadata_high)[which.max(metadata_high$total_read_count)], "\n")
cat("Minimum value:", min(metadata_high$total_read_count, na.rm = TRUE), "in sample:", rownames(metadata_high)[which.min(metadata_high$total_read_count)], "\n")
cat("Mean value:", mean(metadata_high$total_read_count, na.rm = TRUE), "\n")
cat("Standard error:", sd(metadata_high$total_read_count, na.rm = TRUE) / sqrt(sum(!is.na(metadata_high$total_read_count))), "\n")

## print basic stats of read distribution between groups
metadata_high %>%
  group_by(sampleLocation) %>%
  summarise(mean_count = mean(total_read_count, na.rm = TRUE), se_count = sd(total_read_count, na.rm = TRUE) / sqrt(n()))

Finally, we can test if there is a significant difference in read count between locations. As mentioned before, ideally, the level of replication would be higher than 3, especially for statistical analyses. However, collecting 3 replicates is a standard approach for eDNA metabarcoding surveys.

# run one-way ANOVA
metadata_high$total_read_count[is.na(metadata_high$total_read_count)] <- 0
bartlett.test(total_read_count ~ sampleLocation, data = metadata_high)
leveneTest(total_read_count ~ sampleLocation, data = metadata_high)
histogram(~ total_read_count | sampleLocation, data = metadata_high, layout = c(5,3))
model <- lm(total_read_count ~ sampleLocation, data = metadata_high)
Anova(model, type="II")
Anova(model, type="II", white.adjust=TRUE)
summary(model)
(HSD.test(model, "sampleLocation"))

Both our graph and the statistical tests show a significant difference in read count between locations. The Tukey HSD multiple comparison identified Asan, Imaizaki, and Gamo as having a significantly higher number of reads than Tatsugo.

6.2 Read count and ASV number correlation#

To visualise a potential positive correlation between total read count and ASV richness, we can plot the total read count against the total number of observed ASVs within samples. To test the significance, we could run a Spearman rank correlation or Pearson’s correlation test. A better approach, however, is to fit a GLM with a Poisson distribution and log link function. The reason for this is that a Poisson distribution is more appropriate for count data, which both of our variables are, while a log link function ensures predictions are non-negative.

# correlation
# add ASV count to metadata_high
metadata_high$total_observations <- colSums(count_table_high > 0)[rownames(metadata_high)]

# fit the GLM model
glm_model <- glm(total_observations ~ total_read_count, 
                 family = poisson(link = "log"), data = metadata_high)
summary(glm_model)

# make predictions
metadata_high$predicted <- predict(glm_model, type = "response")

# plot results
metadata_high$sampleLocation <- as.factor(metadata_high$sampleLocation)
p <- ggplot(metadata_high, aes(x = total_read_count, y = total_observations, color = sampleLocation)) +
  geom_point(size = 8, alpha = 0.7) +
  scale_color_manual(values = sample_colors) +
  geom_smooth(aes(y = predicted), method = "glm", method.args = list(family = "poisson"), 
              se = TRUE, linetype = 'dotdash', color = "black", linewidth = 0.3) +
  labs(color = "Location", x = "Total read count", 
       y = "Number of observed ASVs") +
  theme_bw() +
  theme(legend.position = "bottom") 
p
p1 <- ggMarginal(p, type = 'densigram')
p1
ggsave(filename = file.path("4.results", "read_asv_correlation.png"), plot = p1, bg = "white", dpi = 300)
_images/read_asv_correlation.png

Fig. 49 : The correlation between read count and number of detected ASVs#

From both the GLM and the figure, we can see that there is a significant positive correlation between total read count and number of detected ASVs.

From the plot, we can determine that this significant correlation is likely driven by the low read count samples of Tatsugo, as well as a single outlier sample in Gamo.

From the GLM output, we can determine a highly significant correlation (p < 0.001) between total read count and the number of observed ASVs, suggesting a strong positive relationship between the two. More practically, we can infer that for every 1-unit increase in total read count, the log-count of ASVs increases by 7.564 x 10^-6, or, equivalently, for every additional 10,000 reads we expect the ASV count to increase by a factor of ~1.08. Furthermore, the large drop between null deviance and residual deviance suggests the predictor (total read count) improves the model. One potential issue with the GLM, however, is that the residual deviance is much larger than the degrees of freedom (1430.3 >> 37). This could mean that the data are overdispersed and a negative binomial GLM might be more appropriate, as this model accounts for overdispersion, which Poisson regression does not. Another option could be to remove the outliers from the data and see if this fixes the issue.
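
The multiplicative interpretation above comes from exponentiating the slope; a quick check in R (the coefficient value is taken from the model summary):

# back-transform the Poisson slope: multiplicative change in ASV count per 10,000 extra reads
exp(coef(glm_model)["total_read_count"] * 10000)
# equivalently, using the reported estimate: exp(7.564e-6 * 10000) is roughly 1.08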

Optimising the GLM

Try re-running the GLM with the outliers removed, as well as running a negative binomial GLM, to see if this gives us better results.
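
A minimal sketch of both suggestions, assuming the MASS package is installed; here the “outliers” are illustrated by simply dropping the Tatsugo and Gamo samples, the same locations we remove later on:

## sketch: Poisson GLM without the outlier locations, plus a negative binomial GLM
library(MASS)  # note: MASS masks dplyr::select(); call dplyr::select() explicitly afterwards

# re-fit the Poisson GLM with Tatsugo and Gamo removed
metadata_sub <- metadata_high[!metadata_high$sampleLocation %in% c("Tatsugo", "Gamo"), ]
glm_sub <- glm(total_observations ~ total_read_count, family = poisson(link = "log"), data = metadata_sub)
summary(glm_sub)

# negative binomial GLM on the full data to account for overdispersion
glm_nb <- glm.nb(total_observations ~ total_read_count, data = metadata_high)
summary(glm_nb)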

6.3 Rarefaction curves#

As the second-to-last check to determine if we might want to rarefy our data, we can draw rarefaction curves and determine if they plateau. These graphs can also be used to determine if sufficient sequencing depth was achieved within our experiment. Such plots, however, need to be generated from the unfiltered count table, as data processing has a major effect on rarefaction curves and skews the results in favour of sufficient sequencing depth. Rarefaction curve plots are frequently included in the supplemental files of metabarcoding publications and referenced in the manuscript. We can use the amp_rarecurve() function in the ampvis2 R package to draw such plots.

# rarefaction plots
# read in the data
count_raw <- read.table(file.path("4.results", "count_table.txt"), header = TRUE, sep = '\t', check.names = FALSE, comment.char = '')
colnames(count_raw)[1] <- "V1"
taxonomy <- as.data.frame(do.call(rbind, processed_data), stringsAsFactors = FALSE)
colnames(taxonomy) <- taxonomy[1,]
taxonomy <- taxonomy[-1,]
colnames(taxonomy)[1] <- "V1"
meta_raw <- read.table(file.path("4.results", "metadata.txt"), header = TRUE, sep = '\t', check.names = FALSE, comment.char = '')

# prepare data for analysis and plotting
count_raw <- merge(count_raw, taxonomy[, c("V1", "matching species IDs")], by = "V1", all.x = TRUE)
colnames(count_raw)[1] <- "ASV"
colnames(count_raw)[ncol(count_raw)] <- "Species"
meta_raw$sampleLocation <- factor(meta_raw$sampleLocation)

# load dataframes into ampvis2 format
ampvis_df <- amp_load(count_raw, meta_raw)

# generate rarefaction curves
rarefaction_curves <- amp_rarecurve(ampvis_df, stepsize = 100, color_by = "sampleLocation") +
  ylab("Number of observed ASVs") +
  theme_classic() +
  scale_color_manual(values = sample_colors) +
  theme(legend.position = "bottom") +
  facet_wrap(~ sampleLocation)
rarefaction_curves
ggsave(filename = file.path("4.results", "rarefaction_curves.png"), plot = rarefaction_curves, bg = "white", dpi = 300)
_images/rarefaction_curves.png

Fig. 50 : Rarefaction curves drawn with the amp_rarecurve() function in the ampvis2 R package.#

From the plot, we can see that most rarefaction curves plateau, indicating that sufficient sequencing depth was achieved for this experiment and that additional sequencing might not result in a significantly higher number of detected ASVs.

6.4 Curvature indices#

The final check I recommend running helps with interpreting the rarefaction curves. Depending on the dimensions of the rarefaction plot, curves might appear to flatten while in reality they do not. Calculating curvature indices, i.e., the area under the curve divided by the total plot area, mitigates this issue and provides a value on which we can place a threshold to state whether a rarefaction curve has flattened or not. We can use the DivSubsamples() and Curvature() functions in the DivE R package to calculate the curvature indices. For more information, I recommend reading the associated paper by Laydon et al., 2014.

# curvature indices
# read in the data
count_raw <- read.table(file.path("4.results", "count_table.txt"), header = TRUE, sep = '\t', check.names = FALSE, comment.char = '')

# create an empty dataframe to place the curvature indices in
curvature_df <- data.frame("sampleID" = numeric(0), "curvature_index" = numeric(0))

# iterate over columns and create new data frames
for (i in 2:ncol(count_raw)) {
  col1_name <- names(count_raw)[1]
  col2_name <- names(count_raw)[i]
  new_df <- data.frame(count_raw[, 1], count_raw[, i])
  names(new_df) <- c(col1_name, col2_name)
  ## function to generate the rarefaction data from a given sample
  dss <- DivSubsamples(new_df, nrf=100, minrarefac=1, NResamples=10)
  ## calculate curvature index
  curvature_value <- Curvature(dss)
  # add info to empty dataframe
  new_row <- data.frame("sampleID" = col2_name, "curvature_index" = curvature_value)
  cat("item:", i-1, "/", ncol(count_raw) - 1, "; sample:", col2_name, "; curvature index: ", curvature_value, "\n")
  curvature_df <- rbind(curvature_df, new_row)
}

Based on the output, we can see that all values are in the high 60% range or above, indicating that the rarefaction curves indeed plateau and that sufficient sequencing depth was achieved.

We can conclude from the 4 tests that there is a significant difference in total reads between sampling locations, as well as a highly significant positive correlation between total read count and number of detected ASVs. The rarefaction curves, however, indicate sufficient sequencing was achieved and an increased total read count might not be associated with an increase in number of ASVs being detected. The accuracy of rarefaction curves, however, has been questioned, especially for metabarcoding data. For our tutorial data, there could be a case made to either:

  1. rarefy your data;

  2. resequence the 4 outlier samples; or

  3. remove the outliers from the analysis.

If you decide to move ahead without further data curation or new data generation, you would need to report the significant correlation, as well as consider whether the results observed during the statistical analysis might have been driven by this significant difference in read count between locations (e.g., if Tatsugo samples clustered separately in ordination plots when investigating β-diversity patterns).

Lab notes

For your own data, it would be useful to go back to your field and lab notes to determine if there is an explanation for the low read count for a particular sample, e.g., did certain samples fail to properly amplify, etc., as this will aid in the decision-making process for data curation.

For this tutorial and for simplicity, we will remove all samples collected at Tatsugo. Furthermore, we will remove all samples collected at Gamo, as this will keep the replication the same across all sites (compared to only removing the one outlier sample).

# discard Tatsugo and Gamo samples
metadata_final <- metadata_high[!metadata_high$sampleLocation %in% c('Tatsugo', 'Gamo'), ]
count_table_final <- count_table_high[, row.names(metadata_final)]
names(count_table_final)[colSums(count_table_final) == 0]
rownames(count_table_final)[rowSums(count_table_final) == 0]
count_table_final <- count_table_final[rowSums(count_table_final) != 0, ]
sequences_final <- sequences_high[row.names(count_table_final)]
taxonomy_final <- taxonomy_high[row.names(count_table_final), ]
sum(count_table_high) - sum(count_table_final)
sum(count_table_final) / sum(count_table_high) * 100

7. Summary#

At this point, we have conducted 5 data curation steps:

  1. The removal of spurious artefacts using tombRaider.

  2. The removal of contaminating signals.

  3. The removal of ASVs outside the expected amplicon length range.

  4. The removal of singleton detections and low-abundant ASVs.

  5. An assessment of sequencing coverage per sample and of whether rarefying the data might be necessary, based on which the Tatsugo and Gamo samples were removed.

After these data curation steps, we now have a clean data set ready for statistical analysis!

All we have to do now is to export the updated files.

## export updated files
write.table(count_table_final, file.path("4.results", "count_table_filtered.txt"), append = FALSE, sep = '\t', dec = '.', row.names = TRUE, col.names = NA)
write.table(metadata_final, file.path("4.results", "metadata_filtered.txt"), append = FALSE, sep = '\t', dec = '.', row.names = TRUE, col.names = NA)
write.table(taxonomy_final, file.path("4.results", "taxonomy_filtered.txt"), append = FALSE, sep = '\t', dec = '.', row.names = TRUE, col.names = NA)
writeXStringSet(sequences_final, file = file.path("4.results", "asvs_filtered.fasta"), width = 350)