Setting up the environment#
1. Folder structure#
First, we will have a look at the folder structure we will use during the workshop. For any project, it is important to organise your various files into subfolders. Through the course you will be navigating between folders to run your analyses. It may seem confusing or tedious at first, especially as there are only a few files we will be dealing with during this workshop. However, remember that for many projects you can easily generate hundreds of files, so it is best to start with good practices from the beginning.
We will be using the following folder structure for this experiment:
└── sequenceData
├── 0-metadata
├── 1-scripts
├── 2-raw
├── 3-fastqc
├── 4-demux
├── 5-filter
├── 6-quality
├── 7-refdb
└── 8-final
To set up this folder structure, let us navigate to the starting directory of this workshop. We can do this using the cd
command, which stands for change directory.
Warning
The code below assumes you are in the home folder on the HKU supercomputer. Make sure to alter the path that specifies the location of the starting folder on your system if this is not your starting point.
cd ednaw01/
It is always a good idea to check if the code executed as expected. In this case, we can verify we are in the correct working directory by using the pwd
command, which stands for print working directory.
pwd
Output
/home/ednaw01/ednaw01
We can also list all the documents and subfolders in our working directory using the ls
command. Additionally, we will specify the -ltr
parameters to the ls
command. The l
parameter is providing a list with long format that includes the permissions, which will come in handy when we are creating our scripts (more on that later). The t
parameter is sorting the output by time and date, while the r
parameter is printing the list in reverse order, i.e, the newest files will be printed as the latest in the list.
ls -ltr
Output
-rw-r--r-- 1 ednaw01 others 12097 Oct 18 10:31 metadata-COI-selected-updated.txt
-rw-r--r-- 1 ednaw01 others 112435863 Oct 18 10:36 newDBcrabsCOIsintax.fasta
-rw-r--r-- 1 ednaw01 others 568074346 Oct 18 10:40 HKeDNAworkshop2023.zip
-rw-r--r-- 1 ednaw01 others 169 Oct 18 11:36 module_load
-rw-r--r-- 1 ednaw01 others 776 Oct 18 11:37 example-script.sh
The output of the ls
command provides us with a list of the documents in our starting folder, which are the documents we need to complete the tutorial. More on this in the next section.
Before we go more in detail about the starting files, let us first set up the abovementioned folder structure. Within the command line interface (CLI), we can use the mkdir
command to set up all the folders and subfolders in one go. Additionally, we will use the -p
parameter to automatically make any necessary parent directories that might not yet exist.
mkdir -p sequenceData/0-metadata sequenceData/1-scripts sequenceData/2-raw sequenceData/3-fastqc sequenceData/4-demux sequenceData/5-filter sequenceData/6-quality sequenceData/7-refdb sequenceData/8-final
When we now use the ls -ltr
command again, we see that the directory sequenceData has been added to the output list.
ls -ltr
Output
-rw-r--r-- 1 ednaw01 others 776 Oct 21 10:52 example-script.sh
-rw-r--r-- 1 ednaw01 others 568074346 Oct 21 10:53 HKeDNAworkshop2023.zip
-rw-r--r-- 1 ednaw01 others 12097 Oct 21 10:53 metadata-COI-selected-updated.txt
-rw-r--r-- 1 ednaw01 others 169 Oct 21 10:53 module_load
-rw-r--r-- 1 ednaw01 others 112435863 Oct 21 10:53 newDBcrabsCOIsintax.fasta
drwxr-xr-x 11 ednaw01 others 11 Oct 21 10:53 sequenceData
Note that the sequenceData
directory has appeared last in the printed list due to the r
parameter we passed to the ls
command, as it was the most recently created. When we’ll be going through our bioinformatic pipeline and create multiple files at various stages, the r
parameter will make it easy to see which files were recently created, as they will appear at the bottom and will avoid us having to scroll through the list of files.
Also note the d
at the beginning of the line of sequenceData
. These are the permissions that the l
parameter that we passed to the ls
command has given us access to. Several other letters are shown for sequenceData
and other files as well. Below is a list of what they mean:
d
: directory - indicate that this is in fact a folder or directory, rather than a file.r
: read - specifies that we have access to read the file.w
: write - specifies that we can write to the file.x
: execute - specifies that we have permission to execute the file.@
: extended - a novel symbol for MacOS indicating that the file has extended attributes (MacOS specific).
2. Starting files#
For the bioinformatic and statistical analysis, we need several starting files, including a zipped file containing all the sequencing data (HKeDNAworkshop2023.zip), a sample metadata file for the statistical analysis (metadata-COI-selected-updated.txt), our reference database for the taxonomy assignment (newDBcrabsCOIintax.fasta), a template script file where we will paste our code to run on the supercomputer (example-script.sh), and a script file that automatically loads all the necessary software on the supercomputer. The last two files are only necessary when working on the HKU supercomputer.
Starting files
HKeDNAworkshop2023.zip
metadata-COI-selected-updated.txt
newDBcrabsCOIsintax.fasta
module_load
With the folder structure set up, let’s move the starting files to their respective subfolders using the cp
command.
cp HKeDNAworkshop2023.zip sequenceData/2-raw
cp metadata-COI-selected-updated.txt sequenceData/0-metadata
cp newDBcrabsCOIsintax.fasta sequenceData/7-refdb
cp example-script.sh sequenceData/1-scripts
cp module_load sequenceData/1-scripts
Again, we can check if the command executed as expected by listing the files within the subfolder 2-raw using the ls -ltr
command. Note that we do not need to first use the cd
command to move to the directory for which we would like to list all of the files, but that we can specify which directory we would like to list by referring to it after the ls -ltr
command.
ls -ltr sequenceData/2-raw
Output
-rw-r--r-- 1 ednaw01 others 568074346 Oct 21 10:59 HKeDNAworkshop2023.zip
Exercise 1
Moving files around using the Terminal can also be accomplished through the mv
command. What is the difference between mv
and cp
? Why is it better to use cp
?
Answer 1
The mv
command will move the files from one folder to another. Hence, the files will be removed from the initial directory. If something goes wrong during the process or during the bioinformatic pipeline, such actions cannot be reverted back and data will be lost.
With the current setup, if anything will go wrong during our pipeline, we can simply remove everything within the sequenceData directory using the rm -r sequenceData*
command and start over, since we still have all the information and files within our initial starting folder.
Now that we have everything set up, we can get started with the bioinformatic processing of our data!