Usage
Basic gist:
nextflow run ksumngs/v-met -profile <docker,podman,singularity> --platform <illumina,nanopore> --kraken2_db <kraken2_db> --blast_db <blast_db>
Input Preparation
Using a Directory as Input
If the --input parameter is not passed, v-met assumes reads files are located
in the current directory. You can also pass a directory that contains reads
files to --input. v-met will not look in subdirectories of --input. When
running v-met on a directory, reads files must have one of the following
extensions:
.fastq
.fq
.fastq.gz
.fq.gz
There must be only one file per sample for single-end and interleaved
paired-end reads, and one pair of files (two files) per sample for regular
paired-end reads. Regular paired-end files must have either the characters _1
and _2 or _R1 and _R2 in the filename to be identified as a pair. The sample
name is taken as all the characters before the first underscore in the file
name.
For example, a folder with the following files
.
├── pig-serum_S4_L001_R1_001.fastq.gz
├── pig-serum_S4_L001_R2_001.fastq.gz
├── pig-feces_S5_L001_R1_001.fastq.gz
├── pig-feces_S5_L001_R2_001.fastq.gz
├── mosquito1_S6_L001_R1_001.fastq.gz
├── mosquito1_S6_L001_R2_001.fastq.gz
├── mosquito2_S7_L001_R1_001.fastq.gz
└── mosquito2_S7_L001_R2_001.fastq.gz
will produce a sample list of
pig-serum
pig-feces
mosquito1
mosquito2
Paired-endedness is determined by the --paired flag. By default, --paired is
true when --platform is illumina and false when it is nanopore, but this can
be overridden. To use interleaved paired-end reads, use the --interleaved
flag.
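Putting the directory mode together, a typical run against a folder of Illumina paired-end reads like the one above might look something like the sketch below (the reads directory and database paths are placeholders):

# Paths are placeholders: point them at your reads directory and databases
nextflow run ksumngs/v-met \
  -profile singularity \
  --platform illumina \
  --input /path/to/reads \
  --kraken2_db /path/to/kraken2_db \
  --blast_db /path/to/blast_db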
This file naming scheme is very common for Illumina output. The default Oxford Nanopore file structure, however, breaks it in several ways, so it is often better to use a samplesheet for Nanopore reads.
Using a Samplesheet as Input
A samplesheet allows you to pass in multiple reads files as a single sample, which can be useful for, e.g., Oxford Nanopore reads. When using a samplesheet, simply pass the path to the sheet to --input.
Samplesheets are tab-delimited files. A header is optional, and must be
delineated by beginning with the pound (#) symbol. The first column contains
sample names, and each subsequent column contains a path to a reads file. All
sample names will be cleaned of any shell metacharacters and truncated at the
first underscore. Paths are resolved relative to the current directory. Path
names can contain wildcard characters, and will resolve as glob patterns
identical to how they would in Nextflow's Channel.fromPath factory. For
paired-end reads, forward reads should be specified in the even-numbered
columns (2, 4, 6), and reverse reads should be specified in the odd-numbered
columns (3, 5, 7).
Single-end example
| #samplename | | |
| --- | --- | --- |
| pig-serum | | |
| pig-feces | | |
| mosquito1 | | |
| mosquito2 | | |
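As a filled-in illustration, a single-end samplesheet for an Oxford Nanopore run might look like the following, using glob patterns to pick up multiple files per sample. The directory and file names here are hypothetical, and the columns are shown with extra spacing; the real separator must be a single tab. The header row is omitted since it is optional.

pig-serum    fastq_pass/barcode01/*.fastq.gz
pig-feces    fastq_pass/barcode02/*.fastq.gz
mosquito1    fastq_pass/barcode03/*.fastq.gz
mosquito2    fastq_pass/barcode04/*.fastq.gz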
Paired-end example
| #Sample | Forward1 | Reverse1 | Forward2 | Reverse2 |
| --- | --- | --- | --- | --- |
| pig-serum | | | | |
| pig-feces | | | | |
| mosquito1 | | | | |
| mosquito2 | | | | |
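A filled-in paired-end sheet for the Illumina files from the directory example above might look like this, with one pair of columns per pair of files. Columns are again shown with extra spacing for readability; the actual separator must be a single tab.

#Sample      Forward1                             Reverse1
pig-serum    pig-serum_S4_L001_R1_001.fastq.gz    pig-serum_S4_L001_R2_001.fastq.gz
pig-feces    pig-feces_S5_L001_R1_001.fastq.gz    pig-feces_S5_L001_R2_001.fastq.gz
mosquito1    mosquito1_S6_L001_R1_001.fastq.gz    mosquito1_S6_L001_R2_001.fastq.gz
mosquito2    mosquito2_S7_L001_R1_001.fastq.gz    mosquito2_S7_L001_R2_001.fastq.gz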
Once the samplesheet is constructed, pass it on the command line as:
nextflow run ksumngs/v-met --input /path/to/sheet.tsv ...
Profile Selection
Profiles allow for unique combinations of settings within a Nextflow pipeline.
For the purposes of v-met, they reconfigure the pipeline to run on a
particular container engine. Whichever engine you choose must be installed and
available (e.g. via module load) on each node that pipeline processes run on.
The available options are:
- (none)
Don’t use a container engine. Requires that every tool and the right version of every tool be installed on each machine the pipeline is running on. Don’t use this one.
- docker
Use Docker as the container engine. Note that Docker often requires root or nearly-root permissions that usually aren’t available on HPCs, and has a weird license that might forbid its use in commercial settings. Works well on local machines, though.
- podman
Use Podman instead of Docker. Podman is similar enough to Docker that they can often be used interchangeably, but doesn't require root permissions and has a free license. Some files might not be accessible via container on RHEL-based distros thanks to their particular SELinux implementation.
- singularity
Recommended. Use the Singularity container engine. This engine was built with HPC use in mind and doesn't require any special permissions to run. Singularity's downfall is that it exposes your home directory to the container, resulting in rare but difficult-to-explain bugs when files conflict. Every effort has been made to minimize the effects of Singularity's file mounting in this pipeline.
- INSTITUTE
If your computing center is listed in the nf-core configs, then you can pass that name to -profile to have the container engine and resource limits configured automatically.
- test
Download a test dataset from nf-core and run a sample to ensure your machine configuration is correct. Must be used with one of the container engine profiles.
- test_nanopore
Download a MinION test dataset and run it just like test.
- gh
Used to limit resource usage during continuous integration on GitHub Actions. You should never have to use this profile for real-life analysis.
To select a profile, you must pass the desired profile name to Nextflow's
-profile flag. Note that this is a Nextflow flag, not a pipeline flag, so it
takes a single dash (-profile), not a double dash (--profile).
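For example, to verify your setup before a real analysis, you could combine the test profile with a container engine profile; Nextflow accepts a comma-separated list of profile names:

nextflow run ksumngs/v-met -profile test,singularity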
Mandatory Parameters
See the page on parameters for the complete lowdown on parameters.
The pipeline is pretty much set up to run itself. So long as you have your
input reads formatted correctly, it doesn't need much input from you. These
are the bare minimum parameters that you must provide on the command line for
the pipeline to complete. Note that there has been mixed success with placing
these parameters in a nextflow.config file, so keeping them on the command
line is best.
- --kraken2_db
The path to a Kraken2 database. See --kraken2_db.
- --platform
Must be set to illumina or nanopore, depending on the type of reads you are analyzing. See --platform.
- --blast_db
The path to an NCBI nt BLAST database. See --blast_db.
- -profile
So, this isn't really a parameter, but the container engine needs to be set using Nextflow's -profile flag. See Profile Selection.
Setting up for HPC Job Schedulers
v-met comes preconfigured for local use only. Yes, that’s about as ridiculous as it sounds. What’s even more ridiculous is trying to make a configuration that can easily be adapted to multiple HPCs and job-scheduler frameworks. There is a compromise, however.
Process Labels
Rather than provide hard-coded configurations that will certainly break, there are several Nextflow 'labels' that can be used for assigning processes to specific node queues or partitions. The labels are:
- process_low
For processes with low resource usage that take a short time
- process_medium
For processes with moderate resource usage that take a moderate amount of time
- process_high
For processes with high resource usage that take a moderately high amount of time
- process_long
For processes that take a long time
- process_high_memory
For processes that use a lot of memory
- run_local
For processes that have to be run on the login node. This label was created specially for processes that download resources from the internet on HPCs where compute nodes do not have internet access.
Using a custom nextflow.config and these process labels, you can construct a
setup for your own HPC needs.
Example
As an example, here is a guide on how to set up a configuration for the USDA's SCINet Ceres cluster, using the publicly available info on their website.
First, we see that Ceres uses SLURM and Singularity. Excellent. Let’s set Nextflow up to use SLURM:
process {
    executor = 'slurm'
}
Some SLURM systems require an account to be passed with every job submission. Let’s add ours just to be safe.
process {
    executor = 'slurm'
    clusterOptions = '--account=ksumngs'
}
For this example, I don't think we'll need to do anything special with the low, medium, and high processes, but let's make sure that the long and high-memory processes get submitted to partitions that can handle them. While we're at it, we'll also load the Singularity module for every process.
process {
    executor = 'slurm'
    clusterOptions = '--account=ksumngs'
    module = 'singularity'

    // Long-running processes go to the 'long' partition
    withLabel: process_long {
        queue = 'long'
    }

    // Memory-hungry processes go to the large-memory 'mem' partition
    withLabel: process_high_memory {
        queue = 'mem'
    }
}
Now, we can place this file at $HOME/.nextflow/config, and these settings will
be applied every time we run the pipeline.
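If you want to confirm that Nextflow is picking these settings up, its config command prints the resolved configuration. A quick sanity check might look like the following, assuming the pipeline has already been pulled and that config is given the same -profile you plan to run with:

nextflow config ksumngs/v-met -profile singularity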