Title: | Quality Control of Sequencing Data |
---|---|
Description: | 'FASTQC' is the most widely used tool for evaluating the quality of high throughput sequencing data. It produces, for each sample, an html report and a compressed file containing the raw data. If you have hundreds of samples, you are not going to open up each 'HTML' page. You need some way of looking at these data in aggregate. 'fastqcr' Provides helper functions to easily parse, aggregate and analyze 'FastQC' reports for large numbers of samples. It provides a convenient solution for building a 'Multi-QC' report, as well as, a 'one-sample' report with result interpretations. |
Authors: | Alboukadel Kassambara [aut, cre] |
Maintainer: | Alboukadel Kassambara <[email protected]> |
License: | GPL-2 |
Version: | 0.1.3.999 |
Built: | 2024-11-22 04:52:16 UTC |
Source: | https://github.com/kassambara/fastqcr |
Run FastQC Tool
fastqc( fq.dir = getwd(), qc.dir = NULL, threads = 4, fastqc.path = "~/bin/FastQC/fastqc" )
fastqc( fq.dir = getwd(), qc.dir = NULL, threads = 4, fastqc.path = "~/bin/FastQC/fastqc" )
fq.dir |
path to the directory containing fastq files. Default is the current working directory. |
qc.dir |
path to the FastQC result directory. If NULL, a directory named fastqc_results is created in the current working directory. |
threads |
the number of threads to be used. Default is 4. |
fastqc.path |
path to fastqc program |
Create a directory containing the reports
## Not run: # Run FastQC: generates a QC directory fastqc(fq.dir) ## End(Not run)
## Not run: # Run FastQC: generates a QC directory fastqc(fq.dir) ## End(Not run)
Install the FastQC Tool. To be used only on Unix system.
fastqc_install(url, dest.dir = "~/bin")
fastqc_install(url, dest.dir = "~/bin")
url |
url to download the latest version. If missing, the function will try to install the latest version from https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc. |
dest.dir |
destination directory to install the tool. |
Aggregate multiple FastQC reports into a data frame.
qc_aggregate(qc.dir = ".", progressbar = TRUE) ## S3 method for class 'qc_aggregate' summary(object, ...) qc_stats(object)
qc_aggregate(qc.dir = ".", progressbar = TRUE) ## S3 method for class 'qc_aggregate' summary(object, ...) qc_stats(object)
qc.dir |
path to the FastQC result directory to scan. |
progressbar |
logical value. If TRUE, shows a progress bar. |
object |
an object of class qc_aggregate. |
... |
other arguments. |
qc_aggregate() returns an object of class qc_aggregate which is a (tibble) data frame with the following column names:
sample: sample names
module: fastqc modules
status: fastqc module status for each sample
tot.seq: total sequences (i.e.: the number of reads)
seq.length: sequence length
pct.gc: % of GC content
pct.dup: % of duplicate reads
summary: Generates a summary of qc_aggregate. Returns a data frame with the following columns:
module: fastqc modules
nb_samples: the number of samples tested
nb_pass, nb_fail, nb_warn: the number of samples that passed, failed and warned, respectively.
failed, warned: the name of samples that failed and warned, respectively.
qc_stats: returns a data frame containing general statistics of fastqc reports. columns are: sample, pct.dup, pct.gc, tot.seq and seq.length.
qc_aggregate()
: Aggregate FastQC Reports for Multiple Samples
qc_stats()
: Creates general statistics of fastqc reports.
# Demo QC dir qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.dir # List of files in the directory list.files(qc.dir) # Aggregate the report qc <- qc_aggregate(qc.dir, progressbar = FALSE) qc # Generates a summary of qc_aggregate summary(qc) # General statistics of fastqc reports. qc_stats(qc)
# Demo QC dir qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.dir # List of files in the directory list.files(qc.dir) # Aggregate the report qc <- qc_aggregate(qc.dir, progressbar = FALSE) qc # Generates a summary of qc_aggregate summary(qc) # General statistics of fastqc reports. qc_stats(qc)
Inspect problems in aggregated FastQC reports.
qc_fails(object, element = c("sample", "module"), compact = TRUE) qc_warns(object, element = c("sample", "module"), compact = TRUE) qc_problems( object, element = c("sample", "module"), name = NULL, status = c("FAIL", "WARN"), compact = TRUE )
qc_fails(object, element = c("sample", "module"), compact = TRUE) qc_warns(object, element = c("sample", "module"), compact = TRUE) qc_problems( object, element = c("sample", "module"), name = NULL, status = c("FAIL", "WARN"), compact = TRUE )
object |
an object of class qc_aggregate. |
element |
character vector specifying which element to check for inspecting problems. Allowed values are one of c("sample", "module"). Default is "sample".
|
compact |
logical value. If TRUE, returns a compact output format; otherwise, returns a stretched format. |
name |
character vector containing the names of modules and/or samples of interest. See qc_read for valid module names. If name specified, a stretched output format is returned by default unless you explicitly indicate compact = TRUE. |
status |
character vector specifying the module status. Allowed values includes one or the combination of c("FAIL", "WARN"). If status = "FAIL", only modules with failed status are returned. |
qc_problems(), qc_fails(), qc_warns(): returns a tibble (data frame) containing samples that had one or more modules with failure or warning. The format and the interpretation of the results depend on the argument 'element', which value is one of c("sample", "module").
If element = "sample" (default), results are samples with failed and/or warned modules. The results contain the following columns: sample (sample names), nb_problems (the number of modules with problems), module (the name of modules with problems).
If element = "module", results are modules that failed and/or warned in the most samples. The results contain the following columns: module (the name of module with problems), nb_problems (the number of samples with problems), sample (the name of samples with problems)
qc_fails()
: Displays which samples had one or more failed modules. Use
qc_fails(qc, "module") to see which modules failed in the most samples.
qc_warns()
: Displays which samples had one or more warned modules. Use
qc_warns(qc, "module") to see which modules warned in the most samples.
qc_problems()
: Union of qc_fails()
and qc_warns()
.
Display which samples or modules that failed or warned.
# Demo QC dir qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.dir # List of files in the directory list.files(qc.dir) # Aggregate the report qc <- qc_aggregate(qc.dir, progressbar = FALSE) # Display samples with failed modules qc_fails(qc) qc_fails(qc, compact = FALSE) # Display samples with warned modules qc_warns(qc) # Module failed in the most samples qc_fails(qc, "module") qc_fails(qc, "module", compact = FALSE) # Specify a module of interest qc_problems(qc, "module", name = "Per sequence GC content")
# Demo QC dir qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.dir # List of files in the directory list.files(qc.dir) # Aggregate the report qc <- qc_aggregate(qc.dir, progressbar = FALSE) # Display samples with failed modules qc_fails(qc) qc_fails(qc, compact = FALSE) # Display samples with warned modules qc_warns(qc) # Module failed in the most samples qc_fails(qc, "module") qc_fails(qc, "module", compact = FALSE) # Specify a module of interest qc_problems(qc, "module", name = "Per sequence GC content")
Plot FastQC data
qc_plot(qc, modules = "all") ## S3 method for class 'qctable' print(x, ...)
qc_plot(qc, modules = "all") ## S3 method for class 'qctable' print(x, ...)
qc |
An object of class qc_read or a path to the sample zipped fastqc result file. |
modules |
Character vector containing the names of fastqc modules for which you want to import the data. Default is all. Allowed values include one or the combination of:
Partial match of module names allowed. For example, you can use modules = "GC content", instead of the full names modules = "Per sequence GC content". |
x |
an object of class qctable. |
... |
other arguments. |
Returns a list of ggplots containing the plot for specified modules..
# Demo file qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc.file # Read all modules qc <- qc_read(qc.file) # Plot per sequence GC content qc_plot(qc, "Per sequence GC content") # Per base sequence quality qc_plot(qc, "Per base sequence quality") # Per sequence quality scores qc_plot(qc, "Per sequence quality scores") # Per base sequence content qc_plot(qc, "Per base sequence content") # Sequence duplication levels qc_plot(qc, "Sequence duplication levels")
# Demo file qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc.file # Read all modules qc <- qc_read(qc.file) # Plot per sequence GC content qc_plot(qc, "Per sequence GC content") # Per base sequence quality qc_plot(qc, "Per base sequence quality") # Per sequence quality scores qc_plot(qc, "Per sequence quality scores") # Per base sequence content qc_plot(qc, "Per base sequence content") # Sequence duplication levels qc_plot(qc, "Sequence duplication levels")
Plot FastQC data of multiple samples
qc_plot_collection(qc, modules = "all")
qc_plot_collection(qc, modules = "all")
qc |
An object of class qc_read_collection or a path to the sample zipped fastqc result files. |
modules |
Character vector containing the names of fastqc modules for which you want to import the data. Default is all. Allowed values include one or the combination of:
Partial match of module names allowed. For example, you can use modules = "GC content", instead of the full names modules = "Per sequence GC content". |
Returns a list of ggplots containing the plot for specified modules..
Mahmoud Ahmed, [email protected]
qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.files <- list.files(qc.dir, full.names = TRUE)[1:2] nb_samples <- length(qc.files) # read specific modules in all files # To read all modules, specify: modules = "all" modules <- c( "Per sequence GC content", "Per base sequence quality", "Per sequence quality scores" ) qc <- qc_read_collection(qc.files, sample_names = paste('S', 1:nb_samples, sep = ''), modules = modules) # Plot per sequence GC content qc_plot_collection(qc, "Per sequence GC content") # Per base sequence quality qc_plot_collection(qc, "Per base sequence quality") # Per sequence quality scores qc_plot_collection(qc, "Per sequence quality scores")
qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.files <- list.files(qc.dir, full.names = TRUE)[1:2] nb_samples <- length(qc.files) # read specific modules in all files # To read all modules, specify: modules = "all" modules <- c( "Per sequence GC content", "Per base sequence quality", "Per sequence quality scores" ) qc <- qc_read_collection(qc.files, sample_names = paste('S', 1:nb_samples, sep = ''), modules = modules) # Plot per sequence GC content qc_plot_collection(qc, "Per sequence GC content") # Per base sequence quality qc_plot_collection(qc, "Per base sequence quality") # Per sequence quality scores qc_plot_collection(qc, "Per sequence quality scores")
Read FastQC data into R.
qc_read(file, modules = "all", verbose = TRUE)
qc_read(file, modules = "all", verbose = TRUE)
file |
Path to the file to be imported. Can be the path to either :
|
modules |
Character vector containing the names of FastQC modules for which you want to import/inspect the data. Default is all. Allowed values include one or the combination of:
Partial match of module names allowed. For example, you can use modules = "GC content", instead of the full names modules = "Per sequence GC content". |
verbose |
logical value. If TRUE, print filename when reading. |
Returns a list of tibbles containing the data for specified modules.
# Demo file qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc.file # Read all modules qc_read(qc.file) # Read a specified module qc_read(qc.file,"Per base sequence quality")
# Demo file qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc.file # Read all modules qc_read(qc.file) # Read a specified module qc_read(qc.file,"Per base sequence quality")
A wrapper function around qc_read to read multiple FastQC data files at once.
qc_read_collection(files, sample_names, modules = "all", verbose = TRUE)
qc_read_collection(files, sample_names, modules = "all", verbose = TRUE)
files |
A |
sample_names |
A |
modules |
Character vector containing the names of FastQC modules for which you want to import/inspect the data. Default is all. Allowed values include one or the combination of:
Partial match of module names allowed. For example, you can use modules = "GC content", instead of the full names modules = "Per sequence GC content". |
verbose |
logical value. If TRUE, print filename when reading. |
A list
of tibbles
containing the data of specified modules form each file.
Mahmoud Ahmed, [email protected]
# extract paths to the demo files qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.files <- list.files(qc.dir, full.names = TRUE)[1:2] nb_samples <- length(qc.files) # read a specified module in all files # To read all modules, specify: modules = "all" qc <- qc_read_collection(qc.files, sample_names = paste('S', 1:nb_samples, sep = ''), modules = "Per base sequence quality")
# extract paths to the demo files qc.dir <- system.file("fastqc_results", package = "fastqcr") qc.files <- list.files(qc.dir, full.names = TRUE)[1:2] nb_samples <- length(qc.files) # read a specified module in all files # To read all modules, specify: modules = "all" qc <- qc_read_collection(qc.files, sample_names = paste('S', 1:nb_samples, sep = ''), modules = "Per base sequence quality")
Create an HTML file containing FastQC reports of one or multiple files. Inputs can be either a directory containing multiple FastQC reports or a single sample FastQC report.
qc_report( qc.path, result.file, experiment = NULL, interpret = FALSE, template = NULL, preview = TRUE )
qc_report( qc.path, result.file, experiment = NULL, interpret = FALSE, template = NULL, preview = TRUE )
qc.path |
path to the FastQC reports. Allowed values include:
|
result.file |
path to the result file prefix (e.g., path/to/qc-result). Don't add the file extension. |
experiment |
text specifying a short description of the experiment. For example experiment = "RNA sequencing of colon cancer cell lines". |
interpret |
logical value. If TRUE, adds the interpretation of each module. |
template |
a character vector specifying the path to an Rmd template. file. |
preview |
logical value. If TRUE, shows a preview of the report. |
## Not run: # Demo QC Directory qc.path <- system.file("fastqc_results", package = "fastqcr") qc.path # List of files in the directory list.files(qc.path) # Multi QC report qc_report(qc.path, result.file = "~/Desktop/result") # QC Report of one sample with plot interpretation qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc_report(qc.file, result.file = "~/Desktop/result", interpret = TRUE) ## End(Not run)
## Not run: # Demo QC Directory qc.path <- system.file("fastqc_results", package = "fastqcr") qc.path # List of files in the directory list.files(qc.path) # Multi QC report qc_report(qc.path, result.file = "~/Desktop/result") # QC Report of one sample with plot interpretation qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc_report(qc.file, result.file = "~/Desktop/result", interpret = TRUE) ## End(Not run)
Unzip all files in the FastQC result directory. Default is the current working directory.
qc_unzip(qc.dir = ".", rm.zip = TRUE)
qc_unzip(qc.dir = ".", rm.zip = TRUE)
qc.dir |
Path to the FastQC result directory. |
rm.zip |
logical. If TRUE, remove zipped files after extraction. Default is TRUE. |
## Not run: qc_unzip("FASTQC") ## End(Not run)
## Not run: qc_unzip("FASTQC") ## End(Not run)