isoSeqQC
Quality assessment and re-mapping of poorly mapped isoSeq read segments
Loading...
Searching...
No Matches
Overview

A C++17 library and software to assess the quality of iso-Seq read alignments.

Dependencies

The project requires a C++17 compiler and depends on htslib for handling BAM files. A cmake script is included in the repository for easy building and installation. It requires cmake version 3.21 or later and automatically downloads htslib. If one wants to optionally build tests, Catch2 is also required and is automatically downloaded using the cmake script.

Download and install

Clone the repository:

git clone https://github.com/tonymugen/isoSeqQC

Next, create a build directory

cd isoSeqQC
mkdir build

Finally, run cmake to build and install the software

cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
cmake --install .

Installation may require root privileges.

Use exoncoverage

exoncoverage is a command line tool that takes a BAM alignment and a GFF annotation file and reports exon coverage statistics for each BAM read record, taking into account secondary alignments. Running it without any command line flags prints usage information. The flags are

--input-bam bam_file_name (input BAM file name; required).
--input-gff gff_file_name (input GFF file name; required).
--out out_file_name (output file name; required).
--threads number_of_threads (maximal number of threads to use; defaults to maximal available).

and can be specified in any order.

The output is a tab-delimited text file with largely self-explanatory column names. Columns that require special explanation are:

best_alignment_start          -- start position of best alignment, incorporating secondary alignments.
best_alignment_end            -- the same for the end position of the alignment.
n_good_secondary_alignments   -- number of secondary alignments that are cover the same region as the primary and are on the same strand.
n_local_reversed_strand       -- number of secondary alignments that are on the opposite strand but still in vicinity of the primary.
alignment_quality_string      -- a string of alignment match fractions for each exon, by position on the reference regardless of transcipt arrangement or strand.
best_alignment_quality_string -- the same, but taking into account secondary alignments.

Use partialmaps

partialmaps is a command line tool that takes a BAM alignment file and a GFF annotation file, identifies poorly mapped regions, exports data on these alignment gaps, and optionally saves the unmapped sequences to a FATSQ file for subsequent realignment. Alignment gap identification relies on a change point detection algorithm that uses sliding windows. Sliding window size is a user-controlled parameter. Running paritalmaps without any command line flags prints usage information. The flags are

--input-bam bam_file_name (input BAM file name; required).
--input-gff gff_file_name (input GFF file name; required).
--out out_file_name (output file name; required).
--out-fastq out_fastq_file_name (output FASTQ file name; optional, no FASTQ file created if omitted).
--window-size sliding window size for poorly aligned region identification (defaults to 75).
--threads number_of_threads (maximal number of threads to use; defaults to maximal available).

and can be specified in any order.

The output is a tab-delimited text file with the following columns:

read_name      -- read name.
read_length    -- read length.
unmapped_start -- 0-base index of the unmapped region start in the read.
unmapped_end   -- 0-base index of the unmapped region end in the read.
window_size    -- sliding window size.

FASTQ sequence identifier fields are read names, with read portion base-0 start and stop positions added, separated by underscores.

addremaps

addremaps is a command line tool that takes the initial alignment and a file with re-mapped read portions and adds successful re-maps to the original BAM records as a secondary alignment. CIGAR strings are modified to mark re-mapped regions as unaligned. Primary alignments from records without re-maps are transferred to the output unchanged. The output file is sorted by default, but this can be turned off. The CLI flags are

--input-bam bam_file_name (input BAM file name; required).
--remapped-bam remapped_bam_file_name (BAM file with remapped read portions; required).
--out out_file_name (output BAM file name; required).
--remap-cutoff identity cutoff for read portion remapping (defaults to 0.99).
--unsorted-output if set (with no value), the output BAM file is unsorted (otherwise, sorted).

As with the other tools, running addremaps without flags prints the above list.

Tests and documentation

Unit tests can be optionally built, without installing the software on the system:

mkdir buildTest
cd buildTest
cmake -DCMAKE_BUILD_TYPE=Test ..
cmake --build .
./tests

Library API documentation is in the header files, and can be optionally built using Doxygen:

cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_DOCS=ON
cmake --build .

in the build directory. This requires that doxygen and pdflatex are installed in the execution path.

Funding

This project is funded in part by an NIH NIGMS MIRA R35 GM133376 grant to Rebekah L. Rogers.