|
isoSeqQC
Quality assessment and re-mapping of poorly mapped isoSeq read segments
|
A C++17 library and software to assess the quality of iso-Seq read alignments.
The project requires a C++17 compiler and depends on htslib for handling BAM files. A cmake script is included in the repository for easy building and installation. It requires cmake version 3.21 or later and automatically downloads htslib. If one wants to optionally build tests, Catch2 is also required and is automatically downloaded using the cmake script.
Clone the repository:
Next, create a build directory
Finally, run cmake to build and install the software
Installation may require root privileges.
exoncoverage is a command line tool that takes a BAM alignment and a GFF annotation file and reports exon coverage statistics for each BAM read record, taking into account secondary alignments. Running it without any command line flags prints usage information. The flags are
and can be specified in any order.
The output is a tab-delimited text file with largely self-explanatory column names. Columns that require special explanation are:
best_alignment_start -- start position of best alignment, incorporating secondary alignments. best_alignment_end -- the same for the end position of the alignment. n_good_secondary_alignments -- number of secondary alignments that are cover the same region as the primary and are on the same strand. n_local_reversed_strand -- number of secondary alignments that are on the opposite strand but still in vicinity of the primary. alignment_quality_string -- a string of alignment match fractions for each exon, by position on the reference regardless of transcipt arrangement or strand. best_alignment_quality_string -- the same, but taking into account secondary alignments.
partialmaps is a command line tool that takes a BAM alignment file and a GFF annotation file, identifies poorly mapped regions, exports data on these alignment gaps, and optionally saves the unmapped sequences to a FATSQ file for subsequent realignment. Alignment gap identification relies on a change point detection algorithm that uses sliding windows. Sliding window size is a user-controlled parameter. Running paritalmaps without any command line flags prints usage information. The flags are
and can be specified in any order.
The output is a tab-delimited text file with the following columns:
read_name -- read name. read_length -- read length. unmapped_start -- 0-base index of the unmapped region start in the read. unmapped_end -- 0-base index of the unmapped region end in the read. window_size -- sliding window size.
FASTQ sequence identifier fields are read names, with read portion base-0 start and stop positions added, separated by underscores.
addremaps is a command line tool that takes the initial alignment and a file with re-mapped read portions and adds successful re-maps to the original BAM records as a secondary alignment. CIGAR strings are modified to mark re-mapped regions as unaligned. Primary alignments from records without re-maps are transferred to the output unchanged. The output file is sorted by default, but this can be turned off. The CLI flags are
As with the other tools, running addremaps without flags prints the above list.
Unit tests can be optionally built, without installing the software on the system:
Library API documentation is in the header files, and can be optionally built using Doxygen:
in the build directory. This requires that doxygen and pdflatex are installed in the execution path.
This project is funded in part by an NIH NIGMS MIRA R35 GM133376 grant to Rebekah L. Rogers.