vash
Fast genetic similarity estimation with hash tables
Loading...
Searching...
No Matches
Overview

A C++14 library and software to efficiently summarize genetic polymorphism statistics from binary variant files. Linkage disequilibrium (LD) between pairs of loci is estimated using Locality-Sensitive Hashing (LSH), in particular a variant of the one-permutation hash Mai et al., 2020. These hashes can be used to efficiently group loci by similarity in hash tables. Jaccard similarity is used as the LD statistic because this is the measure approximated by LSH collision probability.

The gist of the method is that each locus (or individual, although the latter functionality is not yet available) data are converted to one-bit encoding by setting the major allele (as well as missing genotype values) to 0, minor allele to 1, and heterozygotes to 1 with 50% probability. The resulting binary data are hashed, the hashes banded (as described, e.g., in the Mining massive data sets book, Chapter 3.4), and loci assigned to buckets in an unordered hash table. Loci end up in the same bucket if at least one hash band is identical between them. LD is then estimated within buckets only. Thus, if high-LD pairs are rare in a data set, many fewer than \(N(N - 1)/2 \) pairwise LD values are estimated.

Dependencies

Building the library and binaries requires a C++14 compiler. The build process requires cmake version 3.11 or later, but everything can be compiled by hand if desired. Current implementation requires x86_64 processors and has only been tested on Linux. I plan to add 64-bit ARM (e.g., Apple M processor) support. The project uses my utilities repository. It is automatically downloaded and included in build process if using cmake.

Download and install

Clone the repository:

git clone https://github.com/tonymugen/vash

Next, create a build directory

cd vash
mkdir build

Finally, run cmake to build and install the software

cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
cmake --install .

Installation may require root privileges.

Building tests and documentation

Building and running unit tests is optional. To do so, after creating the build directory as above, run

cmake -DCMAKE_BUILD_TYPE=Test -DBUILD_TESTS=ON
cmake --build .
./tests

Tests require Catch2, which is installed as part of the build.

Building documentation is also optional, and can be done using

cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_DOCS=ON
cmake --build .

in the build directory. This requires that Doxygen and pdflatex are installed in the execution path.

Use ldblocks

lblocks is a command line tool that estimates LD among loci in a plink .bed file. Running it without command line flags prints the flags and their possible values. Most flags are self-explanatory, but setting values to some of them requires special consideration.

--hash-size       - bigger values result in better estimates of LD, but at the expense of speed.
    Based on some experimentation, minimal recommended value is 40, and 100 seems sufficient.
    Setting this flag to 0 leads to full (not hash-based) Jaccard estimates among all pairs.

--n-rows-per-band - is the banding parameter that controls sparsity.
    Setting it high (1/5 the hash size or more) leads to lower probability of
    inclusion of low to moderately similar pairs.
    Setting this flag 0 leads to all pairwise estates to be calculated.
    Beware, since this can result in huge output files.
    The software does its best to not use up RAM, but this has only been tested on Linux.

--only-groups     - no LD calculations are performed and only groups
    and the loci they contain are saved to a file.

--add-locus-names - locus names from the corresponding `.bim` file are outputinstead
    instead the default (base-1) locus indexes. This may result in larger output files.

Output files are tab-delimited and include locus pair indexes and Jaccard similarity estimates. The range of Jaccard estimates (0.0 to 1.0) is discretized into 256 bins, so not all floating point values are represented.

Running the software on whole genomes with millions of loci should not tax RAM (the software keeps track of free memory and only uses about half available RAM), but can still tax disk space. In addition, some pairs can be assigned to more than one group. Removal of these duplicates requires in-memory operations, so if partial results are written to disk some of these duplicates are retained. I recommend using the --only-groups flags in preliminary runs to get a sense of the number of locus pairs that will result given a set of parameters.

Library interface documentation can be found here.