vash
Fast genetic similarity estimation with hash tables
BayesicSpace::GenoTableHash Class Reference

Class to store compressed genotype tables.

#include <gvarHash.hpp>

Public Member Functions

 GenoTableHash ()
 Default constructor.
 
 GenoTableHash (const std::string &inputFileName, const IndividualAndSketchCounts &indivSketchCounts, const size_t &nThreads, std::string logFileName)
 Constructor with input file name and thread number.
 
 GenoTableHash (const std::string &inputFileName, const IndividualAndSketchCounts &indivSketchCounts, std::string logFileName)
 Constructor with input file name.
 
 GenoTableHash (const std::vector< int > &maCounts, const IndividualAndSketchCounts &indivSketchCounts, const size_t &nThreads, std::string logFileName)
 Constructor with count vector and thread number.
 
 GenoTableHash (const std::vector< int > &maCounts, const IndividualAndSketchCounts &indivSketchCounts, std::string logFileName)
 Constructor with count vector.
 
 GenoTableHash (const GenoTableHash &toCopy)=delete
 Copy constructor (deleted)
 
GenoTableHash & operator= (const GenoTableHash &toCopy)=delete
 Copy assignment operator (deleted)
 
 GenoTableHash (GenoTableHash &&toMove) noexcept=default
 Move constructor.
 
GenoTableHash & operator= (GenoTableHash &&toMove) noexcept=default
 Move assignment operator.
 
 ~GenoTableHash ()=default
 Destructor.
 
void allHashLD (const float &similarityCutOff, const InOutFileNames &bimAndLDnames, const size_t &suggestNchunks=static_cast< size_t >(1)) const
 All by all LD from hashes.
 
std::vector< HashGroup > makeLDgroups (const size_t &nRowsPerBand) const
 Assign groups by linkage disequilibrium (LD)
 
void makeLDgroups (const size_t &nRowsPerBand, const InOutFileNames &bimAndGroupNames) const
 Assign groups by LD and save to a file with locus names.
 
void ldInGroups (const SparsityParameters &sparsityValues, const InOutFileNames &bimAndLDnames, const size_t &suggestNchunks=static_cast< size_t >(1)) const
 LD in groups.
 
void saveLogFile () const
 Save the log to a file.
 

Detailed Description

Class to store compressed genotype tables.

Provides facilities to store and manipulate compressed genotype tables. Genotypes are stored in a one-bit format: the bit is set for the minor allele and unset for the major allele. Bits corresponding to missing data are unset (this is the same as mean imputation), and heterozygotes are set with 50% probability.
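
As an illustration of the one-bit format described above, the following is a minimal sketch of packing per-individual bits into bytes. The helper names `packBits` and `getBit`, and the bit order within a byte, are assumptions made for this example; they are not taken from the class and may differ from its internal layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one bit per individual into bytes, eight individuals per byte.
// Bit i of the input lands in byte i / 8 at position i % 8 (assumed order).
std::vector<std::uint8_t> packBits(const std::vector<bool> &bits) {
    std::vector<std::uint8_t> packed((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            packed[i / 8] |= static_cast<std::uint8_t>(1U << (i % 8));
        }
    }
    return packed;
}

// Read back the bit for individual i (true if the minor allele bit is set).
bool getBit(const std::vector<std::uint8_t> &packed, std::size_t i) {
    assert(i / 8 < packed.size());
    return ((packed[i / 8] >> (i % 8)) & 1U) != 0;
}
```

Padding bits in the last byte stay unset, which matches how unset bits are treated elsewhere (major allele or missing).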

Constructor & Destructor Documentation

◆ GenoTableHash() [1/5]

BayesicSpace::GenoTableHash::GenoTableHash ( const std::string & inputFileName,
const IndividualAndSketchCounts & indivSketchCounts,
const size_t & nThreads,
std::string logFileName )

Constructor with input file name and thread number.

The file should be in the plink .bed format. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm. The number of threads specified is the maximum that will be used; the actual number depends on system resources.

Parameters
[in]  inputFileName      input file name
[in]  indivSketchCounts  number of individuals and sketches
[in]  nThreads           maximal number of threads to use
[in]  logFileName        name of the log file
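
The permutation and hashing steps described for this constructor can be sketched as follows. This is a simplified illustration, not the class's implementation: the names `makePermutation`, `ophSketch`, and `kEmptyBin` are hypothetical, the number of individuals is assumed to divide evenly into bins, and empty bins are only marked with a sentinel rather than filled with the Mai et al. (2020) algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <utility>
#include <vector>

// Sentinel for a bin that contains no set bits (hypothetical name).
constexpr std::uint16_t kEmptyBin = std::numeric_limits<std::uint16_t>::max();

// Durstenfeld's in-place version of the Fisher-Yates shuffle,
// applied to the index vector 0, 1, ..., nElements - 1.
std::vector<std::size_t> makePermutation(std::size_t nElements, std::mt19937_64 &rng) {
    std::vector<std::size_t> perm(nElements);
    for (std::size_t i = 0; i < nElements; ++i) {
        perm[i] = i;
    }
    for (std::size_t i = nElements; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(perm[i - 1], perm[pick(rng)]);
    }
    return perm;
}

// One-permutation hash of a single locus: permute the individuals, split them
// into nSketches equal bins, and record the offset of the first set bit per bin.
std::vector<std::uint16_t> ophSketch(const std::vector<bool> &bits,
                                     const std::vector<std::size_t> &perm,
                                     std::size_t nSketches) {
    assert(bits.size() == perm.size());
    assert(nSketches > 0 && bits.size() % nSketches == 0);
    const std::size_t binSize = bits.size() / nSketches;
    assert(binSize < kEmptyBin);
    std::vector<std::uint16_t> sketch(nSketches, kEmptyBin);
    for (std::size_t iBin = 0; iBin < nSketches; ++iBin) {
        for (std::size_t offset = 0; offset < binSize; ++offset) {
            if (bits[perm[iBin * binSize + offset]]) {
                sketch[iBin] = static_cast<std::uint16_t>(offset);
                break;
            }
        }
    }
    return sketch;
}
```

The same permutation must be used for every locus, otherwise sketches from different loci are not comparable.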

◆ GenoTableHash() [2/5]

BayesicSpace::GenoTableHash::GenoTableHash ( const std::string & inputFileName,
const IndividualAndSketchCounts & indivSketchCounts,
std::string logFileName )
inline

Constructor with input file name.

The file should be in the plink .bed format. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele. The input is a vectorized matrix of genotypes. The original matrix has individuals on rows, and is vectorized by row. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm.

Parameters
[in]  inputFileName      input file name
[in]  indivSketchCounts  number of individuals and sketches
[in]  logFileName        name of the log file

◆ GenoTableHash() [3/5]

BayesicSpace::GenoTableHash::GenoTableHash ( const std::vector< int > & maCounts,
const IndividualAndSketchCounts & indivSketchCounts,
const size_t & nThreads,
std::string logFileName )

Constructor with count vector and thread number.

Input is a vector of minor allele counts (0, 1, or 2), with -9 for missing data. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This constructor should run faster if 0 encodes the major allele homozygote. While the above values are the norm, any negative number is interpreted as missing, any odd number as 1, and any non-zero even number as 2. The input is a vectorized matrix of genotypes. The original matrix has individuals on rows, and is vectorized by row. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm. The number of threads specified is the maximum that will be used; the actual number depends on system resources.

Parameters
[in]  maCounts           vector of minor allele numbers
[in]  indivSketchCounts  number of individuals and sketches
[in]  nThreads           maximal number of threads to use
[in]  logFileName        name of the log file
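
The count interpretation and re-coding rules described for this constructor can be sketched for a single locus as follows. The names `normalizeCount` and `binarizeLocus` are hypothetical, and the majority test used to decide on re-coding (more than half of the non-missing bits set) is an assumption made for this example rather than the class's documented criterion.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Map an arbitrary integer to -1 (missing), 0, 1, or 2 following the stated
// rules: negative is missing, odd is 1, non-zero even is 2.
int normalizeCount(int count) {
    if (count < 0) {
        return -1;
    }
    if (count % 2 != 0) {
        return 1;
    }
    return count == 0 ? 0 : 2;
}

// Convert the counts for one locus to one bit per individual.
// Heterozygotes are set with probability 0.5; missing values stay unset.
std::vector<bool> binarizeLocus(const std::vector<int> &counts, std::mt19937 &rng) {
    assert(!counts.empty());
    std::bernoulli_distribution coin(0.5);
    std::vector<bool> bits(counts.size(), false);
    std::vector<bool> missing(counts.size(), false);
    std::size_t nSet = 0;
    std::size_t nPresent = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const int genotype = normalizeCount(counts[i]);
        if (genotype < 0) {
            missing[i] = true;
            continue;
        }
        ++nPresent;
        if (genotype == 2 || (genotype == 1 && coin(rng))) {
            bits[i] = true;
            ++nSet;
        }
    }
    // If set bits are in the majority, the counted allele is the major one:
    // flip non-missing bits so that a set bit again marks the minor allele.
    if (2 * nSet > nPresent) {
        for (std::size_t i = 0; i < bits.size(); ++i) {
            if (!missing[i]) {
                bits[i] = !bits[i];
            }
        }
    }
    return bits;
}
```

Missing genotypes are tracked separately so that a flip does not turn them into minor alleles.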

◆ GenoTableHash() [4/5]

BayesicSpace::GenoTableHash::GenoTableHash ( const std::vector< int > & maCounts,
const IndividualAndSketchCounts & indivSketchCounts,
std::string logFileName )
inline

Constructor with count vector.

Input is a vector of minor allele counts (0, 1, or 2), with -9 for missing data. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This constructor should run faster if 0 encodes the major allele homozygote. While the above values are the norm, any negative number is interpreted as missing, any odd number as 1, and any non-zero even number as 2. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm.

Parameters
[in]  maCounts           vector of minor allele numbers
[in]  indivSketchCounts  number of individuals and sketches
[in]  logFileName        name of the log file

◆ GenoTableHash() [5/5]

BayesicSpace::GenoTableHash::GenoTableHash ( GenoTableHash && toMove)
default noexcept

Move constructor.

Parameters
[in]  toMove  object to move

Member Function Documentation

◆ allHashLD()

void BayesicSpace::GenoTableHash::allHashLD ( const float & similarityCutOff,
const InOutFileNames & bimAndLDnames,
const size_t & suggestNchunks = static_cast< size_t >(1) ) const

All by all LD from hashes.

Calculates linkage disequilibrium among all loci using a modified OPH. The result is a vectorized lower triangle of the symmetric \(N \times N\) similarity matrix, where \(N\) is the number of loci. All values belong to the same group. Row and column locus names are also included in the tab-delimited output file. The lower triangle is vectorized by column (i.e., all correlations of the first locus, then all remaining correlations of the second, etc.). If suggestNchunks is set, the data are processed in at least the given number of chunks, even if everything fits in RAM. If the resulting chunks are still too big to fit in RAM, the number is adjusted up. Otherwise, the number of chunks is set automatically. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.

Parameters
[in]  similarityCutOff  only save pairs with at least this similarity
[in]  bimAndLDnames     name of the .bim file with locus names and the output LD results file
[in]  suggestNchunks    force processing in chunks
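
The by-column ordering of the lower triangle can be illustrated with a small index calculation. This sketch assumes the diagonal is excluded and uses base-0 indexes; the function names `lowerTriangleIndex` and `lowerTrianglePairs` are hypothetical and not part of the class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Position of pair (row, col), row > col, in a lower triangle vectorized by
// column with the diagonal excluded: column 0 holds rows 1..N-1, and so on.
std::size_t lowerTriangleIndex(std::size_t row, std::size_t col, std::size_t nLoci) {
    assert(col < row && row < nLoci);
    // Number of elements in columns 0..col-1, then the offset within column col.
    return col * (2 * nLoci - col - 1) / 2 + (row - col - 1);
}

// All (row, col) pairs in the same by-column order.
std::vector<std::pair<std::size_t, std::size_t>> lowerTrianglePairs(std::size_t nLoci) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t col = 0; col + 1 < nLoci; ++col) {
        for (std::size_t row = col + 1; row < nLoci; ++row) {
            pairs.emplace_back(row, col);
        }
    }
    return pairs;
}
```

For \(N\) loci this ordering yields \(N(N-1)/2\) pairs, which is why chunked processing matters for large data sets.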

◆ ldInGroups()

void BayesicSpace::GenoTableHash::ldInGroups ( const SparsityParameters & sparsityValues,
const InOutFileNames & bimAndLDnames,
const size_t & suggestNchunks = static_cast< size_t >(1) ) const

LD in groups.

Groups loci according to LD using the algorithm for makeLDgroups and calculates similarity within groups. Outputs LD (Jaccard similarity) estimates with group IDs and locus names. If suggestNchunks is set, the data are processed in at least the given number of chunks, even if everything fits in RAM. If the resulting chunks are still too big to fit in RAM, the number is adjusted up. Otherwise, the number of chunks is set automatically. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.

Parameters
[in]  sparsityValues  SparsityParameters object that controls output matrix sparsity
[in]  bimAndLDnames   .bim and output LD file names
[in]  suggestNchunks  force processing in chunks
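
A common way to estimate Jaccard similarity from two OPH sketches is the fraction of bins whose values match. The sketch below shows that basic estimator under the assumption that both sketches have already had their empty bins filled; the function name `sketchJaccard` is hypothetical, and the class itself uses a modified OPH whose details may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimate Jaccard similarity between two loci as the proportion of
// sketch positions that hold the same value.
float sketchJaccard(const std::vector<std::uint16_t> &first,
                    const std::vector<std::uint16_t> &second) {
    assert(first.size() == second.size() && !first.empty());
    std::size_t nMatches = 0;
    for (std::size_t i = 0; i < first.size(); ++i) {
        if (first[i] == second[i]) {
            ++nMatches;
        }
    }
    return static_cast<float>(nMatches) / static_cast<float>(first.size());
}
```

The resolution of the estimate is one over the number of sketches, so more sketches give finer-grained similarity values at the cost of memory.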

◆ makeLDgroups() [1/2]

std::vector< HashGroup > BayesicSpace::GenoTableHash::makeLDgroups ( const size_t & nRowsPerBand) const

Assign groups by linkage disequilibrium (LD)

The sketch matrix is divided into bands, nRowsPerBand rows per band (must be 1 or greater). Locus pairs are included in the pair hash table if all rows in at least one band match. The resulting hash table has groups with at least two loci per group (indexed by a hash of the index vector in the group). Locus indexes are in increasing order within each group. Groups are sorted by first and second locus indexes. Some locus pairs may end up in more than one group, but no groups are completely identical in locus composition.

Parameters
[in]  nRowsPerBand  number of rows per sketch matrix band
Returns
locus index hash table
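
The banding scheme described above can be sketched as follows. This is an illustration under simplifying assumptions, not the class's implementation: each locus is held as its own sketch vector, bucket keys are the band contents themselves rather than hashes, trailing rows that do not fill a band are ignored, and the names `Sketch` and `bandGroups` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Sketch = std::vector<std::uint16_t>;

// Group loci whose sketches agree on every row of at least one band.
// Returns groups of two or more locus indexes, each in increasing order,
// with duplicate groups removed and groups sorted lexicographically.
std::vector<std::vector<std::size_t>> bandGroups(const std::vector<Sketch> &sketches,
                                                 std::size_t nRowsPerBand) {
    assert(nRowsPerBand >= 1);
    std::set<std::vector<std::size_t>> uniqueGroups;
    if (sketches.empty()) {
        return {};
    }
    const std::size_t nBands = sketches.front().size() / nRowsPerBand;
    for (std::size_t iBand = 0; iBand < nBands; ++iBand) {
        // Loci with identical band contents fall into the same bucket.
        std::map<Sketch, std::vector<std::size_t>> buckets;
        for (std::size_t iLocus = 0; iLocus < sketches.size(); ++iLocus) {
            assert(sketches[iLocus].size() == sketches.front().size());
            const auto start = sketches[iLocus].begin() +
                               static_cast<std::ptrdiff_t>(iBand * nRowsPerBand);
            const auto stop = start + static_cast<std::ptrdiff_t>(nRowsPerBand);
            buckets[Sketch(start, stop)].push_back(iLocus);
        }
        for (const auto &bucket : buckets) {
            if (bucket.second.size() >= 2) {
                uniqueGroups.insert(bucket.second);
            }
        }
    }
    return std::vector<std::vector<std::size_t>>(uniqueGroups.begin(), uniqueGroups.end());
}
```

Fewer rows per band make matches easier and groups larger; more rows per band demand closer agreement between loci before they are grouped.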

◆ makeLDgroups() [2/2]

void BayesicSpace::GenoTableHash::makeLDgroups ( const size_t & nRowsPerBand,
const InOutFileNames & bimAndGroupNames ) const

Assign groups by LD and save to a file with locus names.

Assign groups as above and save locus names with their group IDs to a file. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.

Parameters
[in]  nRowsPerBand      number of rows per sketch matrix band
[in]  bimAndGroupNames  .bim and output group file name

◆ operator=()

GenoTableHash & BayesicSpace::GenoTableHash::operator= ( GenoTableHash && toMove)
default noexcept

Move assignment operator.

Parameters
[in]  toMove  object to be moved
Returns
GenoTableHash object

◆ saveLogFile()

void BayesicSpace::GenoTableHash::saveLogFile ( ) const

Save the log to a file.

The log is written to the file whose name was provided at construction.


The documentation for this class was generated from the following file: