vash
Fast genetic similarity estimation with hash tables
BayesicSpace::GenoTableBin Class Reference

Class to store binary compressed genotype tables.

#include <gvarHash.hpp>

Public Member Functions

 GenoTableBin ()
 Default constructor.
 
 GenoTableBin (const std::string &inputFileName, const uint32_t &nIndividuals, std::string logFileName)
 Constructor with input file name.
 
 GenoTableBin (const std::string &inputFileName, const uint32_t &nIndividuals, std::string logFileName, const size_t &nThreads)
 Constructor with input file name and thread count.
 
 GenoTableBin (const std::vector< int > &maCounts, const uint32_t &nIndividuals, std::string logFileName)
 Constructor with count vector.
 
 GenoTableBin (const std::vector< int > &maCounts, const uint32_t &nIndividuals, std::string logFileName, const size_t &nThreads)
 Constructor with count vector and thread count.
 
 GenoTableBin (const GenoTableBin &toCopy)=delete
 Copy constructor (deleted)
 
GenoTableBin & operator= (const GenoTableBin &toCopy)=delete
 Copy assignment operator (deleted)
 
 GenoTableBin (GenoTableBin &&toMove) noexcept=default
 Move constructor.
 
GenoTableBin & operator= (GenoTableBin &&toMove) noexcept=default
 Move assignment operator.
 
 ~GenoTableBin ()=default
 Destructor.
 
void saveGenoBinary (const std::string &outFileName) const
 Save the binary genotype file.
 
void allJaccardLD (const InOutFileNames &bimAndLDnames, const size_t &suggestNchunks=static_cast< size_t >(1)) const
 All by all Jaccard similarity LD with locus names.
 
void saveLogFile () const
 Save the log to a file.
 

Detailed Description

Class to store binary compressed genotype tables.

Converts genotype data to a lossy compressed binary code. Genotypes are stored in memory in a one-bit format: the bit is set for the minor allele and unset for the major allele. Bits corresponding to missing data are unset (this is the same as mean imputation), and heterozygotes are set with 50% probability.
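The one-bit encoding described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `binarizeLocus` is a hypothetical helper, the input is assumed to be minor allele counts (0, 1, 2, negative for missing) for a single locus, and the bit layout (individual i in bit i % 8 of byte i / 8) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of the documented class): pack one locus
// into one bit per individual.
// Assumed layout: individual i is stored in bit (i % 8) of byte (i / 8).
std::vector<std::uint8_t> binarizeLocus(const std::vector<int> &counts, std::mt19937 &rng) {
    std::vector<std::uint8_t> packed( (counts.size() + 7) / 8, 0 );
    std::bernoulli_distribution coin(0.5);
    for (std::size_t i = 0; i < counts.size(); ++i) {
        bool setBit = false;                 // major homozygote and missing data stay unset
        if (counts[i] == 1) {
            setBit = coin(rng);              // heterozygote: set with 50% probability
        } else if (counts[i] == 2) {
            setBit = true;                   // minor allele homozygote: always set
        }
        if (setBit) {
            packed[i / 8] = static_cast<std::uint8_t>( packed[i / 8] | (1u << (i % 8)) );
        }
    }
    return packed;
}
```

Because heterozygotes are resolved by a random draw, two runs with different seeds can give different bit patterns for the same input; this is the lossy step.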

Constructor & Destructor Documentation

◆ GenoTableBin() [1/5]

BayesicSpace::GenoTableBin::GenoTableBin ( const std::string & inputFileName,
const uint32_t & nIndividuals,
std::string logFileName )
inline

Constructor with input file name.

The file should be in the plink .bed format. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele.

Parameters
[in]  inputFileName  input file name
[in]  nIndividuals  number of genotyped individuals
[in]  logFileName  name of the log file
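For orientation, the plink .bed format packs four genotypes per byte, two bits each, lowest-order bits first; this is why the number of individuals is needed to find locus boundaries. The sketch below decodes one byte under the standard plink 1 coding (00 homozygous for the first allele, 01 missing, 10 heterozygous, 11 homozygous for the second allele). The names `BedCall`, `decodeBedByte`, and `bedBytesPerLocus` are hypothetical and are not part of this class.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Two-bit genotype codes of the plink 1 .bed format, in numeric order 0..3.
enum class BedCall { HomA1 = 0, Missing = 1, Het = 2, HomA2 = 3 };

// Hypothetical helper: unpack the four genotypes stored in one .bed byte.
// The first individual occupies the two lowest-order bits.
std::array<BedCall, 4> decodeBedByte(std::uint8_t byte) {
    std::array<BedCall, 4> calls{};
    for (std::size_t k = 0; k < 4; ++k) {
        const unsigned code = (static_cast<unsigned>(byte) >> (2 * k)) & 0x03u;
        calls[k] = static_cast<BedCall>(code);
    }
    return calls;
}

// Hypothetical helper: bytes used per locus; each locus is padded to a whole byte.
std::size_t bedBytesPerLocus(std::uint32_t nIndividuals) {
    return (static_cast<std::size_t>(nIndividuals) + 3) / 4;
}
```

How the constructor itself reads and re-codes these bytes is not described here, so treat this only as a description of the input file layout.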

◆ GenoTableBin() [2/5]

BayesicSpace::GenoTableBin::GenoTableBin ( const std::string & inputFileName,
const uint32_t & nIndividuals,
std::string logFileName,
const size_t & nThreads )

Constructor with input file name and thread count.

The file should be in the plink .bed format. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele. The requested number of threads is the maximum that will be used; the actual number depends on available system resources.

Parameters
[in]  inputFileName  input file name
[in]  nIndividuals  number of genotyped individuals
[in]  logFileName  name of the log file
[in]  nThreads  maximal number of threads to use
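One way a requested thread count can be treated as an upper bound is to cap it at the hardware concurrency reported by the standard library. The sketch below only illustrates that idea; `effectiveThreads` is a hypothetical function, and the documentation does not say that the class computes its thread count this way.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helper: treat the request as a maximum, never exceed what
// the system reports, and never return fewer than one thread.
std::size_t effectiveThreads(std::size_t requested) {
    std::size_t available = std::thread::hardware_concurrency(); // may be 0 if unknown
    if (available == 0) {
        available = 1;
    }
    return std::max<std::size_t>( 1, std::min(requested, available) );
}
```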

◆ GenoTableBin() [3/5]

BayesicSpace::GenoTableBin::GenoTableBin ( const std::vector< int > & maCounts,
const uint32_t & nIndividuals,
std::string logFileName )
inline

Constructor with count vector.

Input is a vector of minor allele counts (0, 1, or 2), with -9 marking missing data. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This function should run faster if 0 is the major allele homozygote. While the above values are the norm, any negative number is interpreted as missing, any odd number as 1, and any non-zero even number as 2. The input is a vectorized matrix of genotypes: the original matrix has individuals on rows and is vectorized by row.

Parameters
[in]  maCounts  vector of minor allele numbers
[in]  nIndividuals  number of genotyped individuals
[in]  logFileName  name of the log file
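The value interpretation and the row-wise layout of the count vector can be written out as below. This is a sketch of the rules stated in the description; `Genotype`, `classifyCount`, and `rowMajorIndex` are hypothetical names and do not appear in the library interface.

```cpp
#include <cassert>
#include <cstddef>

enum class Genotype { Missing, HomMajor, Het, HomMinor };

// Hypothetical helper: apply the stated rules to one element of the count vector.
Genotype classifyCount(int count) {
    if (count < 0) {
        return Genotype::Missing;   // any negative value, -9 by convention
    }
    if (count == 0) {
        return Genotype::HomMajor;
    }
    if (count % 2 != 0) {
        return Genotype::Het;       // any odd value is treated as 1
    }
    return Genotype::HomMinor;      // any non-zero even value is treated as 2
}

// Hypothetical helper: position of (individual, locus) in a matrix with
// individuals on rows that has been vectorized by row.
std::size_t rowMajorIndex(std::size_t iIndividual, std::size_t iLocus, std::size_t nLoci) {
    assert(iLocus < nLoci);
    return iIndividual * nLoci + iLocus;
}
```

With this layout, the number of loci is the vector length divided by the number of individuals, so all genotypes of the first individual come first, followed by all genotypes of the second, and so on.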

◆ GenoTableBin() [4/5]

BayesicSpace::GenoTableBin::GenoTableBin ( const std::vector< int > & maCounts,
const uint32_t & nIndividuals,
std::string logFileName,
const size_t & nThreads )

Constructor with count vector and thread count.

Input is a vector of minor allele counts (0, 1, or 2), with -9 marking missing data. Heterozygotes are assigned the major or minor allele at random; missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This function should run faster if 0 is the major allele homozygote. While the above values are the norm, any negative number is interpreted as missing, any odd number as 1, and any non-zero even number as 2. The input is a vectorized matrix of genotypes: the original matrix has individuals on rows and is vectorized by row. The requested number of threads is the maximum that will be used; the actual number depends on available system resources.

Parameters
[in]  maCounts  vector of minor allele numbers
[in]  nIndividuals  number of genotyped individuals
[in]  logFileName  name of the log file
[in]  nThreads  maximal number of threads to use

◆ GenoTableBin() [5/5]

BayesicSpace::GenoTableBin::GenoTableBin ( GenoTableBin && toMove)
default noexcept

Move constructor.

Parameters
[in]  toMove  object to move

Member Function Documentation

◆ allJaccardLD()

void BayesicSpace::GenoTableBin::allJaccardLD ( const InOutFileNames & bimAndLDnames,
const size_t & suggestNchunks = static_cast< size_t >(1) ) const

All by all Jaccard similarity LD with locus names.

Calculates linkage disequilibrium among all loci using Jaccard similarity and \(r^2\) as the statistics. The result is a vectorized lower triangle of the symmetric \(N \times N\) similarity matrix, where \(N\) is the number of loci. Row and column locus names are also included in the tab-delimited output file. The lower triangle is vectorized by column (i.e. all correlations of the first locus, then all remaining correlations of the second, etc.). If the result does not fit in RAM, the calculation proceeds in blocks and the results are saved to disk periodically.

Parameters
[in]  bimAndLDnames  name of the input .bim file that has locus names and the output LD value file name
[in]  suggestNchunks  force processing in chunks
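The two ideas in the description, Jaccard similarity between one-bit loci and the column-wise ordering of the lower triangle, can be sketched as follows. All names are hypothetical. The sketch assumes that loci are stored as bit-packed bytes of equal length, that the diagonal is excluded from the output, and that an empty union yields a similarity of 0; none of these details is confirmed by the documentation. The \(r^2\) statistic is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count set bits in one byte (portable; no compiler intrinsics required).
unsigned countBits(std::uint8_t byte) {
    unsigned n = 0;
    while (byte != 0) {
        byte = static_cast<std::uint8_t>( byte & (byte - 1) ); // clear the lowest set bit
        ++n;
    }
    return n;
}

// Jaccard similarity |A and B| / |A or B| between two bit-packed loci.
// Returns 0 when neither locus has any set bit (an assumption of this sketch).
double jaccard(const std::vector<std::uint8_t> &a, const std::vector<std::uint8_t> &b) {
    assert(a.size() == b.size());
    unsigned nIntersect = 0;
    unsigned nUnion     = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        nIntersect += countBits( static_cast<std::uint8_t>(a[i] & b[i]) );
        nUnion     += countBits( static_cast<std::uint8_t>(a[i] | b[i]) );
    }
    return nUnion == 0 ? 0.0 : static_cast<double>(nIntersect) / static_cast<double>(nUnion);
}

// Position of the pair (row, col), row > col, in a lower triangle without
// the diagonal that has been vectorized by column.
std::size_t lowerTriangleIndex(std::size_t row, std::size_t col, std::size_t nLoci) {
    assert(col < row && row < nLoci);
    return col * nLoci - col * (col + 1) / 2 + (row - col - 1);
}
```

Under these assumptions, four loci give six pairs in the order (1,0), (2,0), (3,0), (2,1), (3,1), (3,2), so the output length for \(N\) loci is \(N(N-1)/2\).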

◆ operator=()

GenoTableBin & BayesicSpace::GenoTableBin::operator= ( GenoTableBin && toMove)
default noexcept

Move assignment operator.

Parameters
[in]  toMove  object to be moved
Returns
GenoTableBin object

◆ saveGenoBinary()

void BayesicSpace::GenoTableBin::saveGenoBinary ( const std::string & outFileName) const

Save the binary genotype file.

Saves the binary approximate genotype data to a binary file.

Parameters
[in]  outFileName  output file name

◆ saveLogFile()

void BayesicSpace::GenoTableBin::saveLogFile ( ) const

Save the log to a file.

The log is written to the file name provided at construction.

