Class to store compressed genotype tables.

#include <gvarHash.hpp>

Public Member Functions
	GenoTableHash ()
	Default constructor.

	GenoTableHash (const std::string &inputFileName, const IndividualAndSketchCounts &indivSketchCounts, const size_t &nThreads, std::string logFileName)
	Constructor with input file name and thread number.

	GenoTableHash (const std::string &inputFileName, const IndividualAndSketchCounts &indivSketchCounts, std::string logFileName)
	Constructor with input file name.

	GenoTableHash (const std::vector< int > &maCounts, const IndividualAndSketchCounts &indivSketchCounts, const size_t &nThreads, std::string logFileName)
	Constructor with count vector and thread number.

	GenoTableHash (const std::vector< int > &maCounts, const IndividualAndSketchCounts &indivSketchCounts, std::string logFileName)
	Constructor with count vector.

	GenoTableHash (const GenoTableHash &toCopy)=delete
	Copy constructor (deleted)

GenoTableHash &	operator= (const GenoTableHash &toCopy)=delete
	Copy assignment operator (deleted)

	GenoTableHash (GenoTableHash &&toMove) noexcept=default
	Move constructor.

GenoTableHash &	operator= (GenoTableHash &&toMove) noexcept=default
	Move assignment operator.

	~GenoTableHash ()=default
	Destructor.

void	allHashLD (const float &similarityCutOff, const InOutFileNames &bimAndLDnames, const size_t &suggestNchunks=static_cast< size_t >(1)) const
	All by all LD from hashes.

std::vector< HashGroup >	makeLDgroups (const size_t &nRowsPerBand) const
	Assign groups by linkage disequilibrium (LD)

void	makeLDgroups (const size_t &nRowsPerBand, const InOutFileNames &bimAndGroupNames) const
	Assign groups by LD and save to a file with locus names.

void	ldInGroups (const SparsityParameters &sparsityValues, const InOutFileNames &bimAndLDnames, const size_t &suggestNchunks=static_cast< size_t >(1)) const
	LD in groups.

void	saveLogFile () const
	Save the log to a file.

Detailed Description

Class to store compressed genotype tables.

Provides facilities to store and manipulate compressed genotype tables. Genotypes are stored in a one-bit format: the bit is set for the minor allele and unset for the major allele. Bits corresponding to missing data are unset (this is the same as mean imputation). Heterozygote bits are set with 50% probability.
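The one-bit encoding described above can be illustrated with a minimal sketch. The function name `binarizeLocus`, the byte-wise packing, and the bit order within a byte are assumptions made for illustration; they are not taken from the class implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper, not part of GenoTableHash.
// counts: minor allele counts for one locus, already normalized to
// 0, 1, 2, or a negative value for missing data.
std::vector<std::uint8_t> binarizeLocus(const std::vector<int> &counts, std::mt19937 &rng) {
    constexpr std::size_t bitsPerByte = 8;
    std::vector<std::uint8_t> packed((counts.size() + bitsPerByte - 1) / bitsPerByte, 0);
    std::bernoulli_distribution coin(0.5);
    for (std::size_t iInd = 0; iInd < counts.size(); ++iInd) {
        assert(counts[iInd] <= 2);       // precondition: normalized input
        bool setBit = false;
        if (counts[iInd] == 2) {
            setBit = true;               // minor allele homozygote
        } else if (counts[iInd] == 1) {
            setBit = coin(rng);          // heterozygote: set with probability 0.5
        }                                // 0 and missing (negative) stay unset
        if (setBit) {
            packed[iInd / bitsPerByte] |= static_cast<std::uint8_t>(1U << (iInd % bitsPerByte));
        }
    }
    return packed;
}
```

Nine individuals occupy two bytes in this layout; the unused high bits of the last byte stay unset.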

Constructor & Destructor Documentation

◆ GenoTableHash() [1/5]

BayesicSpace::GenoTableHash::GenoTableHash	(	const std::string &	inputFileName,
		const IndividualAndSketchCounts &	indivSketchCounts,
		const size_t &	nThreads,
		std::string	logFileName )

Constructor with input file name and thread number.

The file should be in the PLINK .bed format. Heterozygotes are assigned the major or minor allele at random, and missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm. The number of threads specified is the maximum that will be used; the actual number depends on system resources.

Parameters

[in]	inputFileName	input file name
[in]	indivSketchCounts	number of individuals and sketches
[in]	nThreads	maximal number of threads to use
[in]	logFileName	name of the log file
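The hashing steps named in the description (a Fisher-Yates-Durstenfeld permutation followed by one-permutation hashing) can be sketched as follows. This is a simplified illustration, not the class implementation: the function names, the `std::vector<bool>` input, the 16-bit bin values, and the `emptyBin` marker are all assumptions. Empty bins are only flagged here; the Mai et al. (2020) fill-in step is not reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates-Durstenfeld shuffle of the indexes 0 .. n-1.
std::vector<std::size_t> makePermutation(std::size_t n, std::mt19937_64 &rng) {
    std::vector<std::size_t> perm(n);
    for (std::size_t i = 0; i < n; ++i) {
        perm[i] = i;
    }
    for (std::size_t i = n; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(perm[i - 1], perm[pick(rng)]);   // swap last unshuffled element with a random one
    }
    return perm;
}

// Marker for bins that received no set bit (hypothetical convention).
constexpr std::uint16_t emptyBin = std::numeric_limits<std::uint16_t>::max();

// One-permutation hash of a single locus: permute bit positions, split them
// into nSketches equal bins, and keep the smallest set-bit offset per bin.
std::vector<std::uint16_t> ophSketch(const std::vector<bool> &locusBits,
                                     const std::vector<std::size_t> &perm,
                                     std::size_t nSketches) {
    assert(nSketches > 0 && locusBits.size() % nSketches == 0);
    assert(perm.size() == locusBits.size());
    const std::size_t binSize = locusBits.size() / nSketches;
    assert(binSize < emptyBin);                    // offsets must fit in 16 bits
    std::vector<std::uint16_t> sketch(nSketches, emptyBin);
    for (std::size_t iBit = 0; iBit < locusBits.size(); ++iBit) {
        if (!locusBits[iBit]) {
            continue;
        }
        const std::size_t newPos = perm[iBit];     // position after permutation
        const std::size_t bin    = newPos / binSize;
        const auto offset        = static_cast<std::uint16_t>(newPos % binSize);
        sketch[bin] = std::min(sketch[bin], offset);
    }
    return sketch;
}
```

The same permutation must be applied to every locus so that sketches remain comparable across loci.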

◆ GenoTableHash() [2/5]

BayesicSpace::GenoTableHash::GenoTableHash	(	const std::string &	inputFileName,
		const IndividualAndSketchCounts &	indivSketchCounts,
		std::string	logFileName )

inline

Constructor with input file name.

The file should be in the PLINK .bed format. Heterozygotes are assigned the major or minor allele at random, and missing genotypes are assigned the major allele. If necessary, alleles are re-coded so that the set bit is always the minor allele. The input is a vectorized matrix of genotypes. The original matrix has individuals on rows, and is vectorized by row. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm.

Parameters

[in]	inputFileName	input file name
[in]	indivSketchCounts	number of individuals and sketches
[in]	logFileName	name of the log file

◆ GenoTableHash() [3/5]

BayesicSpace::GenoTableHash::GenoTableHash	(	const std::vector< int > &	maCounts,
		const IndividualAndSketchCounts &	indivSketchCounts,
		const size_t &	nThreads,
		std::string	logFileName )

Constructor with count vector and thread number.

Input is a vector of minor allele counts (0, 1, or 2) or -9 for missing data. Heterozygotes are assigned the major or minor allele at random, and missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This function should run faster if 0 is the major allele homozygote. While the above values are the norm, any negative number will be interpreted as missing, any odd number as 1, and any non-zero even number as 2. The input is a vectorized matrix of genotypes. The original matrix has individuals on rows, and is vectorized by row. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm. The number of threads specified is the maximum that will be used; the actual number depends on system resources.

Parameters

[in]	maCounts	vector of minor allele counts
[in]	indivSketchCounts	number of individuals and sketches
[in]	nThreads	maximal number of threads to use
[in]	logFileName	name of the log file
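The interpretation rules for the count vector can be made concrete with a short sketch. All three helper names are hypothetical. The row-major index `iInd * nLoci + iLocus` is one reading of "individuals on rows, vectorized by row", and the re-coding test (flip when the counted allele has a frequency above 0.5) is an assumption about how the check could work, not the class's actual procedure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map an arbitrary integer onto -9 (missing), 0, 1, or 2.
int normalizeCount(int rawCount) {
    if (rawCount < 0) {
        return -9;                           // any negative value: missing
    }
    if (rawCount == 0) {
        return 0;
    }
    return (rawCount % 2 != 0) ? 1 : 2;      // odd: heterozygote; non-zero even: homozygote
}

// Pull one locus out of a row-vectorized individuals-by-loci matrix.
std::vector<int> locusColumn(const std::vector<int> &maCounts,
                             std::size_t nIndividuals, std::size_t iLocus) {
    assert(nIndividuals > 0 && maCounts.size() % nIndividuals == 0);
    const std::size_t nLoci = maCounts.size() / nIndividuals;
    assert(iLocus < nLoci);
    std::vector<int> column(nIndividuals);
    for (std::size_t iInd = 0; iInd < nIndividuals; ++iInd) {
        column[iInd] = normalizeCount(maCounts[iInd * nLoci + iLocus]);
    }
    return column;
}

// Flip the coding if the counted allele is actually the major one.
void recodeToMinor(std::vector<int> &column) {
    long alleleSum = 0;
    long nPresent  = 0;
    for (int count : column) {
        if (count >= 0) {
            alleleSum += count;
            ++nPresent;
        }
    }
    if (alleleSum > nPresent) {              // allele frequency above 0.5
        for (int &count : column) {
            if (count >= 0) {
                count = 2 - count;           // missing values are left alone
            }
        }
    }
}
```

When 0 already codes the major allele homozygote, the flip loop never runs, which is consistent with the note that such input should be faster to process.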

◆ GenoTableHash() [4/5]

BayesicSpace::GenoTableHash::GenoTableHash	(	const std::vector< int > &	maCounts,
		const IndividualAndSketchCounts &	indivSketchCounts,
		std::string	logFileName )

inline

Constructor with count vector.

Input is a vector of minor allele counts (0, 1, or 2) or -9 for missing data. Heterozygotes are assigned the major or minor allele at random, and missing genotypes are assigned the major allele. The counts are checked and re-coded if necessary so that set bits represent the minor allele. This function should run faster if 0 is the major allele homozygote. While the above values are the norm, any negative number will be interpreted as missing, any odd number as 1, and any non-zero even number as 2. The binary stream is then hashed using a one-permutation hash (OPH; one sketch per locus). Bits are permuted using the Fisher-Yates-Durstenfeld algorithm. Empty bins are filled using the Mai et al. (2020) algorithm.

Parameters

[in]	maCounts	vector of minor allele counts
[in]	indivSketchCounts	number of individuals and sketches
[in]	logFileName	name of the log file

◆ GenoTableHash() [5/5]

BayesicSpace::GenoTableHash::GenoTableHash ( GenoTableHash && toMove )

default noexcept

Move constructor.

Parameters

[in] toMove object to move

Member Function Documentation

◆ allHashLD()

void BayesicSpace::GenoTableHash::allHashLD	(	const float &	similarityCutOff,
		const InOutFileNames &	bimAndLDnames,
		const size_t &	suggestNchunks = static_cast< size_t >(1) ) const

All by all LD from hashes.

Calculates linkage disequilibrium among all loci using a modified OPH. The result is a vectorized lower triangle of the symmetric \(N \times N\) similarity matrix, where \(N\) is the number of loci. All values belong to the same group. Row and column locus names are also included in the tab-delimited output file. The lower triangle is vectorized by column (i.e., all correlations of the first locus, then all remaining correlations of the second, etc.). If suggestNchunks is set, the data are processed in at least the given number of chunks even if everything fits in RAM. If the resulting chunks are still too big to fit in RAM, the number is adjusted up. Otherwise, the number of chunks is set automatically. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.
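The column-wise vectorization of the lower triangle can be illustrated with an index mapping. The sketch assumes the diagonal is excluded, so there are \(N(N-1)/2\) pairs; both function names are hypothetical and indexes are base-0.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Position of the pair (iCol, jRow), iCol < jRow < nLoci, in the
// column-wise vectorized lower triangle without the diagonal.
std::size_t pairToIndex(std::size_t iCol, std::size_t jRow, std::size_t nLoci) {
    assert(iCol < jRow && jRow < nLoci);
    // elements in columns before iCol, plus the offset within column iCol
    return iCol * (2 * nLoci - iCol - 1) / 2 + (jRow - iCol - 1);
}

// Inverse mapping: walk over columns until the index falls inside one.
std::pair<std::size_t, std::size_t> indexToPair(std::size_t index, std::size_t nLoci) {
    assert(nLoci > 1 && index < nLoci * (nLoci - 1) / 2);
    std::size_t iCol   = 0;
    std::size_t colLen = nLoci - 1;          // column 0 holds nLoci - 1 pairs
    while (index >= colLen) {
        index -= colLen;
        ++iCol;
        --colLen;
    }
    return {iCol, iCol + 1 + index};
}
```

With four loci the pairs are ordered (0,1), (0,2), (0,3), (1,2), (1,3), (2,3), matching "all correlations of the first locus, then all remaining correlations of the second".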

Parameters

[in]	similarityCutOff	only save pairs with at least this similarity
[in]	bimAndLDnames	name of the .bim file with locus names and the output LD results file
[in]	suggestNchunks	force processing in chunks
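The chunking rule can be summarized as taking the larger of the suggested chunk number and the number needed to fit in RAM. The sketch below is only a reading of that rule; the function name and the `maxPairsInRam` parameter are hypothetical, and how the class measures available memory is not shown in this documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of chunks for nPairs locus pairs when at most
// maxPairsInRam pairs fit in memory at once.
std::size_t chooseChunkCount(std::size_t nPairs, std::size_t maxPairsInRam,
                             std::size_t suggestNchunks = 1) {
    assert(maxPairsInRam > 0);
    // smallest chunk number for which every chunk fits in RAM (ceiling division)
    const std::size_t minForRam = (nPairs + maxPairsInRam - 1) / maxPairsInRam;
    return std::max({suggestNchunks, minForRam, std::size_t{1}});
}
```

Leaving `suggestNchunks` at its default of 1 makes the result depend only on memory, which corresponds to the automatic setting.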

◆ ldInGroups()

void BayesicSpace::GenoTableHash::ldInGroups	(	const SparsityParameters &	sparsityValues,
		const InOutFileNames &	bimAndLDnames,
		const size_t &	suggestNchunks = static_cast< size_t >(1) ) const

LD in groups.

Groups loci according to LD using the algorithm for makeLDgroups and calculates similarity within groups. Outputs LD (Jaccard similarity) estimates with group IDs and locus names. If suggestNchunks is set, the data are processed in at least the given number of chunks even if everything fits in RAM. If the resulting chunks are still too big to fit in RAM, the number is adjusted up. Otherwise, the number of chunks is set automatically. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.

Parameters

[in]	sparsityValues	`SparsityParameters` object that controls output matrix sparsity
[in]	bimAndLDnames	.bim and output LD file names
[in]	suggestNchunks	force processing in chunks
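A Jaccard similarity estimate from two OPH sketches is commonly computed as the fraction of bins whose values match. The sketch below shows that estimator under the assumption that both sketches have the same length and no empty bins remain; the function name and the 16-bit sketch type are illustrative and may differ from what the class uses internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: fraction of matching bins between two sketches.
float jaccardEstimate(const std::vector<std::uint16_t> &sketchA,
                      const std::vector<std::uint16_t> &sketchB) {
    assert(!sketchA.empty() && sketchA.size() == sketchB.size());
    std::size_t nMatches = 0;
    for (std::size_t iBin = 0; iBin < sketchA.size(); ++iBin) {
        if (sketchA[iBin] == sketchB[iBin]) {
            ++nMatches;
        }
    }
    return static_cast<float>(nMatches) / static_cast<float>(sketchA.size());
}
```

The resolution of the estimate is one over the number of sketches, so more sketches per locus give finer-grained similarity values.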

◆ makeLDgroups() [1/2]

std::vector< HashGroup > BayesicSpace::GenoTableHash::makeLDgroups ( const size_t & nRowsPerBand ) const

Assign groups by linkage disequilibrium (LD)

The sketch matrix is divided into bands with nRowsPerBand rows per band (the value must be 1 or greater). Locus pairs are included in the pair hash table if all rows in at least one band match. The resulting hash table has groups with at least two loci per group (indexed by a hash of the index vector in the group). Locus indexes are in increasing order within each group. Groups are sorted by first and second locus indexes. Some locus pairs may end up in more than one group, but no two groups are completely identical in locus composition.

Parameters

[in] nRowsPerBand number of rows per sketch matrix band

Returns: locus index hash table
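The banding scheme can be sketched with ordinary containers. This illustration returns plain index vectors rather than `HashGroup` objects (whose layout is not shown in this documentation), keys buckets by the band content itself instead of a hash, and ignores leftover rows that do not fill a whole band. The name `bandGroups` and these choices are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Sketch = std::vector<std::uint16_t>;

// sketches[iLocus] is the sketch (one column of the sketch matrix) for a locus.
std::vector<std::vector<std::size_t>> bandGroups(const std::vector<Sketch> &sketches,
                                                 std::size_t nRowsPerBand) {
    assert(nRowsPerBand >= 1);
    std::vector<std::vector<std::size_t>> groups;
    if (sketches.empty()) {
        return groups;
    }
    const std::size_t nBands = sketches.front().size() / nRowsPerBand;
    for (std::size_t iBand = 0; iBand < nBands; ++iBand) {
        // loci whose rows all match within this band share a bucket
        std::map<Sketch, std::vector<std::size_t>> buckets;
        for (std::size_t iLocus = 0; iLocus < sketches.size(); ++iLocus) {
            const auto first = sketches[iLocus].begin()
                             + static_cast<std::ptrdiff_t>(iBand * nRowsPerBand);
            const Sketch key(first, first + static_cast<std::ptrdiff_t>(nRowsPerBand));
            buckets[key].push_back(iLocus);  // indexes arrive in increasing order
        }
        for (const auto &bucket : buckets) {
            if (bucket.second.size() >= 2) { // keep groups with at least two loci
                groups.push_back(bucket.second);
            }
        }
    }
    // lexicographic sort orders groups by first, then second locus index;
    // unique() then drops groups identical in locus composition
    std::sort(groups.begin(), groups.end());
    groups.erase(std::unique(groups.begin(), groups.end()), groups.end());
    return groups;
}
```

Fewer rows per band make matches easier, producing larger groups; more rows per band make grouping stricter.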

◆ makeLDgroups() [2/2]

void BayesicSpace::GenoTableHash::makeLDgroups	(	const size_t &	nRowsPerBand,
		const InOutFileNames &	bimAndGroupNames ) const

Assign groups by LD and save to a file with locus names.

Assign groups as above and save locus names with their group IDs to a file. If the .bim file name is left blank or the file does not exist, base-1 locus indexes are used instead of locus names.

Parameters

[in]	nRowsPerBand	number of rows per sketch matrix band
[in]	bimAndGroupNames	.bim and output group file name

◆ operator=()

GenoTableHash & BayesicSpace::GenoTableHash::operator= ( GenoTableHash && toMove )

default noexcept

Move assignment operator.

Parameters

[in] toMove object to be moved

Returns: GenoTableHash object

◆ saveLogFile()

void BayesicSpace::GenoTableHash::saveLogFile ( ) const

Save the log to a file.

The log is written to the file name provided at construction.

The documentation for this class was generated from the following file:

gvarHash.hpp
