Sample SNPs
Fast ordered sampling of rows from large text or binary files. Special cases for DNA variant files (.bed, VCF, HapMap, etc).
sampFiles::BedFileI Class Reference

BED file input class.

#include <varfiles.hpp>


Public Member Functions

 BedFileI ()
 Default constructor.
 
 BedFileI (const string &stubName)
 File name constructor.
 
 BedFileI (const BedFileI &in)=default
 Copy constructor.
 
BedFileI & operator= (const BedFileI &in)=default
 Copy assignment.
 
 BedFileI (BedFileI &&in)=default
 Move constructor.
 
BedFileI & operator= (BedFileI &&in)=default
 Move assignment.
 
 ~BedFileI ()
 Destructor.
 
void open ()
 Open stream to read.
 
void sample (BedFileO &out, const uint64_t &n)
 Sample SNPs and save to BED file.
 
void sampleLD (const uint64_t &n)
 Linkage disequilibrium among sampled sites.
 
void sampleLD (const PopIndex &popID, const uint64_t &n)
 LD among sampled sites within populations.
 
uint64_t nsnp ()
 Number of SNPs in the object.
 
uint64_t nindiv ()
 Number of individuals in the object.
 
- Public Member Functions inherited from sampFiles::BedFile
 BedFile ()
 Default constructor.
 
 BedFile (const string &stubName)
 File name constructor.
 
 BedFile (const BedFile &in)=default
 Copy constructor.
 
BedFile & operator= (const BedFile &in)=default
 Copy assignment.
 
 BedFile (BedFile &&in)=default
 Move constructor.
 
BedFile & operator= (BedFile &&in)=default
 Move assignment.
 
 ~BedFile ()
 Destructor.
 
void close ()
 Close stream.
 
- Public Member Functions inherited from sampFiles::GbinFile
 GbinFile ()
 Default constructor.
 
 GbinFile (const string &fileName, const size_t &nCols, const size_t &elemSize)
 Constructor with file name.
 
 GbinFile (const GbinFile &in)=default
 Copy constructor.
 
GbinFile & operator= (const GbinFile &in)=default
 Copy assignment.
 
 GbinFile (GbinFile &&in)=default
 Move constructor.
 
GbinFile & operator= (GbinFile &&in)=default
 Move assignment.
 
 ~GbinFile ()
 Destructor.
 
- Public Member Functions inherited from sampFiles::VarFile
 VarFile (const VarFile &in)=default
 Copy constructor.
 
VarFile & operator= (const VarFile &in)=default
 Copy assignment.
 
 VarFile (VarFile &&in)=default
 Move constructor.
 
VarFile & operator= (VarFile &&in)=default
 Move assignment.
 
 ~VarFile ()
 Destructor.
 

Protected Member Functions

uint64_t _numLines ()
 Get number of lines in the _bimFile.
 
uint64_t _famLines ()
 Get number of lines in the _famFile.
 
uint64_t _famLines (fstream &fam)
 Copy the .fam file and count number of lines.
 
void _ld (const char *snp1, const char *snp2, const size_t &N, const unsigned short &pad, double &rSq, double &Dprime, double &dcnt1, double &dcnt2)
 Between-SNP linkage disequilibrium (LD).
 
void _ld (const char *snp1, const char *snp2, const PopIndex &popID, vector< double > &rSq, vector< double > &Dprime, vector< double > &dcnt1, vector< double > &dcnt2)
 Between-SNP LD within populations.
 
- Protected Member Functions inherited from sampFiles::VarFile
 VarFile ()
 Default constructor (protected)
 

Additional Inherited Members

- Protected Attributes inherited from sampFiles::BedFile
fstream _famFile
 Corresponding .fam file stream.
 
fstream _bimFile
 Corresponding .bim file stream.
 
string _fileStub
 File name stub (minus the extension)
 
- Protected Attributes inherited from sampFiles::GbinFile
string _fileName
 File name.
 
size_t _nCols
 Number of elements in a row.
 
size_t _elemSize
 Size of each element in bytes.
 
- Protected Attributes inherited from sampFiles::VarFile
fstream _varFile
 Variant file stream.
 
- Static Protected Attributes inherited from sampFiles::BedFile
static const vector< char > _masks = {static_cast<char>(0x03), static_cast<char>(0x0C), static_cast<char>(0x30), static_cast<char>(0xC0)}
 Genotype bit masks.
 
static const unordered_map< char, string > _tests
 Genotype bit tests.
 

Detailed Description

BED file input class.

Reads BED files and the auxiliary files that come with them (.fam and .bim) as necessary. Only the SNP-major version is supported.
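As an illustration of the SNP-major layout, the sketch below unpacks one SNP's worth of packed genotypes into one two-bit code per individual. It assumes the standard PLINK .bed encoding (four genotypes per byte, lowest-order bit pair first) and reuses the mask values listed for the _masks attribute. The function name unpackSnp is hypothetical and is not part of sampFiles; the class's own decoding may be organized differently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of sampFiles.
// Unpacks one SNP (one row of a SNP-major .bed file) into per-individual codes.
// Assumed PLINK codes: 0b00 = homozygous first allele, 0b01 = missing,
// 0b10 = heterozygous, 0b11 = homozygous second allele.
// Precondition: packed holds at least ceil(nIndiv / 4) bytes.
std::vector<std::uint8_t> unpackSnp(const std::vector<char> &packed, std::size_t nIndiv) {
    // Same values as the documented _masks attribute, one mask per bit pair.
    static const std::array<std::uint8_t, 4> masks = {0x03, 0x0C, 0x30, 0xC0};
    std::vector<std::uint8_t> codes;
    codes.reserve(nIndiv);
    for (std::size_t i = 0; i < nIndiv; ++i) {
        const auto byte = static_cast<std::uint8_t>(packed[i / 4]);
        const std::size_t slot = i % 4;  // which bit pair within the byte
        // Isolate the bit pair, then shift it down to the range 0..3.
        codes.push_back(static_cast<std::uint8_t>((byte & masks[slot]) >> (2 * slot)));
    }
    return codes;
}
```

Passing the true number of individuals as nIndiv means any padding bit pairs in the last byte are never read.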

Constructor & Destructor Documentation

◆ BedFileI()

sampFiles::BedFileI::BedFileI ( const string &  stubName)
inline

File name constructor.

Parameters
[in]  stubName  file name minus the extension

Member Function Documentation

◆ _famLines() [1/2]

uint64_t BedFileI::_famLines ( )
protected

Get number of lines in the _famFile

Assumes Unix-like line endings. The result is equal to the number of individuals.

Returns
number of lines in _famFile

◆ _famLines() [2/2]

uint64_t BedFileI::_famLines ( fstream &  fam)
protected

Copy the .fam file and count number of lines.

Assumes Unix-like line endings. The result is equal to the number of individuals. The current object's .fam file is copied to the provided file stream, which should be open for reading. If it is not, the function throws a string object "Output .fam filestream not open".

Parameters
[in]  fam  .fam file stream
Returns
number of lines in _famFile

◆ _ld() [1/2]

void BedFileI::_ld ( const char *  snp1,
const char *  snp2,
const PopIndex &  popID,
vector< double > &  rSq,
vector< double > &  Dprime,
vector< double > &  dcnt1,
vector< double > &  dcnt2 
)
protected

Between-SNP LD within populations.

Calculates two LD statistics ( \( r^2 \) and \( D' \)) between two SNPs from a BED file. Missing values are ignored. If there are fewer than three haplotypes with data present at both loci, the return values are -9. This value is also returned if one of the loci is monomorphic after taking out missing data at the other SNP. Minor (not necessarily derived) allele counts are also reported to enable downstream filtering. Note that the populations are assumed diploid and the counts are of haploid chromosomes (i.e. one homozygote yields a count of 2). The values are calculated within each population as indicated by the PopIndex object. The results are returned in the supplied vectors, which are assumed to be of the correct size. Since this is an internal function not exposed to the user, the sizes are not checked, to save computation steps. Care must be taken that the char arrays passed to the function have lengths compatible with the number of individuals indexed by PopIndex. This is also not checked.

Parameters
[in]  snp1  first SNP
[in]  snp2  second SNP
[in]  popID  population index
[out]  rSq  vector of \( r^2 \) estimates
[out]  Dprime  vector of \( D' \) estimates
[out]  dcnt1  vector of minor allele counts at locus 1
[out]  dcnt2  vector of minor allele counts at locus 2

◆ _ld() [2/2]

void BedFileI::_ld ( const char *  snp1,
const char *  snp2,
const size_t &  N,
const unsigned short &  pad,
double &  rSq,
double &  Dprime,
double &  dcnt1,
double &  dcnt2 
)
protected

Between-SNP linkage disequilibrium (LD)

Calculates two LD statistics ( \( r^2 \) and \( D' \)) between two SNPs from a BED file. Missing values are ignored. If there are fewer than three haplotypes with data present at both loci, the return values are -9. This value is also returned if one of the loci is monomorphic after taking out missing data at the other SNP. Minor (not necessarily derived) allele counts are also reported to enable downstream filtering. Note that the populations are assumed diploid and the counts are of haploid chromosomes (i.e. one homozygote yields count of 2).

Parameters
[in]  snp1  first SNP
[in]  snp2  second SNP
[in]  N  length of the genotype vector in bytes (four genotypes per byte)
[in]  pad  number of bit pairs of padding in the last byte
[out]  rSq  the \( r^2 \) estimate
[out]  Dprime  the \( D' \) estimate
[out]  dcnt1  minor allele count at locus 1
[out]  dcnt2  minor allele count at locus 2
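For orientation, the sketch below computes \( r^2 \) and \( D' \) from the four two-locus haplotype counts using the textbook definitions, and returns -9 in the two situations described above. It is an illustration under stated assumptions, not the library's implementation: the documentation does not say how haplotype counts are obtained from unphased diploid genotypes, so the sketch starts from counts that are already available. The names LdResult and ldFromHaplotypeCounts are hypothetical, and \( D' \) is reported as an absolute value, which is one common convention.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical result type, not part of sampFiles.
struct LdResult {
    double rSq;
    double Dprime;
};

// Hypothetical helper, not part of sampFiles.
// n11: haplotypes carrying the counted allele at both loci,
// n10: at locus 1 only, n01: at locus 2 only, n00: at neither.
LdResult ldFromHaplotypeCounts(double n11, double n10, double n01, double n00) {
    const double missing = -9.0;  // missing-value code used in this documentation
    const double n = n11 + n10 + n01 + n00;
    if (n < 3.0) {
        return {missing, missing};  // fewer than three haplotypes with data
    }
    const double p1 = (n11 + n10) / n;  // allele frequency at locus 1
    const double p2 = (n11 + n01) / n;  // allele frequency at locus 2
    if (p1 <= 0.0 || p1 >= 1.0 || p2 <= 0.0 || p2 >= 1.0) {
        return {missing, missing};  // one of the loci is monomorphic
    }
    const double D = n11 / n - p1 * p2;  // disequilibrium coefficient
    const double rSq = (D * D) / (p1 * (1.0 - p1) * p2 * (1.0 - p2));
    // The normalizing constant depends on the sign of D.
    const double dMax = (D >= 0.0)
        ? std::min(p1 * (1.0 - p2), (1.0 - p1) * p2)
        : std::min(p1 * p2, (1.0 - p1) * (1.0 - p2));
    return {rSq, std::fabs(D) / dMax};
}
```

With both loci polymorphic, dMax is strictly positive, so the final division is safe.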

◆ _numLines()

uint64_t BedFileI::_numLines ( )
protected

Get number of lines in the _bimFile

Assumes Unix-like line endings. The result is equal to the number of SNPs.

Returns
number of lines in _bimFile

◆ sample()

void BedFileI::sample ( BedFileO &  out,
const uint64_t &  n 
)

Sample SNPs and save to BED file.

Sample \(n\) SNPs without replacement from the file represented by the current object and save them to the out object. Uses Vitter's [3] method. The number of samples has to be smaller than the number of SNPs in the file.

Parameters
[in]  out  output object
[in]  n  number of SNPs to sample
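To show what ordered sampling without replacement produces, the sketch below uses plain selection sampling (Knuth's Algorithm S). This is deliberately a simpler technique than Vitter's method: it tests every row index in turn, whereas Vitter's algorithms draw the number of rows to skip. The output has the same form, namely \(n\) distinct row indices in increasing order that can be read from the file in a single forward pass. The name sampleOrdered is hypothetical and not part of sampFiles.

```cpp
#include <cstdint>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of sampFiles.
// Selection sampling: returns n distinct indices from [0, total) in increasing order.
std::vector<std::uint64_t> sampleOrdered(std::uint64_t total, std::uint64_t n,
                                         std::mt19937_64 &rng) {
    if (n > total) {
        throw std::invalid_argument("sample size exceeds the number of rows");
    }
    std::vector<std::uint64_t> picked;
    picked.reserve(n);
    for (std::uint64_t i = 0; i < total && picked.size() < n; ++i) {
        const std::uint64_t remaining = total - i;        // rows not yet visited
        const std::uint64_t needed = n - picked.size();   // samples still to draw
        // Accept row i with probability needed / remaining.
        std::uniform_int_distribution<std::uint64_t> pick(0, remaining - 1);
        if (pick(rng) < needed) {
            picked.push_back(i);
        }
    }
    return picked;
}
```

When the number of rows left equals the number of samples still needed, every remaining row is accepted, so the function always returns exactly n indices.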

◆ sampleLD() [1/2]

void BedFileI::sampleLD ( const PopIndex &  popID,
const uint64_t &  n 
)

LD among sampled sites within populations.

Samples sequential pairs of SNPs and calculates two LD measures ( \( r^2 \) and \( D' \)) within populations indicated by PopIndex. Saves the results to a file named after the stub that precedes the .bed (and .bim, .fam) extension, with _LD.tsv appended. Each line is tab-delimited and lists the chromosome number (from the .bim file), the between-SNP distance, the non-reference allele count for each SNP, \( r^2 \), and \( D' \). Missing data are ignored (only pairwise-complete observations are included). If one of the SNPs is monomorphic or if the total number of pairwise present genotypes is fewer than three, the LD measures are returned as -9 to indicate missing values.

Parameters
[in]  popID  population index
[in]  n  number of SNP pairs to sample

◆ sampleLD() [2/2]

void BedFileI::sampleLD ( const uint64_t &  n)

Linkage disequilibrium among sampled sites.

Samples sequential pairs of SNPs and calculates two LD measures ( \( r^2 \) and \( D' \)). Saves the results to a file named after the stub that precedes the .bed (and .bim, .fam) extension, with _LD.tsv appended. Each line is tab-delimited and lists the chromosome number (from the .bim file), the between-SNP distance, the non-reference allele count for each SNP, \( r^2 \), and \( D' \). Missing data are ignored (only pairwise-complete observations are included). If one of the SNPs is monomorphic or if the total number of pairwise present genotypes is fewer than three, the LD measures are returned as -9 to indicate missing values.

Parameters
[in]  n  number of SNP pairs to sample
