TCudaMatrix Class.
The TCudaMatrix class represents matrices on a CUDA device. The elements of the matrix are stored in a TCudaDeviceBuffer object, which takes care of allocating and freeing the device memory. TCudaMatrix objects are lightweight: assignment and copy construction perform only a shallow copy, and no new element buffer is allocated. To perform a deep copy, use the static Copy method of the TCuda architecture class.
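The difference between the shallow copy and an explicit deep copy can be illustrated with a host-only sketch. It does not use the real TCudaMatrix or TCudaDeviceBuffer; HostMatrixSketch, DeepCopy and ShallowCopySharesBuffer are hypothetical names, a std::shared_ptr stands in for the shared device buffer, and the column-major layout is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical host-side stand-in for a device matrix. Copies share one
// element buffer, which mimics the shallow-copy behaviour described above.
struct HostMatrixSketch {
    std::shared_ptr<std::vector<double>> buffer; // shared element storage
    std::size_t nRows;
    std::size_t nCols;

    HostMatrixSketch(std::size_t m, std::size_t n)
        : buffer(std::make_shared<std::vector<double>>(m * n, 0.0)), nRows(m), nCols(n) {}

    // Column-major element access (an assumption of this sketch).
    double &at(std::size_t i, std::size_t j) {
        assert(i < nRows && j < nCols);
        return (*buffer)[j * nRows + i];
    }
};

// Explicit deep copy: allocates a new buffer and copies every element.
// This plays the role that the static Copy method plays in the text.
inline HostMatrixSketch DeepCopy(const HostMatrixSketch &src) {
    HostMatrixSketch dst(src.nRows, src.nCols);
    *dst.buffer = *src.buffer;
    return dst;
}

// Returns true if a write through the original is visible in the shallow
// copy but not in the deep copy.
inline bool ShallowCopySharesBuffer() {
    HostMatrixSketch a(2, 2);
    HostMatrixSketch shallow = a;        // shares a's buffer
    HostMatrixSketch deep = DeepCopy(a); // owns a separate buffer
    a.at(0, 1) = 3.0;
    return shallow.at(0, 1) == 3.0 && deep.at(0, 1) == 0.0;
}
```

The practical consequence is that a matrix passed by value still aliases the caller's elements, so code that needs an independent matrix has to request the deep copy explicitly.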
The TCudaDeviceBuffer has an associated CUDA stream on which the data is transferred to the device. This stream can be accessed through the GetComputeStream member function and used to synchronize computations.
The TCudaMatrix class also holds static references to CUDA resources: the cuBLAS handle, a buffer of cuRAND states for random number generation, and a vector of ones, which is used for summing the columns of a matrix by matrix-vector multiplication. The class also has a static buffer for returning results from the device.
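The idea of summing columns with a vector of ones can be sketched on the host. The example below is not the class's implementation: plain loops stand in for a BLAS matrix-vector call, the matrix is assumed to be stored column-major, and GemvTransposed and SumColumnsSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = A^T * x for a column-major m x n matrix A.
// Plain loops stand in for a BLAS gemv call in this sketch.
inline std::vector<double> GemvTransposed(const std::vector<double> &A, std::size_t m,
                                          std::size_t n, const std::vector<double> &x) {
    assert(A.size() == m * n && x.size() == m);
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            y[j] += A[j * m + i] * x[i];
        }
    }
    return y;
}

// Sums each column of A by multiplying A^T with a vector of m ones.
inline std::vector<double> SumColumnsSketch(const std::vector<double> &A, std::size_t m,
                                            std::size_t n) {
    const std::vector<double> ones(m, 1.0);
    return GemvTransposed(A, m, n, ones);
}
```

Expressing the sum as a matrix-vector product lets a tuned BLAS routine do the reduction, which is presumably why a ready-made vector of ones is kept as a shared resource.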
Definition at line 105 of file CudaMatrix.h.
Public Member Functions | |
TCudaMatrix () | |
TCudaMatrix (const TCudaMatrix &)=default | |
TCudaMatrix (const TMatrixT< AFloat > &) | |
TCudaMatrix (size_t i, size_t j) | |
TCudaMatrix (TCudaDeviceBuffer< AFloat > buffer, size_t m, size_t n) | |
TCudaMatrix (TCudaMatrix &&)=default | |
~TCudaMatrix ()=default | |
cudaStream_t | GetComputeStream () const |
const cublasHandle_t & | GetCublasHandle () const |
AFloat * | GetDataPointer () |
const AFloat * | GetDataPointer () const |
TCudaDeviceBuffer< AFloat > | GetDeviceBuffer () const |
size_t | GetNcols () const |
size_t | GetNoElements () const |
size_t | GetNrows () const |
operator TMatrixT< AFloat > () const | |
Convert CUDA matrix to ROOT TMatrixT. | |
TCudaDeviceReference< AFloat > | operator() (size_t i, size_t j) const |
Access to elements of device matrices provided through TCudaDeviceReference class. | |
TCudaMatrix & | operator= (const TCudaMatrix &)=default |
TCudaMatrix & | operator= (TCudaMatrix &&)=default |
void | Print () const |
void | SetComputeStream (cudaStream_t stream) |
void | Synchronize (const TCudaMatrix &) const |
Blocking synchronization with the associated compute stream, if it's not the default stream. | |
void | Zero () |
Static Public Member Functions | |
static curandState_t * | GetCurandStatesPointer () |
static AFloat | GetDeviceReturn () |
Transfer the value in the device return buffer to the host. | |
static AFloat * | GetDeviceReturnPointer () |
Return device pointer to the device return buffer. | |
static size_t | GetNDim () |
static AFloat * | GetOnes () |
static void | ResetDeviceReturn (AFloat value=0.0) |
Set the return buffer on the device to the specified value. | |
Static Public Attributes | |
static Bool_t | gInitializeCurand |
Private Member Functions | |
void | InitializeCuda () |
Initializes all shared device resources and makes sure that a sufficient number of cuRAND states are allocated and initialized on the device, and that the ones vector used for summation over columns has the right size. | |
void | InitializeCurandStates () |
Private Attributes | |
TCudaDeviceBuffer< AFloat > | fElementBuffer |
size_t | fNCols |
size_t | fNRows |
Static Private Attributes | |
static cublasHandle_t | fCublasHandle |
static curandState_t * | fCurandStates |
static AFloat * | fDeviceReturn |
Buffer for kernel return values. | |
static size_t | fInstances |
Current number of matrix instances. | |
static size_t | fNCurandStates |
static size_t | fNOnes |
Current length of the ones vector. | |
static AFloat * | fOnes |
Vector used for summations of columns. | |
#include <TMVA/DNN/Architectures/Cuda/CudaMatrix.h>
TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | ) |
TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | size_t | i, |
size_t | j | ||
) |
TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | const TMatrixT< AFloat > & | ) |
TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | TCudaDeviceBuffer< AFloat > | buffer, |
size_t | m, | ||
size_t | n | ||
) |
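TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | const TCudaMatrix & | ) |
default |
TMVA::DNN::TCudaMatrix< AFloat >::TCudaMatrix | ( | TCudaMatrix && | ) |
default |
TMVA::DNN::TCudaMatrix< AFloat >::~TCudaMatrix | ( | ) |
default |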
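cudaStream_t TMVA::DNN::TCudaMatrix< AFloat >::GetComputeStream | ( | ) | const |
inline |
Definition at line 271 of file CudaMatrix.h.
const cublasHandle_t & TMVA::DNN::TCudaMatrix< AFloat >::GetCublasHandle | ( | ) | const |
inline |
Definition at line 168 of file CudaMatrix.h.
static curandState_t * TMVA::DNN::TCudaMatrix< AFloat >::GetCurandStatesPointer | ( | ) |
inlinestatic |
Definition at line 155 of file CudaMatrix.h.
AFloat * TMVA::DNN::TCudaMatrix< AFloat >::GetDataPointer | ( | ) |
inline |
Definition at line 167 of file CudaMatrix.h.
const AFloat * TMVA::DNN::TCudaMatrix< AFloat >::GetDataPointer | ( | ) | const |
inline |
Definition at line 166 of file CudaMatrix.h.
TCudaDeviceBuffer< AFloat > TMVA::DNN::TCudaMatrix< AFloat >::GetDeviceBuffer | ( | ) | const |
inline |
Definition at line 170 of file CudaMatrix.h.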
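static AFloat TMVA::DNN::TCudaMatrix< AFloat >::GetDeviceReturn | ( | ) |
inlinestatic |
Transfer the value in the device return buffer to the host.
This transfer is synchronous.
Definition at line 304 of file CudaMatrix.h.
static AFloat * TMVA::DNN::TCudaMatrix< AFloat >::GetDeviceReturnPointer | ( | ) |
inlinestatic |
Return device pointer to the device return buffer.
Definition at line 154 of file CudaMatrix.h.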
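size_t TMVA::DNN::TCudaMatrix< AFloat >::GetNcols | ( | ) | const |
inline |
Definition at line 163 of file CudaMatrix.h.
static size_t TMVA::DNN::TCudaMatrix< AFloat >::GetNDim | ( | ) |
inlinestatic |
Definition at line 161 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::GetNoElements | ( | ) | const |
inline |
Definition at line 164 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::GetNrows | ( | ) | const |
inline |
Definition at line 162 of file CudaMatrix.h.
static AFloat * TMVA::DNN::TCudaMatrix< AFloat >::GetOnes | ( | ) |
inlinestatic |
Definition at line 128 of file CudaMatrix.h.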
void TMVA::DNN::TCudaMatrix< AFloat >::InitializeCuda | ( | ) |
private |
Initializes all shared device resources and makes sure that a sufficient number of cuRAND states are allocated and initialized on the device, and that the ones vector used for summation over columns has the right size.
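Keeping a shared vector of ones at "the right size" can be sketched as a lazily grown static resource. This is a host-only illustration under stated assumptions: SharedOnesSketch and its members are hypothetical, and the grow-only policy (the vector is enlarged when a larger size is requested and never shrunk) is an assumption of the sketch rather than something the description above confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared resource: a single vector of ones used by all
// matrix instances, enlarged on demand.
class SharedOnesSketch {
public:
    // Ensures the shared vector holds at least n ones. It never shrinks
    // (an assumption of this sketch).
    static void EnsureSize(std::size_t n) {
        if (n > Ones().size()) {
            Ones().assign(n, 1.0);
        }
    }

    static std::size_t Size() { return Ones().size(); }

    static const double *Data() { return Ones().data(); }

private:
    // Function-local static gives one instance shared by all callers.
    static std::vector<double> &Ones() {
        static std::vector<double> ones;
        return ones;
    }
};
```

A constructor of a matrix-like class could call EnsureSize with its own dimension, so that the shared vector is always long enough for the largest matrix created so far.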
void TMVA::DNN::TCudaMatrix< AFloat >::InitializeCurandStates | ( | ) |
private |
TMVA::DNN::TCudaMatrix< AFloat >::operator TMatrixT< AFloat > | ( | ) | const |
Convert CUDA matrix to ROOT TMatrixT.
Performs synchronous data transfer.
TCudaDeviceReference< AFloat > TMVA::DNN::TCudaMatrix< AFloat >::operator() | ( | size_t | i, |
size_t | j | ||
) | const |
Access to elements of device matrices provided through TCudaDeviceReference class.
Note that access is synchronous and enforces device synchronization on all streams. Only used for testing.
Definition at line 313 of file CudaMatrix.h.
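Element access through a reference class follows the general proxy-object pattern: operator() returns a small object that knows where the element lives, and reads and writes go through that object. The host-only sketch below illustrates the pattern. It is not the TCudaDeviceReference implementation; ElementRefSketch and ProxyMatrixSketch are hypothetical names, the column-major layout is an assumption, and the plain pointer reads and writes stand in for what would be host-device transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical proxy for a single matrix element.
class ElementRefSketch {
public:
    explicit ElementRefSketch(double *element) : fElement(element) {}

    // Read access: stands in for a device-to-host copy.
    operator double() const { return *fElement; }

    // Write access: stands in for a host-to-device copy.
    void operator=(double value) { *fElement = value; }

    // Read-modify-write access.
    void operator+=(double value) { *fElement += value; }

private:
    double *fElement;
};

// Hypothetical matrix whose operator() hands out element proxies.
class ProxyMatrixSketch {
public:
    ProxyMatrixSketch(std::size_t m, std::size_t n)
        : fBuffer(std::make_shared<std::vector<double>>(m * n, 0.0)), fNRows(m), fNCols(n) {}

    // Const, like the documented operator(): the proxy refers to the shared
    // buffer, not to the matrix object itself.
    ElementRefSketch operator()(std::size_t i, std::size_t j) const {
        assert(i < fNRows && j < fNCols);
        return ElementRefSketch(fBuffer->data() + j * fNRows + i);
    }

private:
    std::shared_ptr<std::vector<double>> fBuffer;
    std::size_t fNRows;
    std::size_t fNCols;
};
```

With such a proxy, `m(1, 2) = 5.0;` and `double v = m(1, 2);` both compile against the same operator(). On a real device every such access would cost a transfer and a synchronization, which fits the note that this access path is meant for testing only.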
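TCudaMatrix & TMVA::DNN::TCudaMatrix< AFloat >::operator= | ( | const TCudaMatrix & | ) |
default |
TCudaMatrix & TMVA::DNN::TCudaMatrix< AFloat >::operator= | ( | TCudaMatrix && | ) |
default |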
void TMVA::DNN::TCudaMatrix< AFloat >::Print | ( | ) | const |
inline |
Definition at line 177 of file CudaMatrix.h.
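static void TMVA::DNN::TCudaMatrix< AFloat >::ResetDeviceReturn | ( | AFloat | value = 0.0 | ) |
inlinestatic |
Set the return buffer on the device to the specified value.
This is required, for example, by reductions, which must initialize the accumulator before accumulating into it.
Definition at line 296 of file CudaMatrix.h.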
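The reset, accumulate, read-back sequence for a reduction can be sketched on the host. The names ReturnSlotSketch, AccumulateInto and SumSketch are hypothetical; a static double stands in for the device return buffer and an ordinary function stands in for a reduction kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a shared return buffer holding one value.
class ReturnSlotSketch {
public:
    // Sets the slot to the given value (the accumulator's start value).
    static void Reset(double value = 0.0) { Slot() = value; }

    // Pointer that a reduction writes its result through.
    static double *Pointer() { return &Slot(); }

    // Reads the current value back.
    static double Get() { return Slot(); }

private:
    static double &Slot() {
        static double slot = 0.0;
        return slot;
    }
};

// Stand-in for a reduction kernel: adds every value onto *result without
// clearing it first, as an accumulating kernel would.
inline void AccumulateInto(const std::vector<double> &values, double *result) {
    assert(result != nullptr);
    for (double v : values) {
        *result += v;
    }
}

// Full sequence: reset the accumulator, run the reduction, read the result.
inline double SumSketch(const std::vector<double> &values) {
    ReturnSlotSketch::Reset(0.0);
    AccumulateInto(values, ReturnSlotSketch::Pointer());
    return ReturnSlotSketch::Get();
}
```

Because the slot is shared, skipping the reset would leave the previous result in the accumulator and the next reduction would add to it.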
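void TMVA::DNN::TCudaMatrix< AFloat >::SetComputeStream | ( | cudaStream_t | stream | ) |
inline |
Definition at line 278 of file CudaMatrix.h.
void TMVA::DNN::TCudaMatrix< AFloat >::Synchronize | ( | const TCudaMatrix & | ) | const |
inline |
Blocking synchronization with the associated compute stream, if it's not the default stream.
Definition at line 285 of file CudaMatrix.h.
void TMVA::DNN::TCudaMatrix< AFloat >::Zero | ( | ) |
inline |
Definition at line 182 of file CudaMatrix.h.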
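cublasHandle_t TMVA::DNN::TCudaMatrix< AFloat >::fCublasHandle |
staticprivate |
Definition at line 112 of file CudaMatrix.h.
curandState_t * TMVA::DNN::TCudaMatrix< AFloat >::fCurandStates |
staticprivate |
Definition at line 116 of file CudaMatrix.h.
AFloat * TMVA::DNN::TCudaMatrix< AFloat >::fDeviceReturn |
staticprivate |
Buffer for kernel return values.
Definition at line 113 of file CudaMatrix.h.
TCudaDeviceBuffer< AFloat > TMVA::DNN::TCudaMatrix< AFloat >::fElementBuffer |
private |
Definition at line 122 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::fInstances |
staticprivate |
Current number of matrix instances.
Definition at line 111 of file CudaMatrix.h.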
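size_t TMVA::DNN::TCudaMatrix< AFloat >::fNCols |
private |
Definition at line 121 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::fNCurandStates |
staticprivate |
Definition at line 117 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::fNOnes |
staticprivate |
Current length of the ones vector.
Definition at line 115 of file CudaMatrix.h.
size_t TMVA::DNN::TCudaMatrix< AFloat >::fNRows |
private |
Definition at line 120 of file CudaMatrix.h.
AFloat * TMVA::DNN::TCudaMatrix< AFloat >::fOnes |
staticprivate |
Vector used for summations of columns.
Definition at line 114 of file CudaMatrix.h.
Bool_t TMVA::DNN::TCudaMatrix< AFloat >::gInitializeCurand |
static |
Definition at line 126 of file CudaMatrix.h.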