"New Neural" is an Open Source neural network engine project for the MetaTrader 5 platform. - page 59

 

... And genetics is not part of the network, but a separate external algorithm.

People here want to parallelize networks.

 
One does not interfere with the other.
 
TheXpert:

... And genetics is not part of the network, but a separate external algorithm.

People here want to parallelize networks.

Write something that works first, and then parallelize it. CUDA code takes a long time to debug because of all the possible errors: you need to know how much memory to allocate for the various arrays, write the commands that load and unload those arrays from memory, synchronize the threads, and so on. Here is a chunk of CUDA code (about 1/20 of the total). Note that none of these commands are directly related to the network learning algorithm itself. It's all Greek to me.

#define SHARED_BUFFER_SIZE 512
#define MAX_DATA_SIZE 227
#define MAX_FILTER_33_SIZE 73
#define MAX_MASK_23_SIZE 27
#define MAX_MASK_34_SIZE 27
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
__constant__ float M33_Filter[ MAX_FILTER_33_SIZE ][ MAX_FILTER_33_SIZE ];
__constant__ float M23_Mask  [ MAX_MASK_23_SIZE ][ MAX_MASK_23_SIZE ];
__constant__ float M34_Mask  [ MAX_MASK_34_SIZE ][ MAX_MASK_34_SIZE ];

__shared__ float SharedBuf[ SHARED_BUFFER_SIZE ];

#define ALLIGN 32
#define ROUND_UP( val ) ( ( ( ( val ) + ALLIGN - 1 ) / ALLIGN ) * ALLIGN )

__host__ __device__ int Get2DIndex( int dataSize, int row, int col )
{
   int dataSizeP = ROUND_UP( dataSize );
   int idx = row * dataSizeP + col;
   //assert( idx >= 0 && idx < dataSize * dataSizeP );
   return idx;
}

__host__ __device__ float Get2D( float *data, int dataSize, int row, int col )
{
   int idx = Get2DIndex( dataSize, row, col );
   return data[ idx ];
}

__host__ __device__ void Set2D( float *data, int dataSize, int row, int col, float val )
{
   int idx = Get2DIndex( dataSize, row, col );
   data[ idx ] = val;
}

__host__ __device__ int Get4DIndex( int dataSize, int filtSize, int row, int col, int rowF, int colF )
{
   int dataSizeP = ROUND_UP( dataSize );
   int filtSizeP = ROUND_UP( filtSize );
   int idx;
   idx = row;
   idx = idx * filtSizeP + rowF;
   idx = idx * filtSizeP + colF;
   idx = idx * dataSizeP + col;
   //assert( idx >= 0 && idx < dataSize * dataSizeP * filtSizeP * filtSizeP );
   return idx;
}

__host__ __device__ float Get4D( float *filt, int dataSize, int filtSize, int row, int col, int rowF, int colF )
{
   int idx = Get4DIndex( dataSize, filtSize, row, col, rowF, colF );
   return filt[ idx ];
}

__host__ __device__ void Set4D( float *filt, int dataSize, int filtSize, int row, int col, int rowF, int colF, float val )
{
   int idx = Get4DIndex( dataSize, filtSize, row, col, rowF, colF );
   filt[ idx ] = val;
}

__global__ void Calc_1_kernel( float* o2, float* i3, float* fi3, float* o3, float* o3tmp, float *w23, int n2, int n3, int n23, int h23, int ntot23, float a3 )
{
   int numBlocks  = gridDim.x;
   int numThreads = blockDim.x;
   int blockId    = blockIdx.x;
   int threadId   = threadIdx.x;
   ///* DEBUG */for( int blockId = 0; blockId < numBlocks; ++blockId )
   {
      for( int i = blockId; i < n3; i += numBlocks )
      {
         ///* DEBUG */for( int threadId = 0; threadId < numThreads; ++threadId )
         {
            // clear output
            for( int j = threadId; j < n3; j += numThreads )
            {
               Set2D( fi3, n3, i, j, 0 );
            }
         }
         __syncthreads();
         // process 'n23' rows
         for( int dx = 0; dx < n23; ++dx )
         {
            int x;
            if( n2 == n3 )
            {
               x = i + dx - h23;
               if( x < 0 )   x += n2;
               if( x >= n2 ) x -= n2;
            }
            else
            {
               x = i + dx;
            }
            ///* DEBUG */for( int threadId = 0; threadId < numThreads; ++threadId )
            {
               // read one row of input data to shared memory ( row 'x' )
               int dj;
               if( n2 == n3 )
               {
                  dj = h23;
               }
               else
               {
                  dj = 0;
               }
               for( int jj = threadId; jj < n2; jj += numThreads )
               {
                  float o2Val = Get2D( o2, n2, x, jj );
                  SharedBuf[ jj + dj ] = o2Val;
               }
               if( n2 == n3 )
               {
                  for( int dj = threadId; dj < h23; dj += numThreads )
                  {
                     SharedBuf[ dj ] = SharedBuf[ n2 + dj ];
                  }
                  for( int dj = threadId; dj < h23 + 1; dj += numThreads )
                  {
                     SharedBuf[ n2 + h23 + dj ] = SharedBuf[ h23 + dj ];
                  }
               }
            }
            __syncthreads();
            ///* DEBUG */for( int threadId = 0; threadId < numThreads; ++threadId )
            {
               // filter one row
               for( int j = threadId; j < n3; j += numThreads )
               {
                  float fi3Val = Get2D( fi3, n3, i, j );
                  for( int dy = 0; dy < n23; ++dy )
                  {
                     float w23Val = Get4D( w23, n3, n23, i, j, dx, dy );
                     float o2Val  = SharedBuf[ j + dy ];
                     fi3Val += o2Val * w23Val;
                  }
                  Set2D( fi3, n3, i, j, fi3Val );
               }
            }
            __syncthreads();
         }
         __syncthreads();

 
gpwr:

Folks, if you decide to mess with the GPU, consider the following. It is easy to parallelize learning algorithms in which one iteration is independent of another: for example genetics, where the hundreds of networks in a generation are computed independently of each other, Manhattan (dumb enumeration of all solutions), or the optimization of independent networks in a committee. But there are few such algorithms. Methods based on gradient descent will be harder to parallelize, because one iteration depends on the previous one. In those cases you have to split the computation of the neurons of a single layer into parallel threads, which is not always trivial.

On the GPU it will only be possible to parallelize simple calculations within a single cycle, so it will be very difficult (and I doubt it would be effective) to implement the calculations of the different phases of genetics on the GPU.

The nature of GPU acceleration lies in the independence of the inputs from the outputs within a single iteration and in the uniformity of the operations (as gpwr noted above). Isn't that very similar to the definition of a layer that we collectively derived above? That's why I suggested treating the layer as the main working object, and the neuron object only as an informational entity attached to the layer.
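Just to visualize what "layer as the working object" could mean in CUDA terms, here is a minimal sketch (all names here are hypothetical, not from the project): each thread handles one neuron of the layer, which only works because the neurons of one layer do not depend on each other within an iteration and all perform the same operation.

// Hypothetical layer-level kernel: one thread = one neuron of the layer.
// prevOut - outputs of the previous layer (prevCount values)
// weights - neuronCount x prevCount weight matrix, row-major
// out     - outputs of this layer (neuronCount values)
__global__ void CalcLayerKernel( const float* prevOut, const float* weights,
                                 float* out, int prevCount, int neuronCount )
{
   int n = blockIdx.x * blockDim.x + threadIdx.x;
   if( n >= neuronCount ) return;

   float sum = 0.0f;
   for( int i = 0; i < prevCount; ++i )
      sum += weights[ n * prevCount + i ] * prevOut[ i ];

   out[ n ] = tanhf( sum );   // identical operation for every neuron of the layer
}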

 
gpwr:

Write something that works first, and then parallelize it. CUDA code takes a long time to debug because of all the possible errors: you need to know how much memory to allocate for the various arrays, write the commands that load and unload those arrays from memory, synchronize the threads, and so on. Here is a chunk of CUDA code (about 1/20 of the total). Note that none of these commands are directly related to the network learning algorithm itself. It's all Greek to me.

...

Such fragments could be marked right away, either by giving the functions names with a "GPU" suffix or by putting a comment at the start of the loop, just so people know: here is an opportunity to use the graphics processor.
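For example, the marking convention might look something like this (the function name and comments are purely illustrative, not project code):

#include <cmath>

// "GPU" suffix: candidate for offloading to the graphics processor
void CalcLayerOutputsGPU( const float* in, float* out, int count )
{
   // GPU: this loop is data-parallel, one thread per output element is possible
   for( int i = 0; i < count; ++i )
      out[ i ] = std::tanh( in[ i ] );
}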

 
gpwr:

Write something that works first, and then parallelize it. CUDA code takes a long time to debug because of all the possible errors: you need to know how much memory to allocate for the various arrays, write the commands that load and unload those arrays from memory, synchronize the threads, and so on. Here is a chunk of CUDA code (about 1/20 of the total). Note that none of these commands are directly related to the network learning algorithm itself. It's all Greek to me.

CUDA fully supports OOP.

Here is part of the code for the hyperbolic tangent layer:

#ifndef LAYER_TANH_H
#define LAYER_TANH_H

#ifndef CUDA
        #include <map>
        #include <boost/regex.hpp>
        extern std::map<std::string,int> mapObjects;
#endif
//---------------------------------------------------------------------------//
//                               CLayerTanh                                  //
//---------------------------------------------------------------------------//

namespace
#ifdef CUDA
        cuda
#else
        host
#endif
{

class CLayerTanh : public CLayerWSum
{
public:

#ifdef CUDA     
        __device__ static void run ( CLayer* layer )
        {
                CLayerTanh* L = static_cast<CLayerTanh*>(layer);
                if ( threadIdx.x < L->neuronCount )
                {
                        CLayerWSum::run(layer);
                        float * v = L->getNeuronPtr (threadIdx.x);
                        *v = tanh(*v);
                };
        }
#endif

};

} // namespace

#endif // LAYER_TANH_H
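To visualize how such a device-side class might be driven, here is only a sketch built on the fragment above: RunNetKernel and layerCount are invented, CLayer, CLayerTanh and the cuda namespace come from the code shown, and for simplicity it assumes every layer is a CLayerTanh.

using namespace cuda;   // device-side namespace from the fragment above

// Sketch only: the block walks the layers in order; inside run() each thread
// handles the neuron with index threadIdx.x, as in the fragment above.
__global__ void RunNetKernel( CLayer** layers, int layerCount )
{
   for( int i = 0; i < layerCount; ++i )
   {
      CLayerTanh::run( layers[ i ] );   // assumption: every layer is a CLayerTanh
      __syncthreads();                  // finish layer i before starting layer i+1
   }
}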

It's good that there is an OpenCL specialist here; a substantive dialogue and discussion of the technologies will benefit the whole project.

I am sure the neural networks will be available for all possible technologies (MQL5, C++, OpenCL, CUDA), and users will be able to choose according to their taste and hardware capabilities.

 

Guys, I won't be dropping in here very often. If you have questions or an interest in joint development, write to my Yahoo email (listed in my profile).

Good luck with the EngiNeuro project!

 

Options for submitting examples to the learning algorithm:

  • One at a time
  • In random groups
  • In sliding groups
  • All at once

Have I forgotten anything? (See the sketch after this post.)


I've prepared a project framework; give me a couple of days to tidy it up and I'll post it for discussion...
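To make the four options listed above concrete, here is a minimal sketch of how an engine might expose them; the enum, the function and the parameter names are invented for illustration and are not part of the project.

#include <cstdlib>

// Hypothetical presentation modes for feeding training examples.
enum EPresentation
{
   ONE_BY_ONE,      // one example at a time
   RANDOM_GROUPS,   // random mini-batches
   SLIDING_GROUPS,  // overlapping windows over the example list
   ALL_AT_ONCE      // the whole training set in one pass
};

// Illustrative selection of the next portion of examples to feed the trainer.
// Assumes 0 < groupSize < total.
void SelectExamples( EPresentation mode, int total, int groupSize, int step,
                     int &first, int &count )
{
   switch( mode )
   {
      case ONE_BY_ONE:     first = rand() % total;                     count = 1;         break;
      case RANDOM_GROUPS:  first = rand() % ( total - groupSize + 1 ); count = groupSize; break;
      case SLIDING_GROUPS: first = step % ( total - groupSize + 1 );   count = groupSize; break;
      case ALL_AT_ONCE:    first = 0;                                  count = total;     break;
   }
}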
 
Urain:

Options for submitting examples to the learning algorithm:

  • One at a time
  • In random groups
  • In sliding groups
  • All at once

Have I forgotten anything?


I've prepared a project framework; give me a couple of days to tidy it up and I'll post it for discussion...

One at a time, that's clear.

I don't understand the rest of the options. How can you submit all the examples for training at once? Or am I being dumb?

 
joo:

One at a time, that's clear.

I don't understand the rest of the options. How can you submit all the examples for training at once? Or am I being dumb?

Well, you yourself submit all the examples at once when you calculate the total value of the FF over all of them. Training algorithms differ: backprop, for example, feeds one example at a time in random order, but it cycles through the entire list several times. So why not hand the algorithm all the examples at once and let it decide, by its own logic, how to feed them to the network.

PS You're not dumb, you were just not thinking of the same thing I was; I'll lay it all out with explanations.

PPS It's just that right now I'm not quite ready to explain everything, I still need to sort it out for myself :)
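As an illustration of the difference being discussed (the function names are made up, this is not project code): backprop-style training shuffles the example list and feeds one example at a time over several epochs, whereas a fitness-function based algorithm can be handed the whole set at once and simply aggregate the error over it.

#include <algorithm>
#include <random>
#include <vector>

// Illustrative only: shuffled one-at-a-time presentation over several epochs,
// the way backprop typically consumes the training set.
void TrainOnline( std::vector<int> &exampleIds, int epochs,
                  void (*trainOne)( int id ) )
{
   std::mt19937 rng( 42 );                       // fixed seed for reproducibility
   for( int e = 0; e < epochs; ++e )
   {
      std::shuffle( exampleIds.begin(), exampleIds.end(), rng );
      for( int id : exampleIds )
         trainOne( id );                         // one example per step, random order
   }
}

// Illustrative only: the "all at once" case - one total FF value over every example.
double TotalFitness( const std::vector<int> &exampleIds,
                     double (*errorOn)( int id ) )
{
   double err = 0.0;
   for( int id : exampleIds )
      err += errorOn( id );
   return -err;                                  // one aggregate value for the whole set
}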