c - Shared memory initialized incorrectly from global memory in CUDA


I recently wrote a simple program; the kernel function is below:

  #define BLOCK_SIZE 16
  #define RADIOUS 7
  #define SM_SIZE BLOCK_SIZE + 2 * RADIOUS

  __global__ static void DarkChannelPriorCUDA(const float *r, size_t ldr,
                                              const float *g, size_t ldg,
                                              const float *b, size_t ldb,
                                              float *d, size_t ldd,
                                              int n, int m)
  {
      __shared__ float R[SM_SIZE][SM_SIZE];
      __shared__ float G[SM_SIZE][SM_SIZE];
      __shared__ float B[SM_SIZE][SM_SIZE];

      const int tidr = threadIdx.x;
      const int tidc = threadIdx.y;
      const int bidr = blockIdx.x * BLOCK_SIZE;
      const int bidc = blockIdx.y * BLOCK_SIZE;
      int i, j, tr, tc;

      /* Copy the tile (block plus halo) from global to shared memory. */
      for (i = 0; i < SM_SIZE; i += BLOCK_SIZE) {
          tr = bidr - RADIOUS + i + tidr;
          for (j = 0; j < SM_SIZE; j += BLOCK_SIZE) {
              tc = bidc - RADIOUS + j + tidc;
              if (tr < 0 || tc < 0 || tr >= n || tc >= m) {
                  R[i][j] = 1e20;
                  G[i][j] = 1e20;
                  B[i][j] = 1e20;
              } else {
                  R[i][j] = r[tr * ldr + tc];
                  G[i][j] = g[tr * ldg + tc];
                  B[i][j] = b[tr * ldb + tc];
              }
          }
      }
      __syncthreads();

      /* Minimum over the window that starts at (tidr, tidc) in the tile. */
      float result = 1e20;
      for (j = tidc; j <= tidc + 2 * RADIOUS; j++)
          for (i = tidr; i <= tidr + 2 * RADIOUS; i++) {
              result = result < R[i][j] ? result : R[i][j];
              result = result < G[i][j] ? result : G[i][j];
              result = result < B[i][j] ? result : B[i][j];
          }
      d[(tidr + bidr) * ldd + tidc + bidc] = result;
  }

This function takes R, G, and B as input, each a 2D matrix, and writes its output to a 2D matrix D of size N * M. Each element D[i][j] equals the minimum value across the R, G, and B matrices inside the (2 * RADIOUS + 1) * (2 * RADIOUS + 1) window centered at (i, j).
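To make the intended result concrete, here is a minimal host-side sketch of the same windowed minimum in plain C. It is only a reference for checking the kernel's output, not part of the original program. The function name `dark_channel_reference` and the helper `min2` are hypothetical, a row-major layout with `ld*` as row strides in elements is assumed, and out-of-range neighbors are skipped, which gives the same result as padding with 1e20 as long as every input value is below 1e20.

```c
#include <assert.h>
#include <stddef.h>

#define RADIOUS 7  /* same window radius as in the kernel */

/* Smaller of two floats (hypothetical helper). */
float min2(float a, float b) { return a < b ? a : b; }

/*
 * Host-side reference (hypothetical name): for every (i, j),
 * d[i][j] = minimum of r, g and b over the
 * (2*RADIOUS+1) x (2*RADIOUS+1) window centered at (i, j).
 * Neighbors outside the n x m matrix are skipped.
 * ldr, ldg, ldb, ldd are row strides in elements (assumed row-major).
 */
void dark_channel_reference(const float *r, size_t ldr,
                            const float *g, size_t ldg,
                            const float *b, size_t ldb,
                            float *d, size_t ldd, int n, int m)
{
    assert(n > 0 && m > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            float result = 1e20f;
            for (int wi = i - RADIOUS; wi <= i + RADIOUS; wi++) {
                if (wi < 0 || wi >= n) continue;      /* row outside matrix */
                for (int wj = j - RADIOUS; wj <= j + RADIOUS; wj++) {
                    if (wj < 0 || wj >= m) continue;  /* column outside matrix */
                    result = min2(result, r[(size_t)wi * ldr + (size_t)wj]);
                    result = min2(result, g[(size_t)wi * ldg + (size_t)wj]);
                    result = min2(result, b[(size_t)wi * ldb + (size_t)wj]);
                }
            }
            d[(size_t)i * ldd + (size_t)j] = result;
        }
    }
}
```

Comparing the kernel's D against this reference on a small input is one way to see exactly which elements come out wrong.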

To speed this up, I use shared memory to hold a small tile of values for each block. Each block has 16 * 16 threads, and each thread computes the result for a single element of matrix D. The shared memory therefore needs to store (BLOCK_SIZE + 2 * RADIOUS) * (BLOCK_SIZE + 2 * RADIOUS) elements of each of R, G, and B.
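As a quick check of the sizes involved, here is the arithmetic for BLOCK_SIZE = 16 and RADIOUS = 7 as defined in the kernel, assuming a 4-byte float:

```latex
\begin{align*}
\text{SM\_SIZE} &= 16 + 2 \cdot 7 = 30 \\
\text{elements per array} &= 30 \cdot 30 = 900 \\
\text{shared memory per block} &= 3 \cdot 900 \cdot 4\ \text{bytes} = 10800\ \text{bytes}
\end{align*}
```

A block has only 16 * 16 = 256 threads, so each thread has to load more than one element per array. That is why the load loops in the kernel step through the tile in increments of BLOCK_SIZE.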

But the result is wrong: the values in the shared-memory arrays R, G, and B differ from those in r, g, and b in global memory. It seems the data in global memory is never successfully copied into shared memory, and I cannot understand why this happens.


You should be aware that everything inside a __global__ function is executed by every thread. When you write:

  R[i][j] = r[tr * ldr + tc];
  G[i][j] = g[tr * ldg + tc];
  B[i][j] = b[tr * ldb + tc];

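different threads in the block overwrite the same [i][j] elements of R, G, and B, which are shared among all threads of the block. Because i and j take the same values in every thread, the whole block writes to the same few elements and the rest of the tile is never initialized. The shared-memory index needs the thread offsets as well, for example R[i + tidr][j + tidc], with a check that both indices stay below SM_SIZE.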
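The effect can be checked without a GPU. The sketch below is a host-side emulation in plain C, not CUDA code: it loops over every (tidr, tidc) pair of one block in place of threadIdx.x and threadIdx.y, and counts how many distinct cells of the shared tile get written. The function name `count_written_cells` and its flag are hypothetical. SM_SIZE is parenthesized here as a precaution, since the macro is used inside larger expressions.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 16
#define RADIOUS 7
#define SM_SIZE (BLOCK_SIZE + 2 * RADIOUS)

/*
 * Host-side emulation of one block's load phase (hypothetical helper).
 * use_thread_offset == 0: indexing from the question, R[i][j].
 * use_thread_offset != 0: one possible fix, R[i + tidr][j + tidc].
 * Returns the number of distinct shared-tile cells that were written.
 */
int count_written_cells(int use_thread_offset)
{
    unsigned char written[SM_SIZE][SM_SIZE];
    memset(written, 0, sizeof written);

    /* These two loops stand in for the 16 x 16 threads of a block. */
    for (int tidr = 0; tidr < BLOCK_SIZE; tidr++) {
        for (int tidc = 0; tidc < BLOCK_SIZE; tidc++) {
            /* Same strided loops as in the kernel. */
            for (int i = 0; i < SM_SIZE; i += BLOCK_SIZE) {
                for (int j = 0; j < SM_SIZE; j += BLOCK_SIZE) {
                    int si = use_thread_offset ? i + tidr : i;
                    int sj = use_thread_offset ? j + tidc : j;
                    /* Guard: the second pass must stay inside the tile. */
                    if (si < SM_SIZE && sj < SM_SIZE)
                        written[si][sj] = 1;
                }
            }
        }
    }

    int count = 0;
    for (int si = 0; si < SM_SIZE; si++)
        for (int sj = 0; sj < SM_SIZE; sj++)
            count += written[si][sj];
    return count;
}
```

With the indexing from the question, `count_written_cells(0)` returns 4, because only [0][0], [0][16], [16][0], and [16][16] are ever touched. With the thread offsets added, `count_written_cells(1)` returns 900, the full 30 * 30 tile. In the kernel itself, the corresponding change would be to write to R[i + tidr][j + tidc] (and likewise for G and B) inside the same bounds guard. The global indices tr and tc already include tidr and tidc, so they would stay as they are.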

