<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for The Adventures of a Systems Engineer</title>
	<atom:link href="http://jamesdevine.info/index.php/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://jamesdevine.info</link>
	<description>James Devine&#039;s Website</description>
	<lastBuildDate>Sun, 26 Jun 2011 11:07:41 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by Bizz</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-2833</link>
		<dc:creator>Bizz</dc:creator>
		<pubDate>Sun, 26 Jun 2011 11:07:41 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-2833</guid>
		<description>Right where the font turns bold, there&#039;s an invisible &quot;less than&quot; sign.</description>
		<content:encoded><![CDATA[<p>Right where the font turns bold, there&#8217;s an invisible &#8220;less than&#8221; sign.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by Bizz</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-2832</link>
		<dc:creator>Bizz</dc:creator>
		<pubDate>Sun, 26 Jun 2011 11:04:43 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-2832</guid>
		<description>Wow! It was almost two years ago! I&#039;ve completed that task; it turned out some algorithms just don&#039;t belong in the parallel processing world! However, I changed it a bit and then parallelized it (or whatever the word is!), and here is the code:
(I used bcc5.5 with CUDA SDK 3)

cls
nvcc -run --use_fast_math -arch=compute_11 mergec.cu --output-file fileName.exe

//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//------------------------------------    Kernel CODE     --------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------



#ifndef _MERGEK_CU_
#define _MERGEK_CU_
#include &quot;device_functions.h&quot;
#define BLIND_UNROLL_COUNT 255
#define T (*((unsigned short*) (&amp;Shared_Ptrs)))
#define M (*(((unsigned short*) (&amp;Shared_Ptrs)+1)))
#define F (*((unsigned char*) (&amp;composite)))
#define SIDE (*(((unsigned char*) (&amp;composite)+1)))
#define J1 (*(((unsigned char*) (&amp;composite)+2)))
#define J2 (*(((unsigned char*) (&amp;composite)+3)))

//A: input array from host
//B: (shared memory) calculation area
//C: output array to host

//	 I didn&#039;t use binary search for the searching part, because the algorithm has been changed:
//	 the &quot;binary search&quot; part has been replaced by a &quot;pairwise swap sort&quot; (MergeSortK2Pairs)
//	 and an improved version of &quot;sequential search&quot; written specially for CUDA (MergeSortK2Seqns),
//	 because finding the exact location of each item is easier once the &quot;pairwise swap sort&quot; has already been applied.
//	 The algorithm could still be improved by adding a CUDA-friendly search method that mixes
//	 binary search (to be faster) and sequential search (to protect the coalescing of reads and writes)


//The output of this function is the full array, sorted only within separate 512-element subarrays
//There must be an optimum value for this size, but I just used 512
//This function has almost no conditional branches. The &quot;for&quot; loops are fully unrolled and most
//of the &quot;if&quot;s here also don&#039;t count, because within each warp the program runs through only one of them.
//After that, the second and third functions run one after another in a loop
//until all the array elements are sorted.
__global__ static void MergeSortK1(int* A, unsigned int blockSize)
{
	extern __shared__ int B[];
	unsigned int Shared_Ptrs;//contains T and M (T is current location, M is T+L)
	unsigned int composite = 0;//contains block offset(=F), J1 and J2
	int val1,val2;//temp
	//Fill B from A
	//to prevent conflicts and also to keep the coalescing, two sides are exactly one blockSize apart
	val1 = (blockIdx.x+blockIdx.y*gridDim.x)*(blockSize&lt;&lt;1)+threadIdx.x;
	B[threadIdx.x] = A[val1];
	__syncthreads();
	B[threadIdx.x+blockSize] = A[val1 + blockSize];
	__syncthreads();
	//Sort B in segments of 512
	//L is the distance between T and its M, and L*2 is the distance between T and next T
	//L=1
	//if B[T] and B[M] are not in an incremental order then swap them
	if(B[threadIdx.x&lt;&lt;1]&gt;B[1+(threadIdx.x&lt;&lt;1)])
	{
  		B[threadIdx.x&lt;&lt;1] ^= B[1+(threadIdx.x&lt;&lt;1)];
		B[1+(threadIdx.x&lt;&lt;1)] ^= B[threadIdx.x&lt;&lt;1];
  		B[threadIdx.x&lt;&lt;1] ^= B[1+(threadIdx.x&lt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	M = T+1;
	if ((T&amp;1)==1 &amp;&amp; B[M]&lt;B&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)//nothing happens if j=0. so the loop starts from j=1
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=8
	#define _L 8
	#define _l 7
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=16
	#define _L 16
	#define _l 15
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=32
	#define _L 32
	#define _l 31
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=64
	#define _L 64
	#define _l 63
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
		if (J1!=0 &amp;&amp; J2!=0) break;//from now on we have this condition-break in loops which reduces time to 219 from 255
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=128
	#define _L 128
	#define _l 127
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
		if (J1!=0 &amp;&amp; J2!=0) break;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	__syncthreads();
	//L=256
	#define _L 256
	#define _l 255
	composite = 0;
	T = (threadIdx.x&amp;-_L) + threadIdx.x;
	M = T + _L;
	F = T &amp; _l;
	if(B[T]&gt;B[M])
	{
  		B[T] ^= B[M];
		B[M] ^= B[T];
  		B[T] ^= B[M];
	}
	__syncthreads();
	val1 = B[T];
	val2 = B[M];
	#pragma unroll _L
	for (int j=1; j&lt;_L; j++)
	{
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;
		if (J2==0 &amp;&amp; j=B[M]) J2=j;
		if (J1!=0 &amp;&amp; J2!=0) break;
	}
	__syncthreads();
	if (J1&gt;0) B[T+F-J1+1] = val1;
	if (J2&gt;0) B[T+F+J2] = val2;
	//Update A from B
	__syncthreads();
	val1 = (blockIdx.x+blockIdx.y*gridDim.x)*(blockSize&lt;&lt;1)+threadIdx.x;
	A[val1] = B[threadIdx.x];
	__syncthreads();
	A[val1 + blockSize] = B[threadIdx.x+blockSize];
	__syncthreads();
}


//A and C cannot be a single array: because of the parallel limit of 3584 (256 here) threads,
//the original values of A are needed for the next 256 threads (512 elements), so sorting must not
//be done directly in A. Therefore A is treated as read-only until all threads have completed one level.
//This kernel performs one level of pairwise sort (or you could call it a dual swap sort)
__global__ static void MergeSortK2Pairs(int* A, unsigned int blockSize, unsigned int L)
{
	unsigned int gid = (threadIdx.x+blockSize*(blockIdx.x+blockIdx.y*gridDim.x));
	unsigned int t = ((gid&amp;(-L))&lt;&lt;1) +(gid&amp;(L-1));//so this kernel needs only half as many threads as there are elements
	int temp = A[t+L];
	if (temp&lt;A[t])
	{
		A[t+L] = A[t];
		A[t] = temp;
	}
	//This Method is much slower:
	//atomicMin(&amp;A[t],atomicMax(&amp;A[t+L],A[t]));
}
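
//A minimal sketch (not part of the original kernels) of how MergeSortK2Pairs maps a thread id to
//its pair of elements. The helper name pair_index is hypothetical. It assumes L is a power of two
//and rewrites the bit-mask expression for t with division and modulo, so that thread gid compares
//A[pair_index(gid,L)] with A[pair_index(gid,L)+L].

```c
/* Hypothetical host-side helper in plain C, so the mapping can be checked without a GPU. */
static unsigned int pair_index(unsigned int gid, unsigned int L)
{
	/* (gid / L) selects the pair of subarrays, each pair spanning 2*L elements;
	   gid % L is the offset inside the left subarray of that pair. */
	return (gid / L) * 2u * L + gid % L;
}
```

//For example, with L=2 the threads 0..3 get t = 0, 1, 4, 5, so they compare (0,2), (1,3), (4,6), (5,7)
//and every element is touched by exactly one thread.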

//this kernel performs one level of Sequential Sort (Loop search through the memory) one sided
__global__ static void MergeSortK2Seqns(int* A, int* C, unsigned int blockSize, unsigned int L)
{
	unsigned int gid = (threadIdx.x+blockSize*(blockIdx.x+blockIdx.y*gridDim.x));
	int cur = A[gid];//current value
	unsigned int t = gid &amp; (L-1);//index in current subarray
	unsigned int s;//start of other subarray
	unsigned int h;//start of block (for left side) or end of block (for right side) which have to be searched
	unsigned int f;//number of blocks to subarray&#039;s starting (ending) point from current block starting (ending) point
	unsigned int j;//simple counter
	int J;//correct offset relative to gid
	if ((gid&amp;L)==0)
	{
		s = (gid+L)&amp;(-L);
		h = (gid+L)&amp;(-blockSize);
		f = (h-s)/blockSize;
		//sort
		if (A[h]=h; j--)
			{
				//search current block from t to start
				if (A[j]cur)
			{
				//target is less than every element in the other side
				J = -1;
			}
			else
			{
				//target is in range (s..h)
				for (j=0; j&lt;f; j++)
				{
					//quickly search previous blocks in the subarray
					h-=blockSize;
					if (A[h]&lt;=cur) break;
				}
				//target is in the block with starting point &#039;h&#039;
				J = h+blockSize-1;
				for (j=0; j&lt;blockSize; j++)
				{
					if(A[J]=cur)
		{
			//target is in current block
			J = L-t;
			for (j=gid-L; j=cur) break;
				J--;
			}
		}
		else
		{
			//don&#039;t search current block, instead search one other block entirely
			if (A[s]&lt;cur)
			{
				//target is more than every element in the other side
				J = 0;
			}
			else
			{
				//target is in range (h..s)
				for (j=0; j=cur) break;
				}
				//target is in the block with ending point &#039;h&#039;
				J = h-blockSize+1;
				for (j=0; j=cur) break;
					J++;
				}
				J = s-J+1;
			}
		}
		//place the cur value in the correct index of C
		C[gid-J] = cur;
	}
	__syncthreads();
}

#endif

//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//------------------------------------    CPP CODE     -----------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------

#include 
#include 
#include 
#include 
#include &quot;mergeK.cu&quot;
#include 

//Forward declaration
void SetDimensions();
void DeclareDimensions();
void StartKernel();
bool VerifyResults();
void MakeArraysReady();
void InitializeProg();
void FinalizeProg(bool passed);
int main ();

//Cuda-enabled device which is in use
cudaDeviceProp prop;
//grid and block size for three kernels&#039; exec config
dim3 halfGrid, grid, block;
//Number of elements to sort
unsigned int NUM;
//Arrays
int *host_input_arr;
int *dev_arr1;
int *dev_arr2;
int *host_output_arr;
//Reporting
clock_t start, end;
cudaError_t err;

void SetDimensions()
{
	//Set grid and block sizes for all three Kernels&#039; execution configuration
	//blockSize = warpSize * number of blocks per multiprocessor
	block.x = 256;
	block.y = 1;
	block.z = 1;

	halfGrid.x = 16384;//lower this value if gpu terminated the program
	halfGrid.y = 4;
	halfGrid.z = 1;

	grid.x = halfGrid.x*2;
	grid.y = halfGrid.y;
	grid.z = 1;

	NUM = block.x*grid.x*grid.y;
}
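
//A rough sketch of the size arithmetic behind SetDimensions, added as an aid and not taken from
//the original program. The helper names total_elements and device_bytes are hypothetical, and the
//byte count assumes the two device arrays dev_arr1 and dev_arr2 are the only large allocations.

```c
/* Hypothetical helper: NUM = block.x * grid.x * grid.y, where grid.x is twice halfGrid.x. */
static unsigned int total_elements(unsigned int block_x, unsigned int half_grid_x, unsigned int half_grid_y)
{
	return block_x * (half_grid_x * 2u) * half_grid_y;
}

/* Hypothetical helper: bytes needed for two device arrays of num ints each. */
static unsigned long device_bytes(unsigned int num)
{
	return 2ul * num * sizeof(int);
}
```

//With the values used here (256, 16384, 4) this gives NUM = 33554432 = 2^25 elements and, assuming
//4-byte ints, 268435456 bytes (256 MiB) for the two device arrays, which is worth checking against
//the memory of the card before raising halfGrid.x.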
void MakeArraysReady()
{
	//Input data is loaded into host_input_arr, then copied to dev_arr.
	//And after the execution of the Kernel,
	//the result will be copied back to host_output_arr
	host_input_arr = new int[NUM];
	host_output_arr = new int[NUM];
	//Fill host_input_arr with random data
	for(int i=0; i&lt;NUM; i++) 
	{
		host_input_arr[i] = rand()%10;
	}
	//Allocate dev_arr&#039;s on the device memory (Global)
	cutilSafeCall(cudaMalloc((void**)&amp;dev_arr1, sizeof(int)*NUM));
	cutilSafeCall(cudaMalloc((void**)&amp;dev_arr2, sizeof(int)*NUM));
	//Copy host_input_arr into dev_arr
	cutilSafeCall(cudaMemcpy(dev_arr1, host_input_arr, sizeof(int)*NUM, cudaMemcpyHostToDevice));
}

void StartKernel()
{
	printf(&quot;Working...\n&quot;);
	start = clock();
	//the following kernel returns an array whose elements are sorted within subarrays of 512 (=2*block.x) elements
	MergeSortK1&lt;&lt;&lt;halfGrid, block, 2*block.x*sizeof(int)&gt;&gt;&gt;(dev_arr1,block.x);
	//Wait until Kernel is done
	cudaThreadSynchronize();
	int SubArrayHalfLength = block.x&lt;&lt;1;
	int counter = 0;
	while (SubArrayHalfLength&lt;NUM)
	{
		if (counter%2==0)
		{
			//each thread of the following kernel swap-sorts two memory addresses: A[T],A[T+512] then A[T],A[T+1024]...
			MergeSortK2Pairs&lt;&lt;&lt;halfGrid, block&gt;&gt;&gt;(dev_arr1,block.x,SubArrayHalfLength);//no shared size
			//Wait until Kernel is done
			cudaThreadSynchronize();
			//this kernel reads one element and finds its position (subarray size starts from 2x512)
			MergeSortK2Seqns&lt;&lt;&lt;grid, block&gt;&gt;&gt;(dev_arr1,dev_arr2,block.x,SubArrayHalfLength);
			//Wait until Kernel is done
			cudaThreadSynchronize();
		}
		else
		{
			MergeSortK2Pairs&lt;&lt;&lt;halfGrid, block&gt;&gt;&gt;(dev_arr2,block.x,SubArrayHalfLength);//no shared size
			//Wait until Kernel is done
			cudaThreadSynchronize();
			MergeSortK2Seqns&lt;&lt;&lt;grid, block&gt;&gt;&gt;(dev_arr2,dev_arr1,block.x,SubArrayHalfLength);
			//Wait until Kernel is done
			cudaThreadSynchronize();
		}
		SubArrayHalfLength&lt;1)
	{
		printf(&quot;\n%i CUDA-Enabled devices found.\nFirst device is used.\n&quot;,num_devices);
	}
	//Set active device to dev#0 and read its features and properties
	cudaSetDevice(0);
	err = cudaGetDeviceProperties(&amp;prop, 0);
	//Show GPU device features on the screen
	printf(&quot;-----------------------------------------------------------------\n&quot;);
	printf(&quot;&gt; GPU Basic Info  GPU Memory Sizes  Hierarchy Sizes  Input Data &lt;\n&quot;);
	printf(&quot;-----------------------------------------------------------------\n&quot;);
	printf(&quot;Number of input data: %i\n&quot;,NUM);
	printf(&quot;Input data type:      int\n&quot;);
	printf(&quot;Input data pattern:   Random 0..9\n&quot;);
	printf(&quot;Grid Size:            %i\n&quot;,grid.x);
	printf(&quot;Block Size:           %i x %i\n&quot;,block.x,block.y);
	printf(&quot;-----------------------------------------------------------------\n&quot;);

}
bool VerifyResults()
{
	for (unsigned int i=1; i&lt;NUM; i++) if (host_output_arr[i]&lt;host_output_arr[i-1])
			return false;
	return true;
}
void FinalizeProg(bool passed)
{
	//Print the outcome
	printf(&quot;\n-----------------------------------------------------------------\n&quot;);
	printf(&quot;%s\n&quot;, passed ? &quot;Pass!&quot; : &quot;Fail!&quot;);
	printf(&quot;NUM:\t\t2^%i =\t%i\nGridSize:\t2^%i x\t2^%i\n&quot;, (int)log2((double)NUM), NUM, (int)log2((double)grid.x), (int)log2((double)grid.y));
	printf(&quot;Time:\t\t%i ms\n&quot;, (int)((end-start)*1000/CLOCKS_PER_SEC));
	printf(&quot;-----------------------------------------------------------------\n&quot;);
	printf(&quot;Press ENTER to exit...\n&quot;);
	fflush(stdout);
	fflush(stderr);
	getchar();
	exit(EXIT_SUCCESS);
}</description>
		<content:encoded><![CDATA[<p>Wow! It was almost two years ago! I&#8217;ve completed that task; it turned out some algorithms just don&#8217;t belong in the parallel processing world! However, I changed it a bit and then parallelized it (or whatever the word is!), and here is the code:<br />
(I used bcc5.5 with CUDA SDK 3)</p>
<p>cls<br />
nvcc -run --use_fast_math -arch=compute_11 mergec.cu --output-file fileName.exe</p>
<p>//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//------------------------------------    Kernel CODE     --------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------</p>
<p>#ifndef _MERGEK_CU_<br />
#define _MERGEK_CU_<br />
#include &quot;device_functions.h&quot;<br />
#define BLIND_UNROLL_COUNT 255<br />
#define T (*((unsigned short*) (&amp;Shared_Ptrs)))<br />
#define M (*(((unsigned short*) (&amp;Shared_Ptrs)+1)))<br />
#define F (*((unsigned char*) (&amp;composite)))<br />
#define SIDE (*(((unsigned char*) (&amp;composite)+1)))<br />
#define J1 (*(((unsigned char*) (&amp;composite)+2)))<br />
#define J2 (*(((unsigned char*) (&amp;composite)+3)))</p>
<p>//A: input array from host<br />
//B: (shared memory) calculation area<br />
//C: output array to host</p>
<p>//	 I didn&#8217;t use binary search for the searching part, because the algorithm has been changed:<br />
//	 the &#8220;binary search&#8221; part has been replaced by a &#8220;pairwise swap sort&#8221; (MergeSortK2Pairs)<br />
//	 and an improved version of &#8220;sequential search&#8221; written specially for CUDA (MergeSortK2Seqns),<br />
//	 because finding the exact location of each item is easier once the &#8220;pairwise swap sort&#8221; has already been applied.<br />
//	 The algorithm could still be improved by adding a CUDA-friendly search method that mixes<br />
//	 binary search (to be faster) and sequential search (to protect the coalescing of reads and writes)</p>
<p>//The output of this function is the full array, sorted only within separate 512-element subarrays<br />
//There must be an optimum value for this size, but I just used 512<br />
//This function has almost no conditional branches. The &#8220;for&#8221; loops are fully unrolled and most<br />
//of the &#8220;if&#8221;s here also don&#8217;t count, because within each warp the program runs through only one of them.<br />
//After that, the second and third functions run one after another in a loop<br />
//until all the array elements are sorted.<br />
__global__ static void MergeSortK1(int* A, unsigned int blockSize)<br />
{<br />
	extern __shared__ int B[];<br />
	unsigned int Shared_Ptrs;//contains T and M (T is current location, M is T+L)<br />
	unsigned int composite = 0;//contains block offset(=F), J1 and J2<br />
	int val1,val2;//temp<br />
	//Fill B from A<br />
	//to prevent conflicts and also to keep the coalescing, two sides are exactly one blockSize apart<br />
	val1 = (blockIdx.x+blockIdx.y*gridDim.x)*(blockSize&lt;&lt;1)+threadIdx.x;<br />
	B[threadIdx.x] = A[val1];<br />
	__syncthreads();<br />
	B[threadIdx.x+blockSize] = A[val1 + blockSize];<br />
	__syncthreads();<br />
	//Sort B in segments of 512<br />
	//L is the distance between T and its M, and L*2 is the distance between T and next T<br />
	//L=1<br />
	//if B[T] and B[M] are not in an incremental order then swap them<br />
	if(B[threadIdx.x&lt;&lt;1]&gt;B[1+(threadIdx.x&lt;&lt;1)])<br />
	{<br />
  		B[threadIdx.x&lt;&lt;1] ^= B[1+(threadIdx.x&lt;&lt;1)];<br />
		B[1+(threadIdx.x&lt;&lt;1)] ^= B[threadIdx.x&lt;&lt;1];<br />
  		B[threadIdx.x&lt;&lt;1] ^= B[1+(threadIdx.x&lt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	M = T+1;<br />
	if ((T&amp;1)==1 &amp;&amp; B[M]<b>B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)//nothing happens if j=0. so the loop starts from j=1<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=8<br />
	#define _L 8<br />
	#define _l 7<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=16<br />
	#define _L 16<br />
	#define _l 15<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=32<br />
	#define _L 32<br />
	#define _l 31<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=64<br />
	#define _L 64<br />
	#define _l 63<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
		if (J1!=0 &amp;&amp; J2!=0) break;//from now on we have this condition-break in loops which reduces time to 219 from 255<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=128<br />
	#define _L 128<br />
	#define _l 127<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
		if (J1!=0 &amp;&amp; J2!=0) break;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	__syncthreads();<br />
	//L=256<br />
	#define _L 256<br />
	#define _l 255<br />
	composite = 0;<br />
	T = (threadIdx.x&amp;-_L) + threadIdx.x;<br />
	M = T + _L;<br />
	F = T &amp; _l;<br />
	if(B[T]&gt;B[M])<br />
	{<br />
  		B[T] ^= B[M];<br />
		B[M] ^= B[T];<br />
  		B[T] ^= B[M];<br />
	}<br />
	__syncthreads();<br />
	val1 = B[T];<br />
	val2 = B[M];<br />
	#pragma unroll _L<br />
	for (int j=1; j&lt;_L; j++)<br />
	{<br />
		if (J1==0 &amp;&amp; j&lt;=F   &amp;&amp; B[M-j]&lt;=B[T]) J1=j;<br />
		if (J2==0 &amp;&amp; j=B[M]) J2=j;<br />
		if (J1!=0 &amp;&amp; J2!=0) break;<br />
	}<br />
	__syncthreads();<br />
	if (J1&gt;0) B[T+F-J1+1] = val1;<br />
	if (J2&gt;0) B[T+F+J2] = val2;<br />
	//Update A from B<br />
	__syncthreads();<br />
	val1 = (blockIdx.x+blockIdx.y*gridDim.x)*(blockSize&lt;&lt;1)+threadIdx.x;<br />
	A[val1] = B[threadIdx.x];<br />
	__syncthreads();<br />
	A[val1 + blockSize] = B[threadIdx.x+blockSize];<br />
	__syncthreads();<br />
}</p>
<p>//A and C cannot be a single array: because of the parallel limit of 3584 (256 here) threads,<br />
//the original values of A are needed for the next 256 threads (512 elements), so sorting must not<br />
//be done directly in A. Therefore A is treated as read-only until all threads have completed one level.<br />
//This kernel performs one level of pairwise sort (or you could call it a dual swap sort)<br />
__global__ static void MergeSortK2Pairs(int* A, unsigned int blockSize, unsigned int L)<br />
{<br />
	unsigned int gid = (threadIdx.x+blockSize*(blockIdx.x+blockIdx.y*gridDim.x));<br />
	unsigned int t = ((gid&amp;(-L))&lt;&lt;1) +(gid&amp;(L-1));//so this kernel needs only half as many threads as there are elements<br />
	int temp = A[t+L];<br />
	if (temp&lt;A[t])<br />
	{<br />
		A[t+L] = A[t];<br />
		A[t] = temp;<br />
	}<br />
	//This Method is much slower:<br />
	//atomicMin(&amp;A[t],atomicMax(&amp;A[t+L],A[t]));<br />
}</p>
<p>//this kernel performs one level of Sequential Sort (Loop search through the memory) one sided<br />
__global__ static void MergeSortK2Seqns(int* A, int* C, unsigned int blockSize, unsigned int L)<br />
{<br />
	unsigned int gid = (threadIdx.x+blockSize*(blockIdx.x+blockIdx.y*gridDim.x));<br />
	int cur = A[gid];//current value<br />
	unsigned int t = gid &amp; (L-1);//index in current subarray<br />
	unsigned int s;//start of other subarray<br />
	unsigned int h;//start of block (for left side) or end of block (for right side) which have to be searched<br />
	unsigned int f;//number of blocks to subarray&#039;s starting (ending) point from current block starting (ending) point<br />
	unsigned int j;//simple counter<br />
	int J;//correct offset relative to gid<br />
	if ((gid&amp;L)==0)<br />
	{<br />
		s = (gid+L)&amp;(-L);<br />
		h = (gid+L)&amp;(-blockSize);<br />
		f = (h-s)/blockSize;<br />
		//sort<br />
		if (A[h]=h; j&#8211;)<br />
			{<br />
				//search current block from t to start<br />
				if (A[j]cur)<br />
			{<br />
				//target is less than every element in the other side<br />
				J = -1;<br />
			}<br />
			else<br />
			{<br />
				//target is in range (s..h)<br />
				for (j=0; j&lt;f; j++)<br />
				{<br />
					//quickly search previous blocks in the subarray<br />
					h-=blockSize;<br />
					if (A[h]&lt;=cur) break;<br />
				}<br />
				//target is in the block with starting point &#039;h&#039;<br />
				J = h+blockSize-1;<br />
				for (j=0; j&lt;blockSize; j++)<br />
				{<br />
					if(A[J]=cur)<br />
		{<br />
			//target is in current block<br />
			J = L-t;<br />
			for (j=gid-L; j=cur) break;<br />
				J--;<br />
			}<br />
		}<br />
		else<br />
		{<br />
			//don&#8217;t search current block, instead search one other block entirely<br />
			if (A[s]&lt;cur)<br />
			{<br />
				//target is more than every element in the other side<br />
				J = 0;<br />
			}<br />
			else<br />
			{<br />
				//target is in range (h..s)<br />
				for (j=0; j=cur) break;<br />
				}<br />
				//target is in the block with ending point &#039;h&#039;<br />
				J = h-blockSize+1;<br />
				for (j=0; j=cur) break;<br />
					J++;<br />
				}<br />
				J = s-J+1;<br />
			}<br />
		}<br />
		//place the cur value in the correct index of C<br />
		C[gid-J] = cur;<br />
	}<br />
	__syncthreads();<br />
}</p>
<p>#endif</p>
<p>//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//------------------------------------    CPP CODE     -----------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------<br />
//----------------------------------------------------------------------------------------------</p>
<p>#include<br />
#include<br />
#include<br />
#include<br />
#include &quot;mergeK.cu&quot;<br />
#include </p>
<p>//Forward declaration<br />
void SetDimensions();<br />
void DeclareDimensions();<br />
void StartKernel();<br />
bool VerifyResults();<br />
void MakeArraysReady();<br />
void InitializeProg();<br />
void FinalizeProg(bool passed);<br />
int main ();</p>
<p>//Cuda-enabled device which is in use<br />
cudaDeviceProp prop;<br />
//grid and block size for three kernels&#8217; exec config<br />
dim3 halfGrid, grid, block;<br />
//Number of elements to sort<br />
unsigned int NUM;<br />
//Arrays<br />
int *host_input_arr;<br />
int *dev_arr1;<br />
int *dev_arr2;<br />
int *host_output_arr;<br />
//Reporting<br />
clock_t start, end;<br />
cudaError_t err;</p>
<p>void SetDimensions()<br />
{<br />
	//Set grid and block sizes for all three Kernels&#8217; execution configuration<br />
	//blockSize = warpSize * number of blocks per multiprocessor<br />
	block.x = 256;<br />
	block.y = 1;<br />
	block.z = 1;</p>
<p>	halfGrid.x = 16384;//lower this value if gpu terminated the program<br />
	halfGrid.y = 4;<br />
	halfGrid.z = 1;</p>
<p>	grid.x = halfGrid.x*2;<br />
	grid.y = halfGrid.y;<br />
	grid.z = 1;</p>
<p>	NUM = block.x*grid.x*grid.y;<br />
}<br />
void MakeArraysReady()<br />
{<br />
	//Input data is loaded into host_input_arr, then copied to dev_arr.<br />
	//And after the execution of the Kernel,<br />
	//the result will be copied back to host_output_arr<br />
	host_input_arr = new int[NUM];<br />
	host_output_arr = new int[NUM];<br />
	//Fill host_input_arr with random data<br />
	for(int i=0; i&lt;NUM; i++)<br />
	{<br />
		host_input_arr[i] = rand()%10;<br />
	}<br />
	//Allocate dev_arr&#039;s on the device memory (Global)<br />
	cutilSafeCall(cudaMalloc((void**)&amp;dev_arr1, sizeof(int)*NUM));<br />
	cutilSafeCall(cudaMalloc((void**)&amp;dev_arr2, sizeof(int)*NUM));<br />
	//Copy host_input_arr into dev_arr<br />
	cutilSafeCall(cudaMemcpy(dev_arr1, host_input_arr, sizeof(int)*NUM, cudaMemcpyHostToDevice));<br />
}</p>
<p>void StartKernel()<br />
{<br />
	printf(&quot;Working...\n&quot;);<br />
	start = clock();<br />
	//the following kernel returns an array whose elements are sorted within subarrays of 512 (=2*block.x) elements<br />
	MergeSortK1&lt;&lt;&gt;&gt;(dev_arr1,block.x);<br />
	//Wait until Kernel is done<br />
	cudaThreadSynchronize();<br />
	int SubArrayHalfLength = block.x&lt;&lt;1;<br />
	int counter = 0;<br />
	while (SubArrayHalfLength&lt;NUM)<br />
	{<br />
		if (counter%2==0)<br />
		{<br />
			//each thread of the following kernel, swap-sorts two memory addresses: A[T],A[T+512] then A[T],A[T+1024]&#8230;<br />
			MergeSortK2Pairs&lt;&lt;&gt;&gt;(dev_arr1,block.x,SubArrayHalfLength);//no shared size<br />
			//Wait until Kernel is done<br />
			cudaThreadSynchronize();<br />
			//this kernel reads one element and finds its position (subarray size starts from 2x512)<br />
			MergeSortK2Seqns&lt;&lt;&gt;&gt;(dev_arr1,dev_arr2,block.x,SubArrayHalfLength);<br />
			//Wait until Kernel is done<br />
			cudaThreadSynchronize();<br />
		}<br />
		else<br />
		{<br />
			MergeSortK2Pairs&lt;&lt;&gt;&gt;(dev_arr2,block.x,SubArrayHalfLength);//no shared size<br />
			//Wait until Kernel is done<br />
			cudaThreadSynchronize();<br />
			MergeSortK2Seqns&lt;&lt;&gt;&gt;(dev_arr2,dev_arr1,block.x,SubArrayHalfLength);<br />
			//Wait until Kernel is done<br />
			cudaThreadSynchronize();<br />
		}<br />
		SubArrayHalfLength&lt;1)<br />
	{<br />
		printf(&quot;\n%i CUDA-Enabled devices found.\nFirst device is used.\n&quot;,num_devices);<br />
	}<br />
	//Set active device to dev#0 and read its features and properties<br />
	cudaSetDevice(0);<br />
	err = cudaGetDeviceProperties(&amp;prop, 0);<br />
	//Show GPU device features on the screen<br />
	printf(&quot;-----------------------------------------------------------------\n&quot;);<br />
	printf(&quot;&gt; GPU Basic Info  GPU Memory Sizes  Hierarchy Sizes  Input Data &lt;\n&quot;);<br />
	printf(&quot;-----------------------------------------------------------------\n&quot;);<br />
	printf(&quot;Number of input data: %i\n&quot;,NUM);<br />
	printf(&quot;Input data type:      int\n&quot;);<br />
	printf(&quot;Input data pattern:   Random 0..9\n&quot;);<br />
	printf(&quot;Grid Size:            %i\n&quot;,grid.x);<br />
	printf(&quot;Block Size:           %i x %i\n&quot;,block.x,block.y);<br />
	printf(&quot;-----------------------------------------------------------------\n&quot;);</p>
<p>}<br />
bool VerifyResults()<br />
{<br />
	for (int i=1; i&lt;NUM; i++)<br />
		if (host_output_arr[i]&lt;host_output_arr[i-1])<br />
			return false;<br />
	return true;<br />
}<br />
void FinalizeProg(bool passed)<br />
{<br />
	//Print the outcome<br />
	printf(&quot;\n-----------------------------------------------------------------\n&quot;);<br />
	printf(&quot;%s\n&quot;, passed ? &quot;Pass!&quot; : &quot;Fail!&quot;);<br />
	printf(&quot;NUM:\t\t2^%i =\t%i\nGridSize:\t2^%i x\t2^%i\n&quot;, (int)log2((double)NUM), NUM, (int)log2((double)grid.x), (int)log2((double)grid.y));<br />
	//clock() counts ticks, so convert the difference to milliseconds<br />
	printf(&quot;Time:\t\t%i ms\n&quot;, (int)((end-start)*1000.0/CLOCKS_PER_SEC));<br />
	printf(&quot;-----------------------------------------------------------------\n&quot;);<br />
	printf(&quot;Press ENTER to exit...\n&quot;);<br />
	fflush(stdout);<br />
	fflush(stderr);<br />
	getchar();<br />
	exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);<br />
}</b></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by admin</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-1597</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Thu, 10 Mar 2011 22:37:03 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-1597</guid>
		<description>If you look at the reports the methodology is explained. We did performance tests and the parallel version ran much faster than the iterative version (not sure I have those numbers though).</description>
		<content:encoded><![CDATA[<p>If you look at the reports the methodology is explained. We did performance tests and the parallel version ran much faster than the iterative version (not sure I have those numbers though).</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on GPGPU Island Model Genetic Algorithms by admin</title>
		<link>http://jamesdevine.info/index.php/projects/gpgpu-island-model-gentic-algorithms/comment-page-1/#comment-1596</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Thu, 10 Mar 2011 22:32:54 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=8#comment-1596</guid>
		<description>This was a project proposal for a Thesis topic during a junior seminar class. The thesis I went with was &quot;&lt;a href=&quot;http://jamesdevine.info/index.php/projects/xen-cpu-scheduling/&quot; rel=&quot;nofollow&quot;&gt;An Empirical Evaluation of Methods for Improving Efficiency in Xen CPU Scheduling&lt;/a&gt;&quot;</description>
		<content:encoded><![CDATA[<p>This was a project proposal for a Thesis topic during a junior seminar class. The thesis I went with was &#8220;<a href="http://jamesdevine.info/index.php/projects/xen-cpu-scheduling/" rel="nofollow">An Empirical Evaluation of Methods for Improving Efficiency in Xen CPU Scheduling</a>&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on GPGPU Island Model Genetic Algorithms by Rob</title>
		<link>http://jamesdevine.info/index.php/projects/gpgpu-island-model-gentic-algorithms/comment-page-1/#comment-1594</link>
		<dc:creator>Rob</dc:creator>
		<pubDate>Thu, 10 Mar 2011 16:42:55 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=8#comment-1594</guid>
		<description>Have you done any further work on this?</description>
		<content:encoded><![CDATA[<p>Have you done any further work on this?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by Andrea</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-1553</link>
		<dc:creator>Andrea</dc:creator>
		<pubDate>Wed, 02 Mar 2011 15:18:43 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-1553</guid>
		<description>Hi, I may not fully understand the code, but it doesn’t look parallel to me; it looks more like the iterative version running on the GPU.

If you edit the main loop in this way:

if(tid==0)
{
k=1;

while(k&lt;NUM)
{
i = 0;

while(i+k NUM)
{
u = NUM+1;
}

Merge(shared, results, i, i+k, u);
i=i + k * 2;
}

k=k*2;
}
}
__syncthreads();
}

the result will still be correct and the performance will improve.</description>
		<content:encoded><![CDATA[<p>Hi, I may not fully understand the code, but it doesn’t look parallel to me; it looks more like the iterative version running on the GPU.</p>
<p>If you edit the main loop in this way:</p>
<p>if(tid==0)<br />
{<br />
k=1;</p>
<p>while(k&lt;NUM)<br />
{<br />
i = 0;</p>
<p>while(i+k NUM)<br />
{<br />
u = NUM+1;<br />
}</p>
<p>Merge(shared, results, i, i+k, u);<br />
i=i + k * 2;<br />
}</p>
<p>k=k*2;<br />
}<br />
}<br />
__syncthreads();<br />
}</p>
<p>the result will still be correct and the performance will improve.</p>
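<p>For reference, the loop above has the shape of a bottom-up (iterative) merge sort run by a single thread: the run width k doubles on each pass, and neighbouring runs are merged pairwise. Below is a minimal host-side C++ sketch of that pattern, not the code from the post. It assumes 0-based, half-open index ranges, and the names mergeRuns and bottomUpMergeSort are hypothetical, chosen only for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the Merge() from the post): merges the sorted
// runs [lo, mid) and [mid, hi) of src into the same positions of dst.
static void mergeRuns(const std::vector<int>& src, std::vector<int>& dst,
                      std::size_t lo, std::size_t mid, std::size_t hi)
{
    assert(lo <= mid && mid <= hi && hi <= src.size());
    std::merge(src.begin() + lo, src.begin() + mid,
               src.begin() + mid, src.begin() + hi,
               dst.begin() + lo);
}

// Bottom-up merge sort: every pass merges neighbouring runs of width k,
// then k doubles until a single run covers the whole array.
std::vector<int> bottomUpMergeSort(std::vector<int> data)
{
    const std::size_t n = data.size();
    std::vector<int> scratch(n);
    for (std::size_t k = 1; k < n; k *= 2)
    {
        for (std::size_t i = 0; i < n; i += 2 * k)
        {
            // Clamp both bounds so the last (possibly short) run stays in range.
            const std::size_t mid = std::min(i + k, n);
            const std::size_t hi  = std::min(i + 2 * k, n);
            mergeRuns(data, scratch, i, mid, hi);
        }
        data.swap(scratch); // the merged pass becomes the input of the next pass
    }
    return data;
}
```

<p>Within one pass the merges touch disjoint ranges, so they do not depend on each other; in this sketch they simply run one after another.</p>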
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MSI Wind U100 Netbook Review by Knife Sets ·</title>
		<link>http://jamesdevine.info/index.php/2009/05/msi-wind-u100-netbook-review/comment-page-1/#comment-630</link>
		<dc:creator>Knife Sets ·</dc:creator>
		<pubDate>Tue, 09 Nov 2010 02:32:09 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?p=200#comment-630</guid>
		<description>we have 2 msi wind units at home that we always use whenever we go out camping, they are very light and feature packed</description>
		<content:encoded><![CDATA[<p>we have 2 msi wind units at home that we always use whenever we go out camping, they are very light and feature packed</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Getting Hadoop MapReduce 0.20.2 Running On Ubuntu by Joseph</title>
		<link>http://jamesdevine.info/index.php/2010/05/getting-hadoop-mapreduce-0-20-2-running-on-ubuntu/comment-page-1/#comment-334</link>
		<dc:creator>Joseph</dc:creator>
		<pubDate>Wed, 21 Jul 2010 14:44:12 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?p=330#comment-334</guid>
		<description>Hello James,
thank you for sharing this information. Could you please email the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh) to my email address? The links don&#039;t seem to be working.
thank you once again.
Joseph</description>
		<content:encoded><![CDATA[<p>Hello James,<br />
thank you for sharing this information. Could you please email the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh) to my email address? The links don&#8217;t seem to be working.<br />
thank you once again.<br />
Joseph</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by admin</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-258</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Sat, 10 Apr 2010 11:45:41 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-258</guid>
		<description>It appears that my web server does not like to serve up .cu files. I put the kernel and main program in a zip file. The link should work now. Thanks!</description>
		<content:encoded><![CDATA[<p>It appears that my web server does not like to serve up .cu files. I put the kernel and main program in a zip file. The link should work now. Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on CUDA Parallel Merge Sort by Bobo</title>
		<link>http://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort/comment-page-1/#comment-257</link>
		<dc:creator>Bobo</dc:creator>
		<pubDate>Sat, 10 Apr 2010 05:58:51 +0000</pubDate>
		<guid isPermaLink="false">http://jamesdevine.info/?page_id=186#comment-257</guid>
		<description>Hi, the link to your code is broken. Could you check it so that we can study the code? Thanks.</description>
		<content:encoded><![CDATA[<p>Hi, the link to your code is broken. Could you check it so that we can study the code? Thanks.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

