970 memory allocation issue revisited

Discussion in 'Videocards - NVIDIA GeForce' started by alanm, Jan 23, 2015.

Thread Status:
Not open for further replies.
  1. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,103
    Likes Received:
    2,606
    GPU:
    3080TI iChill Black
    OK, it's OK in headless mode:

    [image: benchmark screenshot]

    I was also confused about why I got it; that's why I participated in this thread.
     
  2. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    What bothers me is Nvidia's response. Having a 3.5GB VRAM card isn't exactly that bad, but...

    Either this slipped past unnoticed in their quality testing, or......?
     
  3. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I didn't pay for a 3.5GB card, I paid for a 4GB card.

    We'll just hang out and see what happens, though.
     
  4. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    Yes, of course we all paid for a 4GB card, but what I'm saying is that I would still have bought the card if they had sealed off the last 500MB and specified it as a 3.5GB card. Maybe that's just me. We'll see how this concludes.
     

  5. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I have a 4K monitor. Part of my decision to buy the 970 was the 4GB of VRAM.
     
  6. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    Yes, clearly that's the difference between you and me. Let's stop derailing the thread, though.
     
  7. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Not really derailing the thread.

    In any case, I hope the video I linked in post 100 (http://forums.guru3d.com/showpost.php?p=4998277&postcount=100) might help people figure something out.
     
  8. keasy

    keasy Banned

    Messages:
    548
    Likes Received:
    0
    GPU:
    d1cK
    Sure helped me figure it out.

    As in, f*ck me, what a bunch of retards the PC community still harbours.
     
  9. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Care to elaborate?

    This retard wants to know.
     
  10. Öhr

    Öhr Master Guru

    Messages:
    324
    Likes Received:
    65
    GPU:
    AMD RX 5700XT @ H₂O
    Handling fps variance from a variable framerate is something that should work properly in a feature such as ShadowPlay, so you may actually have discovered another bug altogether: ShadowPlay should, and can, handle framerates lower than the set output framerate of the recording; it simply repeats previous frames. In your case, however, it seems to encode erroneous frames altogether... Though it might not be caused by the VRAM bandwidth of the 970 cards, but instead by an issue that can be reproduced on all cards whenever the frametime is above a certain threshold.
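
    To picture that repeat-previous-frame behaviour, here is a minimal sketch of the idea (plain C++, purely illustrative; resampleToConstantRate and all the names in it are mine, not anything from ShadowPlay):

    Code:
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Frame { double timestampMs; /* pixel data omitted */ };

    //Emit one output frame every outputIntervalMs; when the source has not
    //produced a new frame in time, repeat the most recent one.
    std::vector<Frame> resampleToConstantRate(const std::vector<Frame>& captured,
                                              double outputIntervalMs,
                                              double durationMs)
    {
        std::vector<Frame> out;
        std::size_t src = 0;
        for (double t = 0.0; t < durationMs; t += outputIntervalMs) {
            while (src + 1 < captured.size() && captured[src + 1].timestampMs <= t)
                ++src;                    //newest captured frame available at time t
            out.push_back(captured[src]); //repeats it if nothing newer arrived
        }
        return out;
    }

    int main()
    {
        //The source stalls between 30 ms and 120 ms; the 60 ms and 90 ms output
        //slots simply repeat the 30 ms frame.
        std::vector<Frame> captured = { {0}, {30}, {120}, {150} };
        std::vector<Frame> out = resampleToConstantRate(captured, 30.0, 180.0);
        for (std::size_t i = 0; i < out.size(); i++)
            printf("output frame %zu <- source timestamp %.0f ms\n", i, out[i].timestampMs);
        return 0;
    }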
     

  11. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Shadowplay does produce variable framerate outputs.

    Limiting FPS to 10 or 20 with, say, Afterburner does not produce these glitches, and the graphical glitches evidenced in the video happened only when the scene on screen was frozen.

    I'm not saying you're wrong; I'm saying that the correlation between visually freezing during play and the capturing of the glitches seems too... coincidental. Again, note that the GPU is not maxed, VRAM is not maxed and RAM is not maxed when these hitches (and recorded glitches) happen. Also note that movement is still captured during those glitches, while the play screen is frozen.
     
  12. VultureX

    VultureX Banned

    Messages:
    2,577
    Likes Received:
    0
    GPU:
    MSI GTX970 SLI
    I finally figured out the compile options, so here is the requested functionality.
    You can now specify the allocation block size and the maximum amount of memory that is used as follows:

    vRamBandwidthTest.exe [BlockSizeMB] [MaxAllocationMB]
    - BlockSizeMB: any of 16, 32, 64, 128, 256, 512 or 1024
    - MaxAllocationMB: any number greater than or equal to BlockSizeMB

    If no arguments are given, the test defaults to the 128MB block size with no memory limit, which corresponds exactly to the old program.
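
    For example (the numbers here are purely illustrative), to test in 256MB chunks while capping the total allocation at 3584MB you would run:

    Code:
    vRamBandwidthTest.exe 256 3584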

    Download here:
    http://nl.guru3d.com/vRamBandWidthTest-guru3d.zip

    Source:
    Code:
    #include "device_launch_parameters.h"
    #include "helper_math.h"
    #include <stdio.h>
    #include <iostream>
    #define CacheCount 5
    
    //Each thread streams one float4 from DRAM. The length() comparison can
    //never be true (length() is non-negative); it only keeps the compiler
    //from optimizing the read away.
    __global__ void BenchMarkDRAMKernel(float4* In, int Float4Count)
    {
    	int ThreadID = (blockDim.x *blockIdx.x + threadIdx.x) % Float4Count;
     
    	float4 Temp = make_float4(1);
     
    	Temp += In[ThreadID];
    	
     
    	if (length(Temp) == -12354)
    		In[0] = Temp;
     
    } 
     
    //Each thread reads CacheCount float4s. Zero is always passed as 0, so all
    //reads hit the same address and are served from cache after the first one.
    __global__ void BenchMarkCacheKernel(float4* In, int Zero,int Float4Count)
    {
    	int ThreadID = (blockDim.x *blockIdx.x + threadIdx.x) % Float4Count;
     
    	float4 Temp = make_float4(1);
     
    #pragma unroll
    	for (int i = 0; i < CacheCount; i++)
    	{
    		Temp += In[ThreadID + i*Zero];
    	}
     
    	if (length(Temp) == -12354)
    		In[0] = Temp;
     
    }
    
    //True iff x is a nonzero power of two (exactly one bit set)
    int isPowerOfTwo (unsigned int x)
    {
      return ((x != 0) && !(x & (x - 1)));
    }
     
    int main(int argc, char *argv[])
    {
    	printf("Nai's Benchmark, edited by VultureX \n");
    
    	//Sanity checks and some device info:
    	int nDevices;
    	cudaGetDeviceCount(&nDevices);
    	if(nDevices >= 1) {
    		cudaDeviceProp prop;
    		cudaGetDeviceProperties(&prop, 0);
    		printf("  Device: %s (%1.2f GB)\n", prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    		printf("  Memory Bus Width (bits): %d\n",
    			   prop.memoryBusWidth);
    		printf("  Peak Theoretical DRAM Bandwidth (GB/s): %f\n\n",
    			   2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    	} else {
    		printf("No CUDA capable devices were found!\n");
    		printf("Press return to exit...\n");
    		getchar();
    		return 1;
    	}
    	
    	//Get maximum amount of memory that should be allocated
    	unsigned int MemLimitMB;
    	if(argc < 3 || sscanf(argv[2], " %u", &MemLimitMB) != 1) {
    		MemLimitMB = INT_MAX;
    	}
    
    	//Get block size in MB, default to 128
    	unsigned int ChunkSizeMB = 0;
    	if(argc >= 2) {
    		sscanf(argv[1], " %u", &ChunkSizeMB);	
    	}
    	if(ChunkSizeMB < 16 || ChunkSizeMB > 1024 || !isPowerOfTwo(ChunkSizeMB)) {
    		ChunkSizeMB = 128;
    	}
    	if(MemLimitMB < ChunkSizeMB) {
    		MemLimitMB = ChunkSizeMB;
    	}
    	int ChunkSize = ChunkSizeMB * 1024 * 1024; //To Bytes
    	int Float4Count = ChunkSize / sizeof(float4);
    	
    	//Allocate as many blocks as possible
    	static const int PointerCount = 5000;
    	float4* Pointers[PointerCount];
    	int UsedPointers = 0;
    	
    	printf("Allocating Memory . . . \nChunk Size: %i MiByte  \n", ChunkSizeMB);	
    	while (cudaGetLastError() == cudaSuccess
    		&& UsedPointers < PointerCount //stay within the Pointers array
    		&& (UsedPointers+1) * ChunkSizeMB <= MemLimitMB)
    	{ 
    		cudaMalloc(&Pointers[UsedPointers], ChunkSize); 
    		if (cudaGetLastError() != cudaSuccess) {
    			break;
    		}
    
    		cudaMemset(Pointers[UsedPointers], 0, ChunkSize);
    		UsedPointers++;
    	} 
     
    	printf("Allocated %i Chunks \n", UsedPointers); 
    	printf("Allocated %i MiByte \n", ChunkSizeMB*UsedPointers);
     
    	//Benchmarks
    	cudaEvent_t start, stop;
    	cudaEventCreate(&start);
    	cudaEventCreate(&stop);
     
    	int BlockSize = 128;
    	int BenchmarkCount = 30;
    	int BlockCount = BenchmarkCount * Float4Count / BlockSize;
    	
    	printf("Benchmarking DRAM \n");
    	
    	for (int i = 0; i < UsedPointers; i++)
    	{
    		cudaEventRecord(start);
     
    		BenchMarkDRAMKernel <<<BlockCount, BlockSize>>>(Pointers[i], Float4Count);
     
    		cudaEventRecord(stop);
    		cudaEventSynchronize(stop);
    		
    		// Check for any errors launching the kernel
    		cudaError_t cudaStatus = cudaGetLastError();
    		if (cudaStatus != cudaSuccess) {
    			fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    			continue;
    		}
    		float milliseconds = 0;
    		cudaEventElapsedTime(&milliseconds, start, stop);
     
    		float Bandwidth = ((float)(BenchmarkCount)* (float)(ChunkSize)) / milliseconds / 1000.f / 1000.f;
    		printf("DRAM-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n", i, ChunkSizeMB*i, ChunkSizeMB*(i + 1), Bandwidth);
    	} 
     
     
    	printf("Benchmarking L2-Cache \n"); 
     
    	for (int i = 0; i < UsedPointers; i++)
    	{
    		cudaEventRecord(start);
     
    		BenchMarkCacheKernel <<<BlockCount, BlockSize>>>(Pointers[i], 0, Float4Count);
    
    		cudaEventRecord(stop);
    		cudaEventSynchronize(stop);
     
    		// Check for any errors launching the kernel
    		cudaError_t cudaStatus = cudaGetLastError();
    		if (cudaStatus != cudaSuccess) {
    			fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    			continue;
    		}
    		float milliseconds = 0;
    		cudaEventElapsedTime(&milliseconds, start, stop);
     
    		float Bandwidth = (((float)CacheCount* (float)BenchmarkCount * (float)ChunkSize)) / milliseconds / 1000.f / 1000.f;
    		printf("L2-Cache-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n", i, ChunkSizeMB*i, ChunkSizeMB*(i + 1), Bandwidth);
    	}
     
     
    	system("pause");
     
    	cudaDeviceSynchronize();
    	cudaDeviceReset();
        return 0;
    }
    
    @Fox2232:
    By the way, "int BlockSize = 128;" has nothing to do with memory allocation and is best left at its current value. It actually denotes the number of threads per thread block of the GPU kernels.
    The total number of threads that is run is determined by BlockSize * BlockCount, so there will always be enough threads spawned to cover all of the memory.
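
    To make that concrete with the defaults from the source above: a 128 MiB chunk holds 128*1024*1024 / 16 = 8,388,608 float4 elements, so BlockCount = 30 * 8,388,608 / 128 = 1,966,080 blocks of 128 threads each; every float4 in the chunk ends up being read 30 times per benchmark pass.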
     
  13. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    May we post your compiled version on other forums/websites?

    I can't test it myself because I'm waiting for my GTX 970 to come back from RMA. T_T
     
  14. sykozis

    sykozis Ancient Guru

    Messages:
    22,492
    Likes Received:
    1,537
    GPU:
    Asus RX6700XT
    No need for arguments. The mods have been made aware of this thread and will close it if things go askew. So far, this thread has been calm and respectful. Let's keep it that way.

    That said, we need to get away from the 970 vs 980 testing. If we're going to find answers, we need to involve other "cut-down" GPUs in the testing to see if there's a trend. If anyone has a friend with a GTX470, GTX480, GTX580, GTX670 or GTX680 who's willing to run this "test" and post results, it'd be quite helpful. It would work best if we can get all five of those cards involved, as that would provide a much clearer picture.
     
  15. VultureX

    VultureX Banned

    Messages:
    2,577
    Likes Received:
    0
    GPU:
    MSI GTX970 SLI
    Of course!
     

  16. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    Well, one thing is for sure: the GTX 780 (4 GPCs), GTX 780 Ti (5 GPCs) and Titan are unlikely to encounter this issue, since they have more memory controllers than GPC/raster-engine clusters.
    But those 780-series cards have whole GPCs/raster engines disabled, rather than individual SMM/SMX units.
     
  17. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    Thanks, I'll be posting the link to your post at OCN.
     
  18. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I disagree.

    The 970 is "supposed" to have the same memory available to it as the 980, on the same architecture.

    Previous cards have nothing to do, in this case, with the issue (or, as the case may be, non-issue) at hand.

    Controlled comparisons need to be made between the two to see if the 970 does, in fact, have crippled memory access compared to the 980, since both were sold with supposedly the same marketed memory capacity and bandwidth.
     
  19. Scouty

    Scouty Active Member

    Messages:
    81
    Likes Received:
    0
    GPU:
    GTX 970 OC WiNDFORCE 3X
    Now testing with 344.11... I ran the bench 5x with the same result... better than the newest driver...

    Maybe the driver can improve something =)

    [image: benchmark screenshot]

    At the end of the L2 cache results: 81 GB/s vs 19 GB/s
     
  20. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB