970 memory allocation issue revisited

Discussion in 'Videocards - NVIDIA GeForce' started by alanm, Jan 23, 2015.

Thread Status:
Not open for further replies.
  1. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,103
    Likes Received:
    2,606
    GPU:
    3080TI iChill Black
    OK, it's OK in headless mode:

    [image: benchmark screenshot]

    I was also confused about why I got it; that's why I participated in this thread.
     
  2. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    What bothers me is Nvidia's response. Having a 3.5GB VRAM card isn't exactly that bad, but...

    Either this slipped past unnoticed in their quality testing, or......?
     
  3. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I didn't pay for a 3.5GB card, I paid for a 4GB card.

    We'll just hang out and see what happens, though.
     
  4. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    Yes, of course we all paid for a 4GB card, but what I'm saying is that I would still have bought the card if they had sealed off the last 500MB and specified it as a 3.5GB card. Maybe that's just me. We'll see how this concludes.
     

  5. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I have a 4K monitor. Part of my decision to buy the 970 was the 4GB of VRAM.
     
  6. palvo23

    palvo23 Guest

    Messages:
    14
    Likes Received:
    0
    GPU:
    MSI GTX970 4G OC
    Yes, clearly that's the difference between you and me. Let's stop derailing the thread, though.
     
  7. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Not really derailing the thread.

    In any case, I hope the video I linked in post 100 (http://forums.guru3d.com/showpost.php?p=4998277&postcount=100) might help people figure something out.
     
  8. keasy

    keasy Banned

    Messages:
    548
    Likes Received:
    0
    GPU:
    d1cK
    Sure helped me figure it out.

    As in, f*ck me, what a bunch of retards the PC community still harbours.
     
  9. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Care to elaborate?

    This retard wants to know.
     
  10. Öhr

    Öhr Master Guru

    Messages:
    324
    Likes Received:
    65
    GPU:
    AMD RX 5700XT @ H₂O
    Handling fps variance from a variable framerate is something that should work properly in a feature such as ShadowPlay, so you may actually have discovered another bug altogether: ShadowPlay should, and can, handle framerates lower than the set output framerate of the recording; it simply repeats previous frames. In your case, however, it seems to encode erroneous frames altogether... Though it might not be caused by the VRAM bandwidth of the 970 cards, but instead by an issue that can be reproduced on all cards whenever the frametime is above a certain threshold.
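
    To picture that repeat-previous-frame behaviour, here is a minimal sketch of the idea (plain C++, purely illustrative; resampleToConstantRate and all the names in it are mine, not anything from ShadowPlay):

    Code:
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Frame { double timestampMs; /* pixel data omitted */ };

    //Emit one output frame every outputIntervalMs; when the source has not
    //produced a new frame in time, repeat the most recent one.
    std::vector<Frame> resampleToConstantRate(const std::vector<Frame>& captured,
                                              double outputIntervalMs,
                                              double durationMs)
    {
        std::vector<Frame> out;
        std::size_t src = 0;
        for (double t = 0.0; t < durationMs; t += outputIntervalMs) {
            while (src + 1 < captured.size() && captured[src + 1].timestampMs <= t)
                ++src;                    //newest captured frame available at time t
            out.push_back(captured[src]); //repeats it if nothing newer arrived
        }
        return out;
    }

    int main()
    {
        //The source stalls between 30 ms and 120 ms; the 60 ms and 90 ms output
        //slots simply repeat the 30 ms frame.
        std::vector<Frame> captured = { {0}, {30}, {120}, {150} };
        std::vector<Frame> out = resampleToConstantRate(captured, 30.0, 180.0);
        for (std::size_t i = 0; i < out.size(); i++)
            printf("output frame %zu <- source timestamp %.0f ms\n", i, out[i].timestampMs);
        return 0;
    }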
     

  11. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    Shadowplay does produce variable framerate outputs.

    Limiting FPS to 10 or 20 with, say, Afterburner does not produce these glitches, and the graphical glitches evidenced in the video happened only when the scene on screen was frozen.

    I'm not saying you're wrong; I'm saying that the correlation between visually freezing during play and the capturing of the glitches seems too... coincidental. Again, note that the GPU is not maxed, VRAM is not maxed and RAM is not maxed when these hitches (and recorded glitches) happen. Also note that movement is still captured during those glitches, while the play screen is frozen.
     
  12. VultureX

    VultureX Banned

    Messages:
    2,577
    Likes Received:
    0
    GPU:
    MSI GTX970 SLI
    I finally figured out the compile options, so here is the requested functionality.
    You can now specify the allocation block size and the maximum amount of memory that is used as follows:

    vRamBandwidthTest.exe [BlockSizeMB] [MaxAllocationMB]
    - BlockSizeMB: any of 16, 32, 64, 128, 256, 512 or 1024
    - MaxAllocationMB: any number greater than or equal to BlockSizeMB

    If no arguments are given, the test defaults to the 128MB block size with no memory limit, which corresponds exactly to the old program.
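
    For example (the numbers here are purely illustrative), to test in 256MB chunks while capping the total allocation at 3584MB you would run:

    Code:
    vRamBandwidthTest.exe 256 3584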

    Download here:
    http://nl.guru3d.com/vRamBandWidthTest-guru3d.zip

    Source:
    Code:
    #include "device_launch_parameters.h"
    #include "helper_math.h"
    #include <stdio.h>
    #include <iostream>
    #define CacheCount 5
    
    //Each thread streams one float4 from DRAM. The length() comparison can
    //never be true (length() is non-negative); it only keeps the compiler
    //from optimizing the read away.
    __global__ void BenchMarkDRAMKernel(float4* In, int Float4Count)
    {
    	int ThreadID = (blockDim.x *blockIdx.x + threadIdx.x) % Float4Count;
     
    	float4 Temp = make_float4(1);
     
    	Temp += In[ThreadID];
    	
     
    	if (length(Temp) == -12354)
    		In[0] = Temp;
     
    } 
     
    //Each thread reads CacheCount float4s. Zero is always passed as 0, so all
    //reads hit the same address and are served from cache after the first one.
    __global__ void BenchMarkCacheKernel(float4* In, int Zero,int Float4Count)
    {
    	int ThreadID = (blockDim.x *blockIdx.x + threadIdx.x) % Float4Count;
     
    	float4 Temp = make_float4(1);
     
    #pragma unroll
    	for (int i = 0; i < CacheCount; i++)
    	{
    		Temp += In[ThreadID + i*Zero];
    	}
     
    	if (length(Temp) == -12354)
    		In[0] = Temp;
     
    }
    
    //True iff x is a nonzero power of two (exactly one bit set)
    int isPowerOfTwo (unsigned int x)
    {
      return ((x != 0) && !(x & (x - 1)));
    }
     
    int main(int argc, char *argv[])
    {
    	printf("Nai's Benchmark, edited by VultureX \n");
    
    	//Sanity checks and some device info:
    	int nDevices;
    	cudaGetDeviceCount(&nDevices);
    	if(nDevices >= 1) {
    		cudaDeviceProp prop;
    		cudaGetDeviceProperties(&prop, 0);
    		printf("  Device: %s (%1.2f GB)\n", prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    		printf("  Memory Bus Width (bits): %d\n",
    			   prop.memoryBusWidth);
    		printf("  Peak Theoretical DRAM Bandwidth (GB/s): %f\n\n",
    			   2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    	} else {
    		printf("No CUDA capable devices were found!\n");
    		printf("Press return to exit...\n");
    		getchar();
    		return 1;
    	}
    	
    	//Get maximum amount of memory that should be allocated
    	unsigned int MemLimitMB;
    	if(argc < 3 || sscanf(argv[2], " %u", &MemLimitMB) != 1) {
    		MemLimitMB = INT_MAX;
    	}
    
    	//Get block size in MB, default to 128
    	unsigned int ChunkSizeMB = 0;
    	if(argc >= 2) {
    		sscanf(argv[1], " %u", &ChunkSizeMB);	
    	}
    	if(ChunkSizeMB < 16 || ChunkSizeMB > 1024 || !isPowerOfTwo(ChunkSizeMB)) {
    		ChunkSizeMB = 128;
    	}
    	if(MemLimitMB < ChunkSizeMB) {
    		MemLimitMB = ChunkSizeMB;
    	}
    	int ChunkSize = ChunkSizeMB * 1024 * 1024; //To Bytes
    	int Float4Count = ChunkSize / sizeof(float4);
    	
    	//Allocate as many blocks as possible
    	static const int PointerCount = 5000;
    	float4* Pointers[PointerCount];
    	int UsedPointers = 0;
    	
    	printf("Allocating Memory . . . \nChunk Size: %i MiByte  \n", ChunkSizeMB);	
    	while (cudaGetLastError() == cudaSuccess
    		&& UsedPointers < PointerCount //stay within the Pointers array
    		&& (UsedPointers+1) * ChunkSizeMB <= MemLimitMB)
    	{ 
    		cudaMalloc(&Pointers[UsedPointers], ChunkSize); 
    		if (cudaGetLastError() != cudaSuccess) {
    			break;
    		}
    
    		cudaMemset(Pointers[UsedPointers], 0, ChunkSize);
    		UsedPointers++;
    	} 
     
    	printf("Allocated %i Chunks \n", UsedPointers); 
    	printf("Allocated %i MiByte \n", ChunkSizeMB*UsedPointers);
     
    	//Benchmarks
    	cudaEvent_t start, stop;
    	cudaEventCreate(&start);
    	cudaEventCreate(&stop);
     
    	int BlockSize = 128;
    	int BenchmarkCount = 30;
    	int BlockCount = BenchmarkCount * Float4Count / BlockSize;
    	
    	printf("Benchmarking DRAM \n");
    	
    	for (int i = 0; i < UsedPointers; i++)
    	{
    		cudaEventRecord(start);
     
    		BenchMarkDRAMKernel <<<BlockCount, BlockSize>>>(Pointers[i], Float4Count);
     
    		cudaEventRecord(stop);
    		cudaEventSynchronize(stop);
    		
    		// Check for any errors launching the kernel
    		cudaError_t cudaStatus = cudaGetLastError();
    		if (cudaStatus != cudaSuccess) {
    			fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    			continue;
    		}
    		float milliseconds = 0;
    		cudaEventElapsedTime(&milliseconds, start, stop);
     
    		float Bandwidth = ((float)(BenchmarkCount)* (float)(ChunkSize)) / milliseconds / 1000.f / 1000.f;
    		printf("DRAM-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n", i, ChunkSizeMB*i, ChunkSizeMB*(i + 1), Bandwidth);
    	} 
     
     
    	printf("Benchmarking L2-Cache \n"); 
     
    	for (int i = 0; i < UsedPointers; i++)
    	{
    		cudaEventRecord(start);
     
    		BenchMarkCacheKernel <<<BlockCount, BlockSize>>>(Pointers[i], 0, Float4Count);
    
    		cudaEventRecord(stop);
    		cudaEventSynchronize(stop);
     
    		// Check for any errors launching the kernel
    		cudaError_t cudaStatus = cudaGetLastError();
    		if (cudaStatus != cudaSuccess) {
    			fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    			continue;
    		}
    		float milliseconds = 0;
    		cudaEventElapsedTime(&milliseconds, start, stop);
     
    		float Bandwidth = (((float)CacheCount* (float)BenchmarkCount * (float)ChunkSize)) / milliseconds / 1000.f / 1000.f;
    		printf("L2-Cache-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n", i, ChunkSizeMB*i, ChunkSizeMB*(i + 1), Bandwidth);
    	}
     
     
    	system("pause");
     
    	cudaDeviceSynchronize();
    	cudaDeviceReset();
        return 0;
    }
    
    @Fox2232:
    By the way, "int BlockSize = 128;" has nothing to do with memory allocation and is best left at its current value. It actually denotes the number of threads per thread block of the GPU kernels.
    The total number of threads that is run is determined by BlockSize * BlockCount, so there will always be enough threads spawned to cover all of the memory.
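
    To make that concrete with the defaults from the source above: a 128 MiB chunk holds 128*1024*1024 / 16 = 8,388,608 float4 elements, so BlockCount = 30 * 8,388,608 / 128 = 1,966,080 blocks of 128 threads each; every float4 in the chunk ends up being read 30 times per benchmark pass.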
     
  13. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    May we post your compiled version on other forums/websites?

    I can't test it myself because I'm waiting for my GTX 970 to come back from RMA. T_T
     
  14. sykozis

    sykozis Ancient Guru

    Messages:
    22,492
    Likes Received:
    1,537
    GPU:
    Asus RX6700XT
    No need for arguments. The mods have been made aware of this thread and will close it if things go askew. So far, this thread has been calm and respectful. Let's keep it that way.

    That said, we need to get away from the 970 vs 980 testing. If we're going to find answers, we need to involve other "cut-down" GPUs in the testing to see if there's a trend. If anyone has a friend with a GTX470, GTX480, GTX580, GTX670 or GTX680 who's willing to run this "test" and post results, it'd be quite helpful. It would work best if we can get all five of those cards involved, as that would provide a much clearer picture.
     
  15. VultureX

    VultureX Banned

    Messages:
    2,577
    Likes Received:
    0
    GPU:
    MSI GTX970 SLI
    Of course!
     

  16. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    Well, one thing is for sure: the GTX 780 (4 GPCs), GTX 780 Ti (5 GPCs) and Titan are unlikely to encounter this issue, since they have more memory controllers than GPC/raster-engine clusters.
    But those 780-series cards have whole GPCs/raster engines disabled, rather than individual SMM/SMX units.
     
  17. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB
    Thanks, I'll be posting the link to your post at OCN.
     
  18. SuperAverage

    SuperAverage Guest

    Messages:
    247
    Likes Received:
    2
    GPU:
    Gigabyte xtreme 1080
    I disagree.

    The 970 is "supposed" to have the same memory available to it as the 980, on the same architecture.

    Previous cards have nothing to do, in this case, with the issue (or, as the case may be, non-issue) at hand.

    Controlled comparisons need to be made between the two to see if the 970 does, in fact, have crippled memory access compared to the 980, since both were sold with supposedly the same marketed memory capacity and bandwidth.
     
  19. Scouty

    Scouty Active Member

    Messages:
    81
    Likes Received:
    0
    GPU:
    GTX 970 OC WiNDFORCE 3X
    Now testing with 344.11... I ran the bench 5x with the same result... better than the newest driver...

    Maybe the driver can improve something =)

    [image: benchmark screenshot]

    At the end of the L2 cache results: 81 GB/s vs 19 GB/s
     
  20. JohnLai

    JohnLai Guest

    Messages:
    136
    Likes Received:
    7
    GPU:
    ASUS GTX 970 3.5+0.5GB