OpenCL big-bang benchmark for multi-CPU, multi-GPU, iGPU and accelerator

Discussion in 'Benchmark Mayhem' started by Tugrul_512bit, Jun 10, 2013.

  1. Tugrul_512bit

    Tugrul_512bit Guest

    Messages:
    114
    Likes Received:
    0
    GPU:
    msi_r7870hawk_asus_r7_240
    OpenCL big-bang (N-body) benchmark for multi-device systems (heterogeneous or not)

    Hi, here's a benchmark for all of you benchmarkers. It needs Java and OpenCL drivers installed.
    New version! Workload resolution for vector types (VLIW, MMX, AVX) has been increased.

    Benchmark in RAR file (English): http://www.fileswap.com/dl/zaZqomw3bk/

    If WINDOWS.BAT does not work, here is an alternative:
    http://www.fileswap.com/dl/h69qAKCfd7/
    Benchmark with the alternative batch file (English): http://www.fileswap.com/dl/qJiBMxfYzw/



    A YouTube video shows usage: http://www.youtube.com/watch?v=YNj10f-7yTA

    If you don't have Java: http://java.com/en/download/index.jsp

    Drivers:

    Intel (HD3000 has problems, HD4000 works, no problems on the CPU side, Xeon Phi included)
    http://software.intel.com/en-us/vcsource/tools/opencl-sdk


    AMD (AMD APP SDK), may or may not be needed
    http://developer.amd.com/tools-and-...erated-parallel-processing-app-sdk/downloads/

    NVIDIA OpenCL
    http://www.nvidia.com/Download/index.aspx?lang=en-us
    https://developer.nvidia.com/cuda-toolkit-31-downloads


    Just extract the archive to a folder, then run WINDOWS.BAT or LINUX.SH to start the benchmark. If those don't work, double-click the jar file in the extraction folder. On Mac OS X or Solaris, just run the .jar file.

    I wrote this benchmark in Java using JOCL and jMonkeyEngine 3.0. It uses OpenCL to simulate a big bang of 25600, 51200 or 102400 particles, much like a supernova (without a relativistic approach), on any OpenCL-capable device or combination of devices. The resulting benchmark points reflect compute performance: they scale linearly with the Gflops value and inversely with the square of the particle count, because this is a full O(N^2) computation. For example, if you get 1000 points in the 51200-particle test, you will likely get about 250 points in the 102400-particle test. Of course, a GPU performs better with more particles, so you may get 300 instead of 250. The 25600-particle test is the lightest and depends on host (CPU), device (CPU, GPU, accelerator or coprocessor) and RAM performance.
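As a rough sketch of the scaling described above (not code from the benchmark itself; the function name is made up for illustration), the expected score at another particle count follows from the N^2 cost of one calculation cycle:

```python
def predicted_points(known_points, known_particles, target_particles):
    """Estimate a score at another particle count.

    Assumes points are inversely proportional to N^2, as stated for
    this full O(N^2) benchmark. Real devices may deviate from this.
    """
    ratio = known_particles / target_particles
    return known_points * ratio ** 2


# The example from the post: 1000 points at 51200 particles.
print(predicted_points(1000, 51200, 102400))  # 250.0

# HD7870 scores listed in this thread: 1140 points at 51200 particles.
# The ideal prediction is 285.0, while the listed 102400 score is 323,
# which fits the remark that GPUs do better with more particles.
print(predicted_points(1140, 51200, 102400))  # 285.0
```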

    Support for VLIW4 (and VLIW5), SSE, AVX, GCN, CUDA and float16 (Xeon Phi) is included.

    OpenCL is slower than CUDA on the same Nvidia hardware, so Nvidia scores are multiplied by 1.2 to give an idea of what the CUDA performance would be.
    The source is old, but it is the only one I could find for this multiplier: http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
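A minimal sketch of that adjustment, checked against the raw Nvidia scores reported in replies to this thread (7244 and 2168) and the adjusted values in the score list (8692 and 2601). The function name is hypothetical, and rounding down is an assumption inferred from those listed numbers:

```python
def cuda_estimate(raw_opencl_points):
    """Apply the 1.2x Nvidia multiplier to a raw OpenCL score.

    Integer arithmetic (x * 12 // 10) rounds down, which is an assumption
    that happens to match the adjusted scores listed in this thread.
    """
    return raw_opencl_points * 12 // 10


print(cuda_estimate(7244))  # 8692 (GTX680 SLI entry)
print(cuda_estimate(2168))  # 2601 (GTX480 entry)
```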



    Scaling works for at least 4 cards, and even for heterogeneous device combinations such as Nvidia + AMD + Intel HD4000.

    I will add any scores that come with screenshots of the score results and CPU-Z/GPU-Z.
    Warning: this will heat your hardware like Prime95 or FurMark. Don't pair a low-Gflops CPU with a high-Gflops GPU: the GPU will periodically sit idle waiting for the CPU, so its temperature will keep rising and then suddenly dropping. (If you do this, you will need a soldering gun. The GTX 700 series may be immune because it has a temperature-lock mechanism, but I don't know.)
    There is absolutely no harm with heterogeneous devices of similar performance. Examples: HD7850 + GTX660, or 2 Xeons and an HD4830.

    When two or more devices are used, you can set any workload ratio, such as 50-50 (CrossFire or two devices), 30-35-35 (triple SLI, where the first card also handles video output and so gets a smaller OpenCL load) or 25-25-25-25 (4 Opterons).
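To illustrate what such a ratio means in particle counts, here is a hypothetical sketch. How the benchmark actually partitions the work is not stated in this post, so the function and its leftover handling are assumptions for illustration only:

```python
def split_workload(total_particles, ratios):
    """Divide particles among devices in proportion to the given ratios.

    Hypothetical helper, not taken from the benchmark. Any remainder
    left over by integer division is given to the last device.
    """
    total_ratio = sum(ratios)
    shares = [total_particles * r // total_ratio for r in ratios]
    shares[-1] += total_particles - sum(shares)
    return shares


print(split_workload(51200, [50, 50]))           # [25600, 25600]
print(split_workload(25600, [30, 35, 35]))       # [7680, 8960, 8960]
print(split_workload(102400, [25, 25, 25, 25]))  # [25600, 25600, 25600, 25600]
```

In the 30-35-35 case the first card receives 7680 of 25600 particles, leaving it headroom for video output.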

    Some scores from another forum for 25600 particles:
    GT650M (1000/1500) ----->1371 points (i7 3610QM 3.1GHz)
    Intel-HD4000 (1100/800) ----->482 points (i7 3610QM 3.1GHz)
    Xeon-E3-1270-V2 (3500MHz-7 threads) ----->268 points
    i7-3770K (4500MHz-7 threads) ----->207 points (igpu off)
    FX8150 (4300MHz-7 threads) ----->142 points
    Full list: http://forum.donanimhaber.com/m_75692871/tm.htm

    Submissions from this thread:

    __________25600 particles (for high-end CPU or low-end GPU) (1 point = 1 calculation cycle) (1 cycle = 0.375 Gflop)__________

    TwoPlusTwo: Titan SLI @ 1150/3213 & 1137/3213 ----->11156 points (3930K @ 4.6GHz, DDR-2133) 4183 Gflop/s
    ---TK---: GTX680 4GB SLI (1254/7200) ----->8692 points (i7 2600K @ 4.7GHz, HT off, DDR3-2133) 3259 Gflop/s
    Tugrul_512bit: HD7870 (1200/1320) (13.4 WHQL) ----->3982 points (FX8150 @ 4300MHz, DDR3-1866) 1493 Gflop/s (with Win 7 x64 SP1)
    ricardonuno1980: GTX480 (701/1848) (320.14 Beta) ----->2601 points (i5-2500K @ 4.5GHz, DDR3-1600) 975 Gflop/s (with Win 7 x64 SP1)



    __________51200 particles (for 2-3 GPUs) (1 point = 1 calculation cycle) (1 cycle = 1.5 Gflop)__________

    Tugrul_512bit: HD7870 (1200/1320) (13.4 WHQL) ----->1140 points (FX8150 @ 4300MHz, DDR3-1866) 1710 Gflop/s (with Win 7 x64 SP1)



    __________102400 particles (for many GPUs) (1 point = 1 calculation cycle) (1 cycle = 6.0 Gflop)__________

    Tugrul_512bit: HD7870 (1200/1320) (13.4 WHQL) ----->323 points (FX8150 @ 4300MHz, DDR3-1866) 1938 Gflop/s (with Win 7 x64 SP1)
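The throughput figures in these lists can be reproduced from the points and the per-cycle factors in the section headers. A small sketch (the names are made up for illustration, and truncating to a whole number is an assumption that matches the listed values):

```python
# Work per calculation cycle, taken from the section headers above.
# The factor quadruples each time the particle count doubles (O(N^2)).
GFLOP_PER_CYCLE = {25600: 0.375, 51200: 1.5, 102400: 6.0}


def points_to_gflops(points, particles):
    """Convert benchmark points to Gflop/s for a given particle count."""
    return points * GFLOP_PER_CYCLE[particles]


print(points_to_gflops(3982, 25600))   # 1493.25, listed as 1493
print(points_to_gflops(1140, 51200))   # 1710.0
print(points_to_gflops(323, 102400))   # 1938.0
```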



    Important note: If you benchmark on your CPU, you can change how many cores are used for computing versus data moving/rendering with the slider under the benchmark buttons (slide left to use the maximum number of cores for computing, slide right for the minimum).

    Thanks.

    Don't mind the text below:

    High performance computing, benchmark, OpenCL scaling, jmonkey engine 3.0, jocl, galaxy simulation, nbody benchmark. Xeon-phi, Titan, multi-thread. Rise against injustice. Megadeth.
     
    Last edited: Dec 30, 2013
  2. ricardonuno1980

    ricardonuno1980 Banned

    Messages:
    4,407
    Likes Received:
    0
    GPU:
    GTX 780Ti Classified :D
    Can an OpenCL 1.0 driver run this bench?
    OCL 1.0 is ~1.5x faster than 1.1 (abnormal performance, on NVIDIA cards only).
     
  3. yasamoka

    yasamoka Ancient Guru

    Messages:
    4,875
    Likes Received:
    259
    GPU:
    Zotac RTX 3090
    Awesome stuff. Will try it when I put my cards on water soon.
     
  4. ---TK---

    ---TK--- Guest

    Messages:
    22,104
    Likes Received:
    3
    GPU:
    2x 980Ti Gaming 1430/7296
    7244. Specs in sig, HT off.
     

  5. Tugrul_512bit

    Tugrul_512bit Guest

    @ricardonuno1980: I used a simple vector-driven kernel and very basic CL methods, which should be available in all OpenCL versions. This package uses JOCL's OpenCL 1.2 compatible version, but it works with 1.1. I did not try 1.0.

    @yasamoka: Waiting for 2x7970.

    @---TK---: Nice. I will multiply your score since it is Nvidia (a rule for this benchmark, to get an idea of "how would this be in CUDA?"). Adding your score (this is 25600, I suppose).
     
  6. ---TK---

    ---TK--- Guest

    yeah 25600
     
  7. Tugrul_512bit

    Tugrul_512bit Guest

    Screenshots please, to verify. I already added the score, but if some 7970 guy wants proof, you can post a screenshot to stop an AMD vs. Nvidia war.
     
  8. ricardonuno1980

    ricardonuno1980 Banned

    Ok. New OCL 1.2 driver for NV?
     
    Last edited: Jun 10, 2013
  9. Tugrul_512bit

    Tugrul_512bit Guest

    Also, RAM frequency and channel count (2-4) are important for the first (25600-particle) test. Thank you.
     
  10. Tugrul_512bit

    Tugrul_512bit Guest

    Actually, the program uses your computer's OpenCL drivers. It ships with a 1.2-capable JOCL module but uses only the simplest methods, which exist even in version 1.0. Device fission for the CPU may be a pain. The GPU should be okay. You can try it; if you have any corrupted OpenCL DLLs, it could give errors.

    Please don't forget to post your RAM frequencies for the first test. This is important because you may get low results with slow memory. WINDOWS.BAT gives the Java virtual machine 2GB of RAM. If you just run the .jar file, you don't get that memory area, so it runs slower.
     

  11. ricardonuno1980

    ricardonuno1980 Banned

    Ok.

    GTX 480 (320.14Beta) got 2168 pts at 25600 running Win 7 x64 SP1 with CPU i5-2500K@4.5GHz.
     
  12. Tugrul_512bit

    Tugrul_512bit Guest

    @---TK---: A Titan from another forum got 8518 points at 1140MHz, and that makes it a challenge for your SLI even with this light-load test.

    @ricardonuno1980: Your score makes sense, since an HD5850 gets similar scores in another forum. Don't worry. Adding it to the list.
     
    Last edited: Jun 10, 2013
  13. yasamoka

    yasamoka Ancient Guru

    This is written by you? Awesome stuff dude.
     
  14. ricardonuno1980

    ricardonuno1980 Banned

    Sorry, GTX 480 ran stock-reference (=701 MHz core; 1848MHz mem). ;)
     
  15. Tugrul_512bit

    Tugrul_512bit Guest

    Yes, thanks. Actually, all I did was connect JOCL with jMonkeyEngine and package everything as a self-running applet inside a jar. It was hard work. Without jMonkeyEngine and JOCL, it would have been harder than hard.

    This benchmark will become a collision of two planets with shiny surfaces that break into asteroids and then form one big planet.
     

  16. thatguy91

    thatguy91 Guest

    What's the point of a benchmark if you are going to multiply the Nvidia CUDA scores by 1.2x because they are lower?
     
  17. Tugrul_512bit

    Tugrul_512bit Guest

    Because the point of Nvidia is CUDA, which exposes more low-level hardware features and is therefore somewhat faster. The multiplier comes from http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf, where you can see that the CUDA/OpenCL performance ratio of an Nvidia card approaches 1.2 as you increase the total number of work threads (a heavier load).

    Note: I am an HD7870 + FX8150 user, not a fanboy, and absolutely not an Nvidia fan. Maybe a programmer experienced in both CUDA and OpenCL can tell me if I am wrong. The HD7870 (Pitcairn) easily passes the GTX660 Ti in OpenCL. In CUDA, the GTX benefits. If you have CUDA, one can ask "what's the point of using OpenCL?". To counter that, I just multiplied by 1.2, based on the old but only clue I found, the PDF above.
     
    Last edited: Jun 11, 2013
  18. TwoPlusTwo

    TwoPlusTwo Guest

    Messages:
    261
    Likes Received:
    0
    GPU:
    2x GTX Titan SLI
    Cool benchmark! :D

    Unfortunately I cannot submit a score because the program always locked up on me before it finished.
     
  19. Tugrul_512bit

    Tugrul_512bit Guest

    Do you see exploding particles all around? Does it draw anything red? What are the temperatures? Your driver version? Was the driver update properly done? Too much overclock?


    Unfortunately, my timezone tells me that I need to sleep. :bang: :bang: :bang:
    When I wake up, I will refresh the scores table. Good night.
     
    Last edited: Jun 11, 2013
  20. TwoPlusTwo

    TwoPlusTwo Guest

    Yeah, it starts working fine, I can see the big bang and the particles start lumping around via gravity, can move and zoom in/out, etc.

    Temps are great, I'm on the 320.18 drivers.

    OC's are rock-solid and I tried running it at stock, still locked up.

    Always seems to lock up with ~10 seconds or so left.
     
