2013 hardware performance evaluation

General

This page will document the testing of new hardware for the 2013 purchase. For each piece of hardware, several tests will be made as follows:

Hardware

The following hardware was tested:

 

Node name              CPU type                                    OS                                                                    # of CPUs
rcas6183 (reference)   Intel(R) Xeon(R) CPU X5660 @ 2.80GHz        SL 5.3 (Boron) - x86_64 GNU/Linux -- 2.6.18-274.18.1.el5 -- #1 SMP    24
farmeval01             Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz    SL 5.3 (Boron) - x86_64 GNU/Linux -- 2.6.18-274.18.1.el5 -- #1 SMP    24

SciMark testing

One of the test suites I used is SciMark2, a benchmark suite created by NIST. SciMark2 is composed of multiple tests as follows:

  • Fast Fourier Transform - this test performs a one-dimensional forward transform of 4K complex numbers. The first phase is a bit-reversal reordering (no flops) and the second performs the actual N log(N) computational steps. This exercises complex arithmetic, shuffling, non-constant memory references and trigonometric functions.
  • Successive Over-relaxation - a Jacobi Successive Over-relaxation (SOR) test. The test exercises typical access patterns in finite difference applications, for example, solving Laplace's equation in 2D with Dirichlet boundary conditions. The algorithm is tailored to measure basic "grid averaging" memory patterns, where each A(i,j) is assigned an average weighting of its four nearest neighbors.
  • Monte-Carlo - a Monte-Carlo integration test. It approximates the value of Pi by computing the integral of the quarter circle (a minimal sketch of this kernel is shown after this list). The algorithm exercises random-number generators, synchronized function calls, and function inlining.
  • Sparse matmult - the Sparse Matrix multiplication test uses an unstructured sparse matrix stored in compressed-row format with a prescribed sparsity structure. A 1,000 x 1,000 sparse matrix with 5,000 nonzeros is used. This exercises indirect addressing and non-regular memory references.
  • Dense LU matrix - Dense LU matrix factorization computes the LU factorization of a dense 100x100 matrix using partial pivoting. Exercises linear algebra kernels (BLAS) and dense matrix operations.
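
For illustration, here is a minimal sketch of the Monte-Carlo kernel described above (this is not the actual SciMark2 code; the sample count and seed below are arbitrary choices):

    #include <cstdio>
    #include <random>

    // Estimate Pi by drawing points uniformly in the unit square and counting
    // how many fall inside the quarter circle x^2 + y^2 <= 1 (area = Pi/4).
    int main() {
        const long nSamples = 10000000;                       // arbitrary sample count
        std::mt19937_64 rng(12345);                           // fixed seed for reproducibility
        std::uniform_real_distribution<double> u(0.0, 1.0);

        long inside = 0;
        for (long i = 0; i < nSamples; ++i) {
            const double x = u(rng);
            const double y = u(rng);
            if (x * x + y * y <= 1.0) ++inside;
        }

        std::printf("Pi estimate: %f\n", 4.0 * inside / nSamples);
        return 0;
    }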

Those tests were performed twice, with 64-bit and 32-bit compiled executables (we do not expect differences in those canonical tests as they represent basic operations, but this is a good way to test stability). We produced an average of the two results and computed the improvement based on that average. The error is given as the standard deviation of the sampling distribution of at least 100 measurements (in some cases, we pushed to 150).
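In other words (assuming the quoted error is the standard error of the mean), for N measurements with sample standard deviation sigma the error scales as sigma / sqrt(N), which is why pushing from 100 to ~150 measurements tightens the error bars.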

The units in SciMark 2.0 are Mflops, so the higher the number, the better the result.

The results are as follows:

(all values in Mflops, quoted as value +/- error)

Node name     Build      Composite          FFT                SOR                MC                 SM                 LU
rcas6183      32 bits    328.02 +/- 1.39    203.85 +/- 0.74    715.87 +/- 6.25    69.42 +/- 1.21     289.72 +/- 0.57    361.22 +/- 1.11
              64 bits    321.68 +/- 1.86    200.85 +/- 2.04    699.23 +/- 5.72    69.00 +/- 1.53     283.44 +/- 3.14    355.87 +/- 5.90
              Average    324.85 +/- 1.16    202.35 +/- 1.09    707.55 +/- 4.24    69.21 +/- 0.98     286.58 +/- 1.60    358.55 +/- 3.00
farmeval01    32 bits    386.18 +/- 1.02    250.22 +/- 0.85    743.14 +/- 1.31    79.72 +/- 3.23     374.02 +/- 0.61    483.77 +/- 3.38
              64 bits    356.75 +/- 1.32    231.12 +/- 0.95    730.57 +/- 1.18    78.23 +/- 1.46     347.08 +/- 0.48    396.74 +/- 6.23
              Average    371.47 +/- 0.83    240.67 +/- 0.64    736.86 +/- 0.88    78.98 +/- 1.77     360.55 +/- 0.39    440.26 +/- 3.54
              % gain     14.35% +/- 0.08    18.94% +/- 0.15    4.14% +/- 0.03     14.11% +/- 0.52    25.81% +/- 0.17    22.79% +/- 0.37
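The % gain row is the farmeval01 average relative to the rcas6183 average; for the composite score, for example: (371.47 - 324.85) / 324.85 ≈ 14.35%.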

Findings:
  • farmeval01
    • If we believe those tests, this node would provide an average 14% improvement over a standard rcas node.
    • Some tests such as FFT and SM tend to show much better results - this would suggest better (faster) memory access on farmeval01 than on our standard nodes, as both tests have in common an aim to measure sparse or non-constant memory addressing. From the LU test, we also infer fast kernel-based (BLAS) operations.

unixBench

UnixBench is the original BYTE UNIX benchmark suite, updated and revised by many people over the years. The purpose of UnixBench is to provide a basic indicator of the performance of a Unix-like system; hence, multiple tests are used to probe various aspects of the system's performance. For our testing, the 2D and 3D graphics tests are not relevant and were therefore skipped.
The following tests were run: int, float, double, Tower of Hanoi, whetstone, dhrystone, spawn, syscall, execl, sysexec; they are explained below.

  • int: basic integer operations
  • float: basic floating point operations
  • double: basic double operations
  • hanoi: a recursion test solving the "Tower of Hanoi" problem
  • sysexec: test fork() and exec()
     
  • Dhrystone: This benchmark is used to measure and compare the performance of computers. The test focuses on string handling, as there are no floating point operations. It is heavily influenced by hardware and software design, compiler and linker options, code optimization, cache memory, wait states, and integer data types.
  • Whetstone: This test measures the speed and efficiency of floating-point operations. This test contains several modules that are meant to represent a mix of operations typically performed in scientific applications. A wide variety of C functions including sin, cos, sqrt, exp, and log are used as well as integer and floating-point math operations, array accesses, conditional branches, and procedure calls. This test measures both integer and floating-point arithmetic.
  • execl - This test measures the number of execl calls that can be performed per second. Execl is part of the exec family of functions that replaces the current process image with a new process image. It and many other similar commands are front ends for the function execve().
  • spawn: This test measures the number of times a process can fork and reap a child that immediately exits. Process creation refers to actually creating process control blocks and memory allocations for new processes, so this applies directly to memory bandwidth. Typically, this benchmark would be used to compare various implementations of operating system process creation calls.
  • syscall (System call overhead): This estimates the cost of entering and leaving the operating system kernel, i.e. the overhead for performing a system call. It consists of a simple program repeatedly calling the getpid (which returns the process id of the calling process) system call. The time to execute such calls is used to estimate the cost of entering and exiting the kernel (a minimal sketch of such a loop follows this list).
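
For illustration, here is a minimal sketch of the kind of loop the syscall-overhead test times (this is not the actual unixBench code; the iteration count and timing method are arbitrary choices):

    #include <cstdio>
    #include <chrono>
    #include <unistd.h>

    // Time a tight loop of getpid() calls and report calls per second;
    // the result is dominated by the cost of entering/leaving the kernel.
    int main() {
        const long nCalls = 5000000;                          // arbitrary iteration count
        const auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < nCalls; ++i)
            (void)getpid();                                   // the system call being measured
        const auto t1 = std::chrono::steady_clock::now();
        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.0f getpid() calls per second\n", nCalls / seconds);
        return 0;
    }

Note that depending on the libc version, getpid() may be cached in user space, which illustrates the unixBench warning below that results depend on libraries as much as on hardware.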
The Web site for unixBench warns: "Do be aware that this is a system benchmark, not a CPU, RAM or disk benchmark. The results will depend not only on your hardware, but on your operating system, libraries, and even compiler." We should, however, note that all tests used the default gcc compiler on the systems (gcc (GCC) 4.3.2 20081007 in all cases), allowing a direct comparison.

The tests were run 100 times for the final pass (the longer tests are scaled down to 1/3rd, so ~30 times), with a pre-run sampling of 10 passes (3 for the longer tests). The results fell within +/- 1% of each other on the final index score, indicating great stability.

The units are loops per second (lps), Million Whetstone Instructions Per Second (MWIPS), or a normalized value or "index score". The larger the value, the better the result.
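(As we understand it, UnixBench derives the index score for each test as the ratio of the measured result to that of a fixed baseline system scored at 10, and the final index is the geometric mean of those per-test indices.)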

The results of our tests follow:



Findings:





ROOT marks

ROOT marks are part of the ROOT test suite. This mark is heavily biased toward ROOT operations and hence may not accurately represent the performance of (for example) a pure Monte-Carlo program (the SciMark MC test would then be more accurate). However, the root4star STAR framework should perform close to the ROOT marks (beyond its phase of intense calculations).

For our results, we performed 500 tests consecutively using the 32-bit, 64-bit, and the optimized versions of the 32- and 64-bit executables. All tests were made against /tmp (as some tests create external files and we did not want the test to be influenced by NFS cross-traffic). Finally, we extracted the fit1 CPUMarks and the general ROOTMarks for comparison (the second being influenced by IO operations). We re-ran those tests 3 times to make sure (as the results will show, farmeval01 is worse than a standard rcas node).
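
As an illustration only (this is not the actual ROOT stress/fit1 code), a timed fitting loop of the kind such CPU marks are built on could be written as a ROOT macro using TBenchmark; the histogram size and number of fits below are arbitrary:

    // fitloop.C - run with: root -b -q fitloop.C
    #include "TH1F.h"
    #include "TBenchmark.h"

    void fitloop() {
        TH1F h("h", "gaussian sample", 100, -3, 3);
        h.FillRandom("gaus", 100000);            // fill with 100k gaussian deviates

        gBenchmark->Start("fitloop");
        for (int i = 0; i < 100; ++i)
            h.Fit("gaus", "Q0");                 // quiet fit, no drawing
        gBenchmark->Stop("fitloop");
        gBenchmark->Print("fitloop");            // prints real and CPU time
    }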

ROOTMarks should follow standard benchmark criteria, as their idea and design were based on the CERN units (a DEC VAX 8600 was 1 unit while a Pentium Pro 200 MHz was ~40 units) - the larger the number, the better the result.
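(Presumably the marks scale inversely with the measured CPU time of a fixed workload, i.e. a machine completing the reference workload in half the time of a reference machine scores roughly twice its marks.)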

The results are tabulated below:

(values quoted as value +/- error; gain/loss % is relative to the matching rcas6183 build, the reference)

ROOT 5.22           ROOT Marks          fit1                Gain/Loss % (ROOT Marks)    Gain/Loss % (fit1)
rcas6183 (32)       741.13 +/- 26.20    255.10 +/- 29.14    0.00% +/- 0.00%             0.00% +/- 0.00%
rcas6183 (64)       768.01 +/- 20.54    280.95 +/- 13.88    0.00% +/- 0.00%             0.00% +/- 0.00%
rcas6183 (O32)      912.23 +/- 24.54    294.39 +/- 13.02    0.00% +/- 0.00%             0.00% +/- 0.00%
farmeval01 (32)     694.54 +/- 14.44    171.28 +/- 19.46    -6.29% +/- 0.35%            -32.86% +/- 7.49%
farmeval01 (64)     702.89 +/- 14.01    178.94 +/- 22.63    -8.48% +/- 0.40%            -36.31% +/- 6.39%
farmeval01 (O32)    811.15 +/- 18.43    192.05 +/- 29.40    -11.08% +/- 0.55%           -34.76% +/- 6.86%

Findings:
  • farmeval01
    • The ROOTMarks are worse by 6%-8% compared to a standard Linux node
    • The fit1 test shows an even worse performance grading in our tests, with a drop of up to ~35%
    • Optimization has a large effect on the overall marks but does not affect the fit1 test

root4star testing (composite)

(Note: test not done yet)



IO performance testing

IO performance tests are made on specific nodes, for example, to test the speed of a normal disk configuration and of an SSD-based mounted space. While concurrency tests are made to evaluate the IO scaling performance of each storage element, it is worth noting that those numbers are relevant only if IO saturates. In other words, typical user jobs, workflows, and data access may not be able to saturate the IO at the levels tested in those measurements.

IOPerf

(Note: this test shows a dramatic drop in performance for /home2 in // (parallel) mode fwrite, which was not expected [understood: this was the SSD drive - the result is actually consistent with the IOZone test])
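
For reference, a rough sketch of the kind of buffered sequential-write (fwrite) measurement involved (this is not the IOPerf code itself; the output path, block size, and total size below are arbitrary, hypothetical choices):

    #include <cstdio>
    #include <vector>
    #include <chrono>

    // Time buffered sequential writes (fwrite) of a fixed total size and
    // report the resulting throughput in MB/s.
    int main() {
        const size_t blockSize = 1 << 20;              // 1 MB blocks (arbitrary)
        const size_t nBlocks   = 1024;                 // 1 GB total (arbitrary)
        std::vector<char> buffer(blockSize, 'x');

        std::FILE* f = std::fopen("/tmp/ioperf_test.dat", "wb");   // hypothetical path
        if (!f) { std::perror("fopen"); return 1; }

        const auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < nBlocks; ++i)
            std::fwrite(buffer.data(), 1, blockSize, f);
        std::fflush(f);                                 // flush stdio buffers
        std::fclose(f);
        const auto t1 = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        std::printf("fwrite throughput: %.1f MB/s\n",
                    (nBlocks * blockSize) / (1024.0 * 1024.0) / seconds);
        return 0;
    }

Note that without a sync such a loop largely measures writes into the filesystem cache; the actual tools (IOPerf, IOZone) also exercise concurrent (//) access, which is where the /home2 drop was observed.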

IOZone


(Note: this test needs to be redone and/or the results mined a bit more; a strange pattern appears that I do not yet understand)