SIMD instruction sets

 

Generalities

SIMD (Single Instructions, Multiple Data) are extensions supported as Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data. This is provided for achieving data parallelism (parallel operations based on vectors) by applying the same operation in parallel on a number of data items packed into a 64, 128 or 256-bit vector. SIMD alos support "Scalar" operations on integer or floating-point values.

Possible SIMD instruction sets includes: MMX, EMMX, SSE, SSE2, SSE3, SSSE3, SSE4 (4, 4.1, 4.2 and a), XOP, FME4, CVT16, AVX, 3DNow* (! and !2), Altivec, VIS...

MMX supported only integer operations while SSE (Streaming SIMDS Extensions) included single-precision floating points, SSE2 double precisions. AVX (Advanced Vector Extensions) declared as the future of Intel in 2008, available in 2010 aimed to increase the data path from 128 bits to 256 bits and 3-operand instructions. CPUs delivered in 2011 by Intel supports AVX (the initial goal was a support in the Intel Sandy Bridge processor).

At the time of this blog, SSE2 and SSE3 are the most commonly available SIMD instructions set with a high potential toward a move to AVX.

A page describing most of the instructions sets (and their gory details) is available here.

 

Potential issues using SIMD instructions

Problem statement

A code compiled with AVX support (for example) cannot run on a hardware which does not have this SIMD instruction set support. Adding SIMD instructions hence immediately creates non-portable binaries including long term backward compatibility (old code run on new hardware MUST have the new hardware support the old instruction sets).

In fact, instruction sets CANNOT be mixed (this is the second issue). If (let's say) ROOT is assembled using SSE and SSE2 instructions, the main code cannot use AVX without extreme care - to get an explanation on this (from Intel), see for example this general paper on performance penalty when this is done. A short explanation follows

When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.

The paper also includes possible mitigation to avoid the extreme run-time penalty.

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.

For gcc compiler however, the manpages specify
-mvzeroupper
This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.

Integration of SIMD instructions hence has to be thought of VERY carefully as it impacts both cross CPU architecture portability and long term support. It is believed that the old SIMD instruction sets would be supported by future hardware though (TBC). In STAR so far, we have relied on portability across CPU hardware of the "same" family (all Intel and AMD "look alike", code loaded over AFS for SL/RH and compiler revision only, 32/64 bits introduced fully only in 2011).

STAR specific

Introducing SIMD instructions represents a new paradigm and yet another dimension to take care off. In principle, STAR could include yet another level to our OS/compiler separation (for now, STAR have all binaries sorted into OS/compiler specific sub-paths such as  sl53_gcc432 or sl57_gcc451 and even sl53_x8684_gcc432 for the 64 bits version) but all Makefile or make-systems should be self-consistent (which is a bit harder) in keeping the SAME set of SIMD instructions + the plethora of instructions may increase the support dimension to an unmanageable scale.

One possibility and temptation would be to reduce the support to a "minimal common denominator" (for now SSE/SSE2) and leave any site wanting to have additional tuned speed up on their own. This may NOT ensure however that external/third party libraries would be available in the same SIMD instruction set. Consider for example root4star (the main entry point for our code to date, but this is illustrative - same issue with ay .so): it used to depend on libmysql which, so far, could be kept as a system package. Imagine MySQL decides to support AVX instructions and our code supports SSE2 -> kaboum! The implication is that we would need to compile all external third party libraries and add them to OPTSTAR (the third party library base path equivalent to /usr/local/). So again, a possible inflation of supported package looming on the horizon.

No conclusions is brought forth yet.

 

Compiler and package support

Relevant GCC compilation options

-msse
-mno-sse
-msse2

...

To generate SSE/SSE2 instructions automatically from floating-point code (as opposed to 387 instructions), see -mfpmath=sse.

GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.

 

-mfpmath=unit
        `sse'
Use scalar floating-point instructions present in the SSE instruction set. This instruction set is supported by Pentium III and newer chips, and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips.
[...]
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.

-mvzeroupper

(see above)

Vector Class (Vc) libraries

The Vector Class (Vc) project and espeiclaly the Vc library is a collection of SIMD vector classes with existing implementations for SSE, LRBni or a scalar fallback. In essence, the Vc headers allows the end-user to program using a standard set of vector-based instructions (via Vc-types) while the back-end implementation may be specific to a particular SIMD instruction set. This provide long term sustainability of the programing interface ... but the ABI may still break (as explained above).

In practice, Vc suppors the following SIMD sets: Scalar, AVX, SSE4a, SSHE4.2, SSSE3, SSE2

Valid GCC versions compatible with SIMD instructions sets

The other complication is that not all version of GCC supports SIMD instructions and/or are actually working with SIMD instructions. Integration would hence need to either check the SIMD-instruction-set specific tests of "a" package (Vc for example has a test suite).

The Vc project has a nice page with all recommendations for compilers. Example of pitfalls: 4.5.1 used in STAR for the test of the CA package is NOT recomended and 4.6.3 is the recommended version. Integration in STAR should consider this revision - which raise the issue that the support may not be that simple.

Forward compatibility link toward other compilers may not always work; for example, mixing codes assembled with another compiler known to be buggy SIMD-wise but using SIMD instructions may not lead to any good. The compatibility matrix would need to be checked carefully. Hopefully, those problems would go away as gcc evolves (though we are far from an SL distribution coming with any recent revision).

As per 2012/11/10: recommended compilers to use are GCC 4.7.1, 4.7.2, or 4.6.3 (would be our choice for SL6).

Survey

Survey of STAR CPUs capabilities across its sites:

  • BNL - Mostly Intel, Xeon
    CPUtype Intel(R) Xeon(R) CPU E5335 @ 2GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr lahf_lm
    CPUtype Intel(R) Xeon(R) CPU E5440 @ 2GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
    CPUtype Intel(R) Xeon(R) CPU X5550 @ 2GHz
    CPUtype Intel(R) Xeon(R) CPU X5560 @ 2GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm const ant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
    CPUtype Intel(R) Xeon(R) CPU X5660 @ 2GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm

     
  • TACC (Austin)
    AMD Opteron(tm) Processor 8354

    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
     

  • SUG@R and DaVINCi (Rice)

    Intel(R) Xeon(R) CPU E5440  @ 2.83GHz

    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
     

  • USNA

    AMD Opteron(tm) Processor 6174

    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw

    Quad-Core AMD Opteron(tm) Processor 2387

    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
     

  • KISTI
    Intel(R) Xeon(R) CPU X3320  @ 2.50GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
    Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm

     
  • PDSF
    Dual-Core AMD Opteron(tm) Processor 2220
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
    Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
    Intel(R) Xeon(R) CPU L5640 @ 2.27GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
    Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    Quad-Core AMD Opteron(tm) Processor 2350
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
    Intel(R) Xeon(R) CPU X5650 @ 2.67GHz 
    
    Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm

     
  • WSU
    The following were reported - this site has some hardware not supporting sse2 instructions.
    Pentium III (Katmai)
    Support: fpu tsc msr pae cx8 apic mtrr cmov pat mmx fxsr sse up
    Intel(R) Xeon(TM) CPU 2.66GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
    Intel(R) XEON(TM) CPU 2.00GHz
    Support: fpu tsc msr pae cx8 apic mtrr cmov pat clflush acpi mmx fxsr sse sse2 ss ht up
    Intel(R) XEON(TM) CPU 2.00GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
    Intel(R) Xeon(TM) CPU 2.66GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
    Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
    Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni mon

     
  • UIC
    Unknown - not reported