SIMD instruction sets
Generalities
SIMD (Single Instructions, Multiple Data) are extensions supported as Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data. This is provided for achieving data parallelism (parallel operations based on vectors) by applying the same operation in parallel on a number of data items packed into a 64, 128 or 256-bit vector. SIMD alos support "Scalar" operations on integer or floating-point values.
Possible SIMD instruction sets includes: MMX, EMMX, SSE, SSE2, SSE3, SSSE3, SSE4 (4, 4.1, 4.2 and a), XOP, FME4, CVT16, AVX, 3DNow* (! and !2), Altivec, VIS...
MMX supported only integer operations while SSE (Streaming SIMDS Extensions) included single-precision floating points, SSE2 double precisions. AVX (Advanced Vector Extensions) declared as the future of Intel in 2008, available in 2010 aimed to increase the data path from 128 bits to 256 bits and 3-operand instructions. CPUs delivered in 2011 by Intel supports AVX (the initial goal was a support in the Intel Sandy Bridge processor).
At the time of this blog, SSE2 and SSE3 are the most commonly available SIMD instructions set with a high potential toward a move to AVX.
A page describing most of the instructions sets (and their gory details) is available here.
Potential issues using SIMD instructions
Problem statement
A code compiled with AVX support (for example) cannot run on a hardware which does not have this SIMD instruction set support. Adding SIMD instructions hence immediately creates non-portable binaries including long term backward compatibility (old code run on new hardware MUST have the new hardware support the old instruction sets).
In fact, instruction sets CANNOT be mixed (this is the second issue). If (let's say) ROOT is assembled using SSE and SSE2 instructions, the main code cannot use AVX without extreme care - to get an explanation on this (from Intel), see for example this general paper on performance penalty when this is done. A short explanation follows
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The paper also includes possible mitigation to avoid the extreme run-time penalty.
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
- For gcc compiler however, the manpages specify
-mvzeroupper - This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
Integration of SIMD instructions hence has to be thought of VERY carefully as it impacts both cross CPU architecture portability and long term support. It is believed that the old SIMD instruction sets would be supported by future hardware though (TBC). In STAR so far, we have relied on portability across CPU hardware of the "same" family (all Intel and AMD "look alike", code loaded over AFS for SL/RH and compiler revision only, 32/64 bits introduced fully only in 2011).
STAR specific
Introducing SIMD instructions represents a new paradigm and yet another dimension to take care off. In principle, STAR could include yet another level to our OS/compiler separation (for now, STAR have all binaries sorted into OS/compiler specific sub-paths such as sl53_gcc432 or sl57_gcc451 and even sl53_x8684_gcc432 for the 64 bits version) but all Makefile or make-systems should be self-consistent (which is a bit harder) in keeping the SAME set of SIMD instructions + the plethora of instructions may increase the support dimension to an unmanageable scale.
One possibility and temptation would be to reduce the support to a "minimal common denominator" (for now SSE/SSE2) and leave any site wanting to have additional tuned speed up on their own. This may NOT ensure however that external/third party libraries would be available in the same SIMD instruction set. Consider for example root4star (the main entry point for our code to date, but this is illustrative - same issue with ay .so): it used to depend on libmysql which, so far, could be kept as a system package. Imagine MySQL decides to support AVX instructions and our code supports SSE2 -> kaboum! The implication is that we would need to compile all external third party libraries and add them to OPTSTAR (the third party library base path equivalent to /usr/local/). So again, a possible inflation of supported package looming on the horizon.
No conclusions is brought forth yet.
Compiler and package support
Relevant GCC compilation options
- -msse
- -mno-sse
- -msse2
...
To generate SSE/SSE2 instructions automatically from floating-point code (as opposed to 387 instructions), see -mfpmath=sse.
GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.
- -mfpmath=unit
- `sse'
- Use scalar floating-point instructions present in the SSE instruction set. This instruction set is supported by Pentium III and newer chips, and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips.
[...]
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.
-mvzeroupper
(see above)
Vector Class (Vc) libraries
The Vector Class (Vc) project and espeiclaly the Vc library is a collection of SIMD vector classes with existing implementations for SSE, LRBni or a scalar fallback. In essence, the Vc headers allows the end-user to program using a standard set of vector-based instructions (via Vc-types) while the back-end implementation may be specific to a particular SIMD instruction set. This provide long term sustainability of the programing interface ... but the ABI may still break (as explained above).
In practice, Vc suppors the following SIMD sets: Scalar, AVX, SSE4a, SSHE4.2, SSSE3, SSE2
Valid GCC versions compatible with SIMD instructions sets
The other complication is that not all version of GCC supports SIMD instructions and/or are actually working with SIMD instructions. Integration would hence need to either check the SIMD-instruction-set specific tests of "a" package (Vc for example has a test suite).
The Vc project has a nice page with all recommendations for compilers. Example of pitfalls: 4.5.1 used in STAR for the test of the CA package is NOT recomended and 4.6.3 is the recommended version. Integration in STAR should consider this revision - which raise the issue that the support may not be that simple.
Forward compatibility link toward other compilers may not always work; for example, mixing codes assembled with another compiler known to be buggy SIMD-wise but using SIMD instructions may not lead to any good. The compatibility matrix would need to be checked carefully. Hopefully, those problems would go away as gcc evolves (though we are far from an SL distribution coming with any recent revision).
As per 2012/11/10: recommended compilers to use are GCC 4.7.1, 4.7.2, or 4.6.3 (would be our choice for SL6).
Survey
Survey of STAR CPUs capabilities across its sites:
- BNL - Mostly Intel, Xeon
CPUtype Intel(R) Xeon(R) CPU E5335 @ 2GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr lahf_lm
CPUtype Intel(R) Xeon(R) CPU E5440 @ 2GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
CPUtype Intel(R) Xeon(R) CPU X5550 @ 2GHz CPUtype Intel(R) Xeon(R) CPU X5560 @ 2GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm const ant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
CPUtype Intel(R) Xeon(R) CPU X5660 @ 2GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
- TACC (Austin)
AMD Opteron(tm) Processor 8354
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
-
Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
-
AMD Opteron(tm) Processor 6174
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
Quad-Core AMD Opteron(tm) Processor 2387
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
- KISTI
Intel(R) Xeon(R) CPU X3320 @ 2.50GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
- PDSF
Dual-Core AMD Opteron(tm) Processor 2220
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacyIntel(R) Xeon(R) CPU E5520 @ 2.27GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lmIntel(R) Xeon(R) CPU L5640 @ 2.27GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lmIntel(R) Xeon(R) CPU E5410 @ 2.33GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriorityQuad-Core AMD Opteron(tm) Processor 2350
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lockIntel(R) Xeon(R) CPU X5650 @ 2.67GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
- WSU
The following were reported - this site has some hardware not supporting sse2 instructions.
Pentium III (Katmai)
Support: fpu tsc msr pae cx8 apic mtrr cmov pat mmx fxsr sse up
Intel(R) Xeon(TM) CPU 2.66GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
Intel(R) XEON(TM) CPU 2.00GHz
Support: fpu tsc msr pae cx8 apic mtrr cmov pat clflush acpi mmx fxsr sse sse2 ss ht up
Intel(R) XEON(TM) CPU 2.00GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
Intel(R) Xeon(TM) CPU 2.66GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
Support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni mon
- UIC
Unknown - not reported
- jeromel's blog
- Login or register to post comments