Impacts of loading high-core-count nodes

NB: Much of the discussion in this blog post relates very closely to the discussions that took place in SDCC RT ticket 37663.

______

Running many similar root4star jobs simultaneously on any high-core-count node in the SDCC farm leads to slower (longer) CPU times per job, i.e. less efficient use of each core. These efficiency losses are outweighed by the benefit of using more cores, so maximal throughput per node still comes with all (virtual) cores loaded. However, at the time of this posting, the efficiency losses are not understood.

These tests are on machines with 48 real cores, which provide 96 virtual cores via hyperthreading (HT).
  • Closed symbols are with HT; open symbols are without HT
  • Circles are 64-bit; triangles are 32-bit
  • Red and orange are on an Alma9 machine; blues and greens are on an SL7 machine
  • Darker colors are in SL7 containers; lighter colors are directly on SL7 (I do not have root4star running directly on Alma9)



Inverting the slow-down factor and multiplying by the number of jobs (i.e. # jobs divided by slow-down factor) gives the effective number of cores on the machine (e.g. 45 means the node is effectively x45 faster than a single, isolated core).
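
As a quick worked illustration of that metric (a minimal sketch; the job counts and slow-down factors below are placeholders for the arithmetic, not the measured values):

    # Effective core count: N similar jobs, each running `slowdown` times
    # slower than an isolated single job, deliver the throughput of this
    # many ideal (isolated) cores.
    def effective_cores(n_jobs, slowdown):
        return n_jobs / slowdown

    print(effective_cores(48, 1.3))   # 48 jobs, each x1.3 slower -> ~36.9 effective cores
    print(effective_cores(96, 2.0))   # 96 jobs, each x2.0 slower -> 48.0 effective cores

Even with a substantially larger per-job slow-down, the fully loaded node can still come out ahead on total throughput, which is the point made at the top of this post.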






Costin argues that this is the expected behavior of the CPU clocks on Intel's Xeon Gold CPUs, colorfully documented here, along with further links that can be followed therein. The machines at SDCC hold 2 of these 24-core Xeon Gold CPUs, giving them 48 real cores (before hyperthreading). Our root4star jobs contain no SIMD instructions, so the CPUs should be in their "Normal" mode, for which the base frequency is 2.1 GHz and the maximum "turbo" frequency is 3.7 GHz, but the CPUs can go into a low frequency mode that slows the clock further if the electrical and thermal loads necessitate it.
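
As an aside, those advertised limits can be read directly on a node through the cpufreq sysfs interface. Here is a small sketch of my own (it assumes a Linux node exposing cpufreq; the base_frequency file is only provided by some drivers, e.g. intel_pstate, so it may be absent):

    # Read the advertised frequency limits from the cpufreq sysfs interface
    # for core 0 (the kernel reports these values in kHz).
    from pathlib import Path

    cpufreq = Path("/sys/devices/system/cpu/cpu0/cpufreq")

    for name in ("base_frequency", "cpuinfo_min_freq", "cpuinfo_max_freq"):
        path = cpufreq / name
        if path.exists():
            print(f"{name}: {int(path.read_text()) / 1e6:.2f} GHz")
        else:
            print(f"{name}: (not available on this node)")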

Costin also referred me to the Grafana plots that allow essentially a real-time examination of the varying clock frequencies of individual nodes. He also cautioned that he was not confident the displayed values accurately reflect the true clock speeds: the frequencies can vary significantly in time, are only sampled at some interval for Grafana (it looks like every 15 seconds), and come from a tool that may not itself give an accurate number.
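
As a simple cross-check of those sampled numbers, one can poll the kernel's own per-core frequency estimate at a finer interval than Grafana's ~15 seconds. This is a minimal sketch of my own, not the tool SDCC uses, and note that the "cpu MHz" values in /proc/cpuinfo are themselves instantaneous estimates:

    # Poll the per-core "cpu MHz" field from /proc/cpuinfo once per second
    # and print the spread across all (virtual) cores.
    import re
    import time

    def core_freqs_mhz():
        with open("/proc/cpuinfo") as f:
            return [float(m) for m in re.findall(r"^cpu MHz\s*:\s*([\d.]+)",
                                                 f.read(), re.M)]

    for _ in range(60):                      # about one minute of sampling
        freqs = core_freqs_mhz()
        print(f"{len(freqs)} cores:  min {min(freqs):6.0f}  "
              f"mean {sum(freqs) / len(freqs):6.0f}  max {max(freqs):6.0f} MHz")
        time.sleep(1)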

That said... I went ahead and studied the behavior of the Grafana information when loading the nodes with root4star jobs.

First, it should be noted that there is a lot of volatility during the initial part of the jobs, covering the initializations and overheads of the first event. After that, a somewhat steady state appears for the remainder of the jobs. These two plots show the transition into the steady state for the cases of 48 (left) and 96 (right) jobs on the node, each from a 5-minute span of time:



Here now are 5-minute-span steady-state plots for various loads:

[Steady-state Grafana clock-frequency plots for loads of 0 (idle), 4, 16, 32, 48, and 96 jobs]


Some observations on the behavior:
  • The frequencies are all over the place at idle.
  • When N jobs start, it isn't just N cores dropping in frequency (perhaps because the actual processing of any one job may hop from one core to another, even if run as embarrassingly parallel?)
  • At 48 or fewer jobs, each core may from time to time go back up to the maximum 3.7 GHz, but at 96 jobs all cores are held in a narrow range of frequencies.
  • No frequencies below 2.2 GHz appear outside of idle (that's about 60% of 3.7 GHz, which translates to an instantaneous "slow-down factor" of ~x1.7), so these plots (for what they're worth) don't show signs of low frequency mode.

Grafana also presents the data in tables, with mean, last, max, and min values for each core over the 5-minute intervals used here (the tops of these tables can be seen at the bottom of the screen-grabs shown above for the transition to steady state). Using the mean values from those tables, I can plot the behavior of the clock frequencies versus job load. The following three plots all show the same data, first in a 3d lego plot (without the idle data), and then in 2d "zcol" and "profs" plots (including the idle data):




There's a notable spread in the mean clock frequencies, even under steady state, for each degree of loading, but we're presumably sampling a variety of event reconstruction processes on a variety of events (even if they're all similar physics events). The 96-job data even shows a clear bi-modality (the two modes seem to mostly fall in large blocks of consecutively numbered cores, but it isn't worth spending time here on understanding that any better). At idle, most cores are sitting at the maximum 3.7 GHz.
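
For reference, here's a minimal PyROOT sketch of how plots like those above could be assembled from the table values (it assumes a ROOT/PyROOT environment; the CSV file name and column names are my own assumptions about an export of the Grafana table, not its actual format):

    # Fill a 2d histogram and a profile of mean core frequency vs. job load
    # from a (hypothetical) CSV export of the Grafana table, then draw the
    # lego, color-map, and profile views.
    import csv
    import ROOT

    loads = [0, 4, 16, 32, 48, 96]   # the load points studied above

    h2 = ROOT.TH2D("freq2d", "Mean core frequency vs load;load point;GHz",
                   len(loads), -0.5, len(loads) - 0.5, 40, 2.0, 4.0)
    prof = ROOT.TProfile("freqprof", "Mean core frequency vs load;load point;<GHz>",
                         len(loads), -0.5, len(loads) - 0.5)

    with open("grafana_core_freqs.csv") as f:      # hypothetical export file
        for row in csv.DictReader(f):              # assumed columns: jobs, mean_ghz
            idx = loads.index(int(row["jobs"]))
            h2.Fill(idx, float(row["mean_ghz"]))
            prof.Fill(idx, float(row["mean_ghz"]))

    c = ROOT.TCanvas("c", "clock frequencies", 1200, 400)
    c.Divide(3, 1)
    c.cd(1); h2.Draw("LEGO2")   # 3d lego view
    c.cd(2); h2.Draw("COLZ")    # 2d color ("zcol") view
    c.cd(3); prof.Draw()        # profile of the mean vs load
    c.SaveAs("core_freq_vs_load.png")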

We could interpret the reduction of the average of the mean clock frequencies, relative to the unloaded clock, as the contribution of the slower clocks to the node's "slow-down" factor, just as discussed in the linked information about the Xeon Gold processor. In that light, the clock frequencies account for about a x1.3 slow-down at 48 jobs, and just shy of x1.5 at 96 jobs. These values aren't quite as large as the slow-down factors seen from the root4star job CPU times earlier on this page, but it is clear that the lowering of clock frequencies is a significant contributor, and it may perhaps explain all of it if we're not capturing the full extent of the lowering that is really happening on the chips.
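
To make that arithmetic explicit (the mean loaded-frequency values below are rough illustrative numbers consistent with the factors just quoted, not the measured means from the tables):

    # Clock-frequency contribution to the slow-down factor: the ratio of the
    # unloaded (turbo) clock to the mean clock under load.
    F_TURBO = 3.7  # GHz, the maximum turbo frequency quoted above

    for n_jobs, mean_ghz in [(48, 2.85), (96, 2.50)]:   # illustrative means
        print(f"{n_jobs:2d} jobs: clock slow-down ~x{F_TURBO / mean_ghz:.2f}")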

-Gene