kernel task scheduler and autogrouping

There was a thought to set the nice value on HTCondor jobs on the Online Linux Pool so that interactive users and ceph-osd could get plenty of CPU time even when a machine was loaded with CPU-intensive grid jobs submitted through the condor-ce. HTCondor has an option specifically for setting the nice value on condor jobs, e.g. 'JOB_RENICE_INCREMENT = 15', which is trivial to add to the condor configuration and does indeed set the nice value of processes started under condor_exec.

However, that turned out to be too good to be true. Despite several condor jobs (root4star) running with low-priority nice values, interactive processes seeking CPU (e.g. 'stress -c 6') on a test machine (onl01) were unable to claim more CPU, even when run at the highest priority (e.g. 'nice -n "-19" stress -c 6'). In fact, they were consistently treated as the lower-priority tasks, forced to share only the CPUs not already taken by the root4star tasks condor had started.

This turns out to be a "feature" of the Linux kernel process scheduler called autogrouping, introduced in kernel 2.6.38 (2011) to improve the user experience in a desktop environment (and indeed there are many testimonials to it doing just that). However, it works against us in this case, and it is certainly present in the kernels used in the OLP (4.x). Each condor_exec process tree gets its own autogroup, as does each ceph-osd and each interactive ssh login. With autogrouping enabled, nice values are ONLY used relative to other processes within the same group. So a ceph-osd with a nice value of 0 (or even -19) has no scheduling preference over a root4star process started by condor with a nice value of 19. Each task group instead has its own nice value (always zero by default as far as I've seen).

A simple example: consider a single-CPU machine with three CPU-intensive processes, two at nice 0 and one at the very low priority of nice 19. If each process is in its own task group, then each gets about 33.3% of the CPU time. If instead the nice-19 job has its own task group while the two nice-0 jobs share a group, then the nice-19 job actually gets ~50% of the CPU and the two nice-0 jobs each get ~25% - quite counter to traditional expectations. And top is of no help here: it displays the nice values, but without knowing the task group breakdown, the allocations shown in top can seem absurd.
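
Under CFS each nice value maps to a weight, and CPU time divides first among groups, then among tasks within a group. A rough sketch of the arithmetic (weights taken from the kernel's nice-to-weight table; the ungrouped case is added for contrast):

```shell
# CFS weights from the kernel table: nice 0 -> 1024, nice 19 -> 15.
awk 'BEGIN {
  w0 = 1024; w19 = 15
  # Case 1: each task alone in its own group; three groups at the
  # default group nice of 0 share the CPU equally.
  printf("separate groups: %.1f%% per task\n", 100.0 / 3)
  # Case 2: the nice-19 task alone in group A, the two nice-0 tasks in group B.
  # Both groups sit at nice 0, so they split 50/50; the two equal-weight
  # tasks in B then split the B half between them.
  gA = 1024; gB = 1024
  printf("nice-19 task: %.1f%%, each nice-0 task: %.1f%%\n", 100*gA/(gA+gB), 100*(gB/(gA+gB))*w0/(w0+w0))
  # For contrast, with autogrouping disabled all three weights compete directly.
  tot = w0 + w0 + w19
  printf("no groups: nice-0 tasks %.1f%% each, nice-19 task %.1f%%\n", 100*w0/tot, 100*w19/tot)
}'
```

The last line shows why disabling autogrouping restores the traditional expectation: without groups the nice-19 task is squeezed to under 1% of the CPU.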

Autogrouping can be disabled at boot by adding "noautogroup" to the kernel line in grub.conf, or autogrouping can be disabled dynamically:

echo 0 > /proc/sys/kernel/sched_autogroup_enabled  #disables autogrouping
echo 1 > /proc/sys/kernel/sched_autogroup_enabled  #enables autogrouping

Disabling autogrouping on a running system, however, seems to have even more unexpected consequences: the scheduler appears to retain the previously assigned groups and continues trying to be "fair" among those groups rather than among individual processes. I believe disabling autogrouping only stops the kernel from creating new groups for each new tty. [I'm not certain of this, but it would explain the test results I got while dynamically turning the feature on and off.]

Meanwhile however, the individual group assignment and group niceness for each process can be seen (and modified) through the /proc/ filesystem in /proc/$PID/autogroup, such as this example of the ceph-osd processes on onl01, showing that they are assigned separate groups with autogrouping enabled at boot:

[root@onl01 ~]# cat /proc/sys/kernel/sched_autogroup_enabled
1
[root@onl01 ~]# cat /proc/{6632,6837,7037,7387}/autogroup
/autogroup-44 nice 0
/autogroup-45 nice 0
/autogroup-46 nice 0
/autogroup-47 nice 0

Here is an example of modifying the group nice value from 0 to 1 for a sample process, 3080013.  And yes, it does affect the parent process, 3004740, as well:

[root@onl01 ~]# cat /proc/3080013/autogroup
/autogroup-19874 nice 0
[root@onl01 ~]# grep 19874 /proc/*/autogroup
/proc/3004740/autogroup:/autogroup-19874 nice 0
/proc/3080013/autogroup:/autogroup-19874 nice 0
[root@onl01 ~]# echo "1" > /proc/3080013/autogroup
[root@onl01 ~]# grep 19874 /proc/*/autogroup
/proc/3004740/autogroup:/autogroup-19874 nice 1
/proc/3080013/autogroup:/autogroup-19874 nice 1
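
The same write can be wrapped in a small helper - a sketch, where set_autogroup_nice is a made-up name; the write requires root plus a CONFIG_SCHED_AUTOGROUP kernel, so the function just reports when it cannot apply the change:

```shell
#!/bin/sh
# set_autogroup_nice PID NICE: renice the *whole* autogroup containing PID,
# since a write to /proc/PID/autogroup applies to every task in that group.
set_autogroup_nice() {
    pid=$1; nice=$2
    if echo "$nice" > "/proc/$pid/autogroup" 2>/dev/null; then
        cat "/proc/$pid/autogroup"
    else
        echo "could not write /proc/$pid/autogroup (need root and CONFIG_SCHED_AUTOGROUP)" >&2
    fi
}

# Harmless demonstration: reassert the default group nice of 0 on our own shell.
set_autogroup_nice $$ 0
```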

So, I propose to set the noautogroup option in the boot configuration on all of the Online Linux Pool, though I see little point in modifying it dynamically on the live machines.

There could be cases where disabling autogrouping is actually harmful, or at least perceived as unfair. Imagine again a single-core computer, this time with two users logged in. The first user runs 20 CPU-intensive processes simultaneously, while the second user only wants to run 1 CPU-intensive task. With autogrouping, the second user might get close to half of the CPU power (somewhat fair), but without autogrouping, the second user might get only 1/21 of the CPU power - the first user really ruins things for everyone if autogrouping is not enabled.
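
The numbers behind that comparison, assuming all tasks run at the default nice 0 and thus equal CFS weights:

```shell
awk 'BEGIN {
  # With autogrouping: the two login sessions form two groups that split the
  # CPU 50/50; user 1 then divides one half among 20 tasks.
  printf("autogroup:    user1 tasks %.1f%% each, user2 task %.1f%%\n", 50.0/20, 50.0)
  # Without autogrouping: 21 equal-weight tasks share one runqueue.
  printf("no autogroup: every task %.1f%%, so user2 gets only %.1f%%\n", 100.0/21, 100.0/21)
}'
```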