This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY
and moves them to sysctl.

Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
significant performance degradation in multiple database-oriented
workloads. This degradation manifests in all kernel versions using EEVDF,
across multiple Linux distributions, hardware architectures (x86_64,
aarch64, amd64), and CPU generations.

For example, running mysql+hammerdb results in a 12-17% throughput
reduction and 12-18% latency increase compared to kernel 6.5 (using
default scheduler settings everywhere). The magnitude of this performance
impact is comparable to the average performance difference of a CPU
generation over its predecessor.

Testing combinations of available scheduler features showed that the
largest improvement (short of disabling all EEVDF features) came from
disabling both PLACE_LAG and RUN_TO_PARITY:

Kernel   | default  | NO_PLACE_LAG and
aarch64  | config   | NO_RUN_TO_PARITY
---------+----------+-----------------
6.5      | baseline |  N/A
6.6      | -13.2%   | -6.8%
6.7      | -13.1%   | -6.0%
6.8      | -12.3%   | -6.5%
6.9      | -12.7%   | -6.9%
6.10     | -13.5%   | -5.8%
6.11     | -12.6%   | -5.8%
6.12-rc2 | -12.2%   | -8.9%
---------+----------+-----------------

Kernel   | default  | NO_PLACE_LAG and
x86_64   | config   | NO_RUN_TO_PARITY
---------+----------+-----------------
6.5      | baseline |  N/A
6.6      | -16.8%   | -10.8%
6.7      | -16.4%   |  -9.9%
6.8      | -17.2%   |  -9.5%
6.9      | -17.4%   |  -9.7%
6.10     | -16.5%   |  -9.0%
6.11     | -15.0%   |  -8.5%
6.12-rc2 | -12.7%   | -10.9%
---------+----------+-----------------

While the long term approach is debugging and fixing the scheduler
behavior, algorithm changes to address performance issues of this nature
are specialized (and likely prolonged or open-ended) research. Until a
change is identified which fixes the performance degradation, in the
interest of a better out-of-the-box performance: (1) disable these
features by default, and (2) expose these values in sysctl instead of
debugfs, so they can be more easily persisted across reboots.

Cristian Prundeanu (2):
  sched: Disable PLACE_LAG and RUN_TO_PARITY
  sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl

 include/linux/sched/sysctl.h |  8 ++++++++
 kernel/sched/core.c          | 13 +++++++++++++
 kernel/sched/fair.c          |  5 +++--
 kernel/sched/features.h      | 10 ----------
 kernel/sysctl.c              | 20 ++++++++++++++++++++
 5 files changed, 44 insertions(+), 12 deletions(-)

-- 
2.40.1
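To relate the two result columns of the cover letter to each other, the
share of the regression that disabling both features wins back can be
computed directly from the table values. The following is only a sketch:
the helper name recovered_share is made up for illustration, and treating
the percentages as plain differences against the 6.5 baseline is an
assumption about how the numbers were measured.

```python
def recovered_share(default_pct: float, tuned_pct: float) -> float:
    """Fraction of the regression (vs. the 6.5 baseline) that is recovered.

    Both arguments are throughput deltas in percent relative to 6.5,
    e.g. -13.2 for the default config and -6.8 with both features disabled.
    """
    if default_pct >= 0:
        raise ValueError("default_pct must be a regression (negative)")
    return (tuned_pct - default_pct) / -default_pct


# (default config, NO_PLACE_LAG + NO_RUN_TO_PARITY), copied from the tables
ARM = {"6.6": (-13.2, -6.8), "6.12-rc2": (-12.2, -8.9)}
X86_64 = {"6.6": (-16.8, -10.8), "6.12-rc2": (-12.7, -10.9)}


def report() -> None:
    """Print the recovered share for the first and last kernel in each table."""
    for arch, rows in (("arm", ARM), ("x86_64", X86_64)):
        for kernel, (default, tuned) in rows.items():
            share = recovered_share(default, tuned)
            print(f"{arch} {kernel}: {share:.0%} of the regression recovered")


if __name__ == "__main__":
    report()
```

With these inputs the recovered share is roughly 48% (Arm) and 36%
(x86_64) on 6.6, and roughly 27% and 14% on 6.12-rc2, so in every row a
part of the regression remains after disabling the two features.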
Here are more results with recent 6.12 code, and also using SCHED_BATCH.
The control tests were run anew on Ubuntu 22.04 with the current pre-built
kernels 6.5 (baseline) and 6.8 (regression out of the box).

When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
using SCHED_BATCH.

Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
         | config   | NO_RUN_TO_PARITY |             | version
---------+----------+------------------+-------------+---------
6.8      | -15.3%   |                  |             | 8.0.30
6.12-rc7 | -11.4%   |  -9.2%           | -11.6%      | 8.0.30
         |          |                  |             |
6.8      | -18.1%   |                  |             | 8.4.2
6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
---------+----------+------------------+-------------+---------

Confidence intervals for all tests are smaller than +/- 0.5%.

I expect to have the repro package ready by the end of the week. Thank you
for your collective patience and efforts to confirm these results.


On 2024-11-01, Peter Zijlstra wrote:

>> (At the risk of stating the obvious, using SCHED_BATCH only to get back to
>> the default CFS performance is still only a workaround,
>
> It is not really -- it is impossible to schedule all the various
> workloads without them telling us what they really like. The quest is to
> find interfaces that make sense and are implementable. But fundamentally
> tasks will have to start telling us what they need. We've long since ran
> out of crystal balls.

Completely agree that the best performance is obtained when the tasks are
individually tuned to the scheduler and explicitly set running parameters.
This isn't different from before.

But shouldn't our gold standard for default performance be CFS? There is a
significant regression out of the box when using EEVDF; how is seeking
additional tuning just to recover the lost performance not a workaround?

(Not to mention that this additional tuning means shifting the burden on
many users who may not be familiar enough with scheduler functionality.
We're essentially asking everyone to spend considerable effort to maintain
status quo from kernel 6.5.)


On 2024-11-14, Joseph Salisbury wrote:

> This is a confirmation that we are also seeing a 9% performance
> regression with the TPCC benchmark after v6.6-rc1. We narrowed down the
> regression was caused due to commit:
> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
>
> This regression was reported via this thread:
> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
>
> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
> performance loss. We do not yet know what effect NO_PLACE_LAG will have
> on other benchmarks, but it indeed helps TPCC.

Thank you for confirming the regression. I've been monitoring performance
on the v6.12-rcX tags since this thread started, and the results have been
largely constant.

I've also tested other benchmarks to verify whether (1) the regression
exists and (2) the patch proposed in this thread negatively affects them.
On postgresql and wordpress/nginx there is a regression which is improved
when applying the patch; on mongo and mariadb no regression manifested, and
the patch did not make their performance worse.


On 2024-11-19, Dietmar Eggemann wrote:

> #cat /etc/systemd/system/mysql.service
>
> [Service]
> CPUSchedulingPolicy=batch
> ExecStart=/usr/local/mysql/bin/mysqld_safe

This is the approach I used as well to get the results above.

> My hunch is that this is due to the 'connection' threads (1 per virtual
> user) running in SCHED_BATCH. I yet have to confirm this by only
> changing the 'connection' tasks to SCHED_BATCH.

Did you have a chance to run with this scenario?
Hello Cristian,

On 11/25/2024 5:05 PM, Cristian Prundeanu wrote:
> Here are more results with recent 6.12 code, and also using SCHED_BATCH.
> The control tests were run anew on Ubuntu 22.04 with the current pre-built
> kernels 6.5 (baseline) and 6.8 (regression out of the box).
>
> When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
> Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
> using SCHED_BATCH.
>
> Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
>          | config   | NO_RUN_TO_PARITY |             | version
> ---------+----------+------------------+-------------+---------
> 6.8      | -15.3%   |                  |             | 8.0.30
> 6.12-rc7 | -11.4%   |  -9.2%           | -11.6%      | 8.0.30
>          |          |                  |             |
> 6.8      | -18.1%   |                  |             | 8.4.2
> 6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
> ---------+----------+------------------+-------------+---------
>
> Confidence intervals for all tests are smaller than +/- 0.5%.
>
> I expect to have the repro package ready by the end of the week. Thank you
> for your collective patience and efforts to confirm these results.

Thank you! In the meantime, there is a new enhancement to perf-tool being
proposed to use the data from /proc/schedstat to profile workloads and
spot any obvious changes in the scheduling behavior at
https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/

It applies cleanly on tip:sched/core at tag "sched-core-2024-11-18".
Would it be possible to use the perf-tool built there to collect the
scheduling stats for MySQL benchmark runs on both v6.5 and v6.8 and share
the output of "perf sched stats diff" and the two perf.data files
recorded?

It would help narrow down the regression if this can be linked to a
system-wide behavior. Data from a run with NO_PLACE_LAG and
NO_RUN_TO_PARITY can also help look at metrics that are helping improve
the performance compared to the vanilla v6.8 case.

The proposed perf-tools changes are arch agnostic and should work on any
system as long as it has /proc/schedstat with version 15 and above.

>
> [..snip..]
>

-- 
Thanks and Regards,
Prateek
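Whether a given test machine meets the schedstat version requirement
mentioned above can be checked before building the proposed perf tooling.
A minimal sketch, assuming only that the first line of /proc/schedstat has
the form "version N"; the function names are hypothetical and not part of
perf or the kernel:

```python
from pathlib import Path


def schedstat_version(text: str) -> int:
    """Return the version number from the first line of /proc/schedstat."""
    lines = text.splitlines()
    header = lines[0].split() if lines else []
    if len(header) != 2 or header[0] != "version":
        raise ValueError(f"unexpected schedstat header: {header!r}")
    return int(header[1])


def meets_minimum(text: str, minimum: int = 15) -> bool:
    """True if the schedstat version is at least the one named in the thread."""
    return schedstat_version(text) >= minimum


def main() -> None:
    path = Path("/proc/schedstat")
    if not path.exists():
        # Kernels built without schedstats do not provide this file.
        print("/proc/schedstat not available on this system")
        return
    content = path.read_text()
    print("schedstat version:", schedstat_version(content))
    print("version >= 15:", meets_minimum(content))


if __name__ == "__main__":
    main()
```

The check only inspects the header line; it says nothing about whether
the perf build from the linked series is installed.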
On 10/17/24 01:19, Cristian Prundeanu wrote:
> This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY
> and moves them to sysctl.
>
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using EEVDF,
> across multiple Linux distributions, hardware architectures (x86_64,
> aarch64, amd64), and CPU generations.
>
> For example, running mysql+hammerdb results in a 12-17% throughput
> reduction and 12-18% latency increase compared to kernel 6.5 (using
> default scheduler settings everywhere). The magnitude of this performance
> impact is comparable to the average performance difference of a CPU
> generation over its predecessor.
>
> Testing combinations of available scheduler features showed that the
> largest improvement (short of disabling all EEVDF features) came from
> disabling both PLACE_LAG and RUN_TO_PARITY:
>
> Kernel   | default  | NO_PLACE_LAG and
> aarch64  | config   | NO_RUN_TO_PARITY
> ---------+----------+-----------------
> 6.5      | baseline |  N/A
> 6.6      | -13.2%   | -6.8%
> 6.7      | -13.1%   | -6.0%
> 6.8      | -12.3%   | -6.5%
> 6.9      | -12.7%   | -6.9%
> 6.10     | -13.5%   | -5.8%
> 6.11     | -12.6%   | -5.8%
> 6.12-rc2 | -12.2%   | -8.9%
> ---------+----------+-----------------
>
> Kernel   | default  | NO_PLACE_LAG and
> x86_64   | config   | NO_RUN_TO_PARITY
> ---------+----------+-----------------
> 6.5      | baseline |  N/A
> 6.6      | -16.8%   | -10.8%
> 6.7      | -16.4%   |  -9.9%
> 6.8      | -17.2%   |  -9.5%
> 6.9      | -17.4%   |  -9.7%
> 6.10     | -16.5%   |  -9.0%
> 6.11     | -15.0%   |  -8.5%
> 6.12-rc2 | -12.7%   | -10.9%
> ---------+----------+-----------------
>
> While the long term approach is debugging and fixing the scheduler
> behavior, algorithm changes to address performance issues of this nature
> are specialized (and likely prolonged or open-ended) research. Until a
> change is identified which fixes the performance degradation, in the
> interest of a better out-of-the-box performance: (1) disable these
> features by default, and (2) expose these values in sysctl instead of
> debugfs, so they can be more easily persisted across reboots.
>
> Cristian Prundeanu (2):
>   sched: Disable PLACE_LAG and RUN_TO_PARITY
>   sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
>
>  include/linux/sched/sysctl.h |  8 ++++++++
>  kernel/sched/core.c          | 13 +++++++++++++
>  kernel/sched/fair.c          |  5 +++--
>  kernel/sched/features.h      | 10 ----------
>  kernel/sysctl.c              | 20 ++++++++++++++++++++
>  5 files changed, 44 insertions(+), 12 deletions(-)
>

Hi Cristian,

This is a confirmation that we are also seeing a 9% performance
regression with the TPCC benchmark after v6.6-rc1. We narrowed down the
regression was caused due to commit:
86bfbb7ce4f6 ("sched/fair: Add lag based placement")

This regression was reported via this thread:
https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/

Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
tested with NO_PLACE_LAG and can confirm it brought back 5% of the
performance loss. We do not yet know what effect NO_PLACE_LAG will have
on other benchmarks, but it indeed helps TPCC.

Thanks for the work to move PLACE_LAG and RUN_TO_PARITY to sysctl!

Joe
On 14/11/2024 21:10, Joseph Salisbury wrote:

Hi Joseph,

> On 10/17/24 01:19, Cristian Prundeanu wrote:

[...]

> Hi Cristian,
>
> This is a confirmation that we are also seeing a 9% performance
> regression with the TPCC benchmark after v6.6-rc1. We narrowed down the
> regression was caused due to commit:
> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
>
> This regression was reported via this thread:
> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
>
> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
> performance loss. We do not yet know what effect NO_PLACE_LAG will have
> on other benchmarks, but it indeed helps TPCC.

Can you try to run mysql in SCHED_BATCH when using EEVDF?

https://lkml.kernel.org/r/20241029045749.37257-1-cpru@amazon.com

The regression went away for me when changing mysql threads to
SCHED_BATCH.

You can either start mysql with 'CPUSchedulingPolicy=batch':

#cat /etc/systemd/system/mysql.service

[Service]
CPUSchedulingPolicy=batch
ExecStart=/usr/local/mysql/bin/mysqld_safe

# systemctl daemon-reload
# systemctl restart mysql

or change the policy with chrt for all mysql threads when doing
consecutive test runs, starting from the second run (the 'connection'
threads have to exist already):

# chrt -b -a -p 0 $PID_MYSQL

# ps -p $PID_MYSQL -To comm,pid,tid,nice,class
COMMAND             PID     TID  NI CLS
mysqld             4872    4872   0 B
ib_io_ibuf         4872    4878   0 B
...
xpl_accept-3       4872    4921   0 B
connection         4872    5007   0 B
...
connection         4872    5413   0 B

My hunch is that this is due to the 'connection' threads (1 per virtual
user) running in SCHED_BATCH. I yet have to confirm this by only
changing the 'connection' tasks to SCHED_BATCH.

[..]
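The open experiment named at the end of this mail, switching only the
'connection' threads to SCHED_BATCH rather than all mysql threads, could
be scripted roughly as follows. This is a sketch and not what was run in
the thread: it assumes a Linux host, that the thread names seen in the ps
output match /proc/<pid>/task/<tid>/comm, and sufficient privileges to
change the policy of the mysqld threads. The helper names are made up.

```python
import os
import sys
from pathlib import Path


def threads_named(pid: int, comm: str, proc_root: str = "/proc") -> list[int]:
    """Return the sorted TIDs of all threads of `pid` whose comm equals `comm`."""
    tids = []
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            name = (task / "comm").read_text().strip()
        except OSError:
            continue  # the thread exited while we were scanning
        if name == comm:
            tids.append(int(task.name))
    return sorted(tids)


def set_batch(tids: list[int]) -> None:
    """Switch each thread to SCHED_BATCH; its static priority must be 0."""
    for tid in tids:
        os.sched_setscheduler(tid, os.SCHED_BATCH, os.sched_param(0))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 batch_connections.py <mysqld pid>
    targets = threads_named(int(sys.argv[1]), "connection")
    set_batch(targets)
    print(f"switched {len(targets)} 'connection' threads to SCHED_BATCH")
```

As with the chrt variant, the 'connection' threads must already exist, so
this only makes sense from the second test run onward.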
On Thu, Oct 17, 2024 at 12:19:58AM -0500, Cristian Prundeanu wrote:

> For example, running mysql+hammerdb results in a 12-17% throughput

Gautham, is this a benchmark you're running?

> Testing combinations of available scheduler features showed that the
> largest improvement (short of disabling all EEVDF features) came from
> disabling both PLACE_LAG and RUN_TO_PARITY:

How does using SCHED_BATCH compare?

> While the long term approach is debugging and fixing the scheduler
> behavior, algorithm changes to address performance issues of this nature
> are specialized (and likely prolonged or open-ended) research. Until a
> change is identified which fixes the performance degradation, in the
> interest of a better out-of-the-box performance: (1) disable these
> features by default, and (2) expose these values in sysctl instead of
> debugfs, so they can be more easily persisted across reboots.

So disabling them by default will undoubtedly affect a ton of other
workloads. And sysctl is arguably more of an ABI than debugfs, which
doesn't really sound suitable for workaround.

And I don't see how adding a line to /etc/rc.local is harder than adding
a line to /etc/sysctl.conf
On 2024-10-17, 04:11, "Peter Zijlstra" <peterz@infradead.org> wrote:

>> For example, running mysql+hammerdb results in a 12-17% throughput
> Gautham, is this a benchmark you're running?

Most of my testing for this investigation is on mysql+hammerdb because it
simplifies differentiating statistically meaningful results, but
performance impact (and improvement from disabling the two features) also
shows on workloads based on postgresql and on wordpress+nginx.

> How does using SCHED_BATCH compare?

I haven't tested with SCHED_BATCH yet, will update the thread with results
as they accumulate (each variation of the test takes multiple hours, not
counting result processing and evaluation).

Looking at man sched for SCHED_BATCH: "the scheduler will apply a small
scheduling penalty with respect to wakeup behavior, so that this thread is
mildly disfavored in scheduling decisions". Would this correctly translate
to "the thread will run more deterministically, but be scheduled less
frequently than other threads", i.e. expectedly lower performance in
exchange for less variability?

> So disabling them by default will undoubtedly affect a ton of other
> workloads.

That's very likely either way, as the testing space is near infinite, but
it seems more practical to first address the issue we already know about.

At this time, I don't have any data points to indicate a negative
impact of disabling them for popular production workloads (as opposed to
the flip case). More testing is in progress (looking at the major areas:
workloads heavy on CPU, RAM, disk, and networking); so far, the results
show no downside.

> And sysctl is arguably more of an ABI than debugfs, which
> doesn't really sound suitable for workaround.
>
> And I don't see how adding a line to /etc/rc.local is harder than adding
> a line to /etc/sysctl.conf

Adding a line is equally difficult both ways, you're right. But aren't
most distros better equipped to manage (persist, modify, automate) sysctl
parameters in a standardized manner?
Whereas rc.local seems more "individual need / edge case" oriented. For
instance: changes are done by editing the file, which is poorly scriptable
(unlike the sysctl command, which is a unified interface that reconciles
changes); the load order is also typically late in the boot stage, making
it not an ideal place for settings that affect system processes.
>> And sysctl is arguably more of an ABI than debugfs, which
>> doesn't really sound suitable for workaround.
>>
>> And I don't see how adding a line to /etc/rc.local is harder than adding
>> a line to /etc/sysctl.conf
>
> Adding a line is equally difficult both ways, you're right. But aren't
> most distros better equipped to manage (persist, modify, automate) sysctl
> parameters in a standardized manner?
> Whereas rc.local seems more "individual need / edge case" oriented. For
> instance: changes are done by editing the file, which is poorly scriptable
> (unlike the sysctl command, which is a unified interface that reconciles
> changes); the load order is also typically late in the boot stage, making
> it not an ideal place for settings that affect system processes.
>

To what Cristian mentioned I'd add that having these tunables as sysctls
will make them easier for end users to discover. Checking the output of
sysctl -a is usually one of the first steps during performance
troubleshooting, whereas checking the /sys/kernel/debug/sched/ files is
not. So it's easier for people to spot these settings as sysctls if they
notice a performance difference after upgrading the kernel.

Hazem
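For comparison with the sysctl -a habit described above, the current
debugfs state of the two features can also be read programmatically. A
small sketch, assuming the features file under /sys/kernel/debug/sched/
holds whitespace-separated feature names where a disabled feature carries
a "NO_" prefix; the function name is hypothetical, and reading debugfs
normally requires root:

```python
from pathlib import Path

FEATURES_FILE = Path("/sys/kernel/debug/sched/features")


def parse_sched_features(text: str) -> dict[str, bool]:
    """Map feature name -> enabled, from the contents of the features file.

    A disabled feature is listed with a "NO_" prefix, e.g. "NO_PLACE_LAG".
    """
    features = {}
    for token in text.split():
        if token.startswith("NO_"):
            features[token[3:]] = False
        else:
            features[token] = True
    return features


def main() -> None:
    try:
        state = parse_sched_features(FEATURES_FILE.read_text())
    except OSError as err:
        # debugfs not mounted, or not running as root
        print(f"cannot read {FEATURES_FILE}: {err}")
        return
    for name in ("PLACE_LAG", "RUN_TO_PARITY"):
        print(name, "enabled" if state.get(name) else "disabled or absent")


if __name__ == "__main__":
    main()
```

This only reports state; it does not persist anything across reboots,
which is the part of the discussion that sysctl.conf versus rc.local is
about.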
Hello Cristian,

On 10/17/2024 11:49 PM, Prundeanu, Cristian wrote:
> On 2024-10-17, 04:11, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
>>> For example, running mysql+hammerdb results in a 12-17% throughput
>> Gautham, is this a benchmark you're running?

Most of our testing used sysbench as the benchmark driver. How does
mysql+hammerdb work specifically? Are the tasks driving the requests
located on a separate server, or are they co-located with the benchmark
threads on the same server? Most of our testing uses affinity to make
sure the drivers do not run on the same CPUs as the workload threads. If
the two can run on the same CPU, then we have observed interesting
behavior with a large amount of deviation.

>
> Most of my testing for this investigation is on mysql+hammerdb because it
> simplifies differentiating statistically meaningful results, but
> performance impact (and improvement from disabling the two features) also
> shows on workloads based on postgresql and on wordpress+nginx.

Did you see any glaring changes in scheduler statistics with the
introduction of EEVDF in v6.6? EEVDF commits up till v6.9 were easy to
revert from my experience but I've not tried it on v6.12-rcX with the
EEVDF complete series. Is all the regression seen purely attributable to
EEVDF alone on the more recent kernels?

>
>> How does using SCHED_BATCH compare?
>
> I haven't tested with SCHED_BATCH yet, will update the thread with results
> as they accumulate (each variation of the test takes multiple hours, not
> counting result processing and evaluation).

Could you also test running with:

    echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

In our testing, using SCHED_BATCH prevents aggressive wakeup preemption,
and those benchmarks also showed improvements with NO_WAKEUP_PREEMPTION.

On a side note, what is the CONFIG_HZ and the preemption model on your
test kernel? (Most of my testing was with CONFIG_HZ=250, voluntary
preemption.)

>
> Looking at man sched for SCHED_BATCH: "the scheduler will apply a small
> scheduling penalty with respect to wakeup behavior, so that this thread is
> mildly disfavored in scheduling decisions". Would this correctly translate
> to "the thread will run more deterministically, but be scheduled less
> frequently than other threads", i.e. expectedly lower performance in
> exchange for less variability?
>
>> So disabling them by default will undoubtedly affect a ton of other
>> workloads.
>
> That's very likely either way, as the testing space is near infinite, but
> it seems more practical to first address the issue we already know about.

RUN_TO_PARITY was introduced when Chenyu discovered that a large
regression in blogbench reported by Intel Test Robot
(https://lore.kernel.org/all/202308101628.7af4631a-oliver.sang@intel.com/)
was the result of very aggressive wakeup preemption
(https://lore.kernel.org/all/ZNWgAeN%2FEVS%2FvOLi@chenyu5-mobl2.bbrouter/)

The data in the latter link helped root-cause the actual issue with the
algorithm that the benchmark disliked. Similar information for the
database benchmarks you are running can help narrow down the issue.

>
> At this time, I don't have any data points to indicate a negative
> impact of disabling them for popular production workloads (as opposed to
> the flip case). More testing is in progress (looking at the major areas:
> workloads heavy on CPU, RAM, disk, and networking); so far, the results
> show no downside.

Analyzing your approach, what you are essentially doing with the two
sched features is as follows:

o NO_PLACE_LAG - Without place lag, a newly enqueued entity will always
  start from the avg_vruntime point in the task timeline, i.e., it will
  always be eligible at the time of enqueue.

o NO_RUN_TO_PARITY - Do not run the current task until the vruntime
  meets its deadline after the first pick. Instead, preempt the current
  running task if it is found to be ineligible at the time of wakeup.

From what I can tell, your benchmark has a set of threads that like to
get cpu time as fast as possible. With EEVDF Complete (I would recommend
using the current tip:sched/urgent branch to test them out), setting a
more aggressive nice value to these threads should enable them to negate
the effect of RUN_TO_PARITY thanks to PREEMPT_SHORT.

As for NO_PLACE_LAG, the DELAY_DEQUEUE feature should help a task shed
any lag it has built up, and it should very likely start from the
zero-lag point unless it is a very short sleeper.

>
>> And sysctl is arguably more of an ABI than debugfs, which
>> doesn't really sound suitable for workaround.
>>
>> And I don't see how adding a line to /etc/rc.local is harder than adding
>> a line to /etc/sysctl.conf
>
> Adding a line is equally difficult both ways, you're right. But aren't
> most distros better equipped to manage (persist, modify, automate) sysctl
> parameters in a standardized manner?
> Whereas rc.local seems more "individual need / edge case" oriented. For
> instance: changes are done by editing the file, which is poorly scriptable
> (unlike the sysctl command, which is a unified interface that reconciles
> changes); the load order is also typically late in the boot stage,

Is there any reason to flip it very early into the boot? Have you seen
anything go awry with system processes during boot with EEVDF?

> making
> it not an ideal place for settings that affect system processes.
>

-- 
Thanks and Regards,
Prateek
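The NO_PLACE_LAG description in this mail can be illustrated with a toy
model. The sketch below is not the kernel's implementation: it assumes
equal task weights, takes avg_vruntime as a plain mean, omits the lag
scaling and deadline handling the real scheduler performs, and does not
model RUN_TO_PARITY at all. All names are made up for illustration. It
only shows why an entity placed at the average is eligible at enqueue,
while one that has its negative lag restored is not.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    vruntime: float


def avg_vruntime(queue: list[Entity]) -> float:
    """Mean vruntime of the queue (toy model: all weights equal)."""
    return sum(e.vruntime for e in queue) / len(queue)


def eligible(entity: Entity, queue: list[Entity]) -> bool:
    """Eligible means non-negative lag: vruntime not ahead of the average."""
    return entity.vruntime <= avg_vruntime(queue)


def enqueue(queue: list[Entity], name: str, saved_lag: float,
            place_lag: bool) -> Entity:
    """Place a waking entity relative to the current average V.

    place_lag=True:  restore the lag saved at dequeue (vruntime = V - lag).
    place_lag=False: ignore it and start at V, the zero-lag point.
    """
    v = avg_vruntime(queue)
    entity = Entity(name, v - saved_lag if place_lag else v)
    queue.append(entity)
    return entity


def demo() -> None:
    for place_lag in (True, False):
        queue = [Entity("a", 100.0), Entity("b", 110.0)]
        # The waker went to sleep owing service: its saved lag is negative.
        waker = enqueue(queue, "w", saved_lag=-6.0, place_lag=place_lag)
        label = "PLACE_LAG" if place_lag else "NO_PLACE_LAG"
        print(f"{label}: vruntime={waker.vruntime}, "
              f"eligible={eligible(waker, queue)}")


if __name__ == "__main__":
    demo()
```

In this toy setup the waker with restored negative lag lands ahead of the
average and has to wait, whereas without lag placement it starts exactly
at the average and can be picked immediately.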