Hi,
This is the second version of the newidle balance optimization[1].
It aims to reduce the cost of newidle balance which is found to
occupy noticeable CPU cycles on some high-core count systems.
For example, when running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
To mitigate this cost, the optimization is inspired by the question
raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?
There are two proposals in this patch set.
The first one is ILB_UTIL. It limits the scan depth in
update_sd_lb_stats(). The scan depth is based on the overall
utilization of the sched domain: the higher the utilization, the
fewer groups update_sd_lb_stats() scans, and vice versa.
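
To make the intent concrete, here is a minimal sketch of the ILB_UTIL
idea (illustrative only; the struct, fields and helper below are
invented for this description and are not the patch code):

/*
 * Illustrative sketch only -- not the actual patch.  The idea: the
 * periodic load balance saves a utilization/capacity snapshot, and
 * newidle balance turns it into a scan depth for update_sd_lb_stats(),
 * so a busier domain gets a shallower group scan.
 */
struct ilb_snapshot {
        unsigned long total_util;       /* saved by periodic load balance */
        unsigned long total_capacity;
};

static inline int ilb_util_scan_depth(const struct ilb_snapshot *snap,
                                      int nr_groups)
{
        int depth;

        if (!snap->total_capacity)
                return nr_groups;       /* no snapshot yet: scan all groups */

        /* scale the depth down linearly with utilization, keep at least 1 */
        depth = nr_groups - (int)(nr_groups * snap->total_util /
                                  snap->total_capacity);
        return depth > 0 ? depth : 1;
}

update_sd_lb_stats() would then stop after visiting that many groups.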
The second one is ILB_FAST. Instead of always finding the busiest
group in update_sd_lb_stats(), it lowers the bar and settles for a
relatively busy group. ILB_FAST takes effect when the local group
is group_has_spare, because when many CPUs run newidle_balance()
concurrently, the sched groups should have a high idle percentage.
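
Again as an illustrative sketch only (the struct and helper are made
up, not the patch code), the ILB_FAST idea amounts to an early-exit
test applied while walking the groups:

/*
 * Illustrative sketch only -- not the actual patch.  While
 * update_sd_lb_stats() walks the sched groups, stop as soon as a group
 * looks "busy enough" to pull from, instead of insisting on the
 * busiest one.  The bar used here (clearly more runnable tasks than
 * the local group and no idle CPU left) is just one possible choice.
 * Returns 1 to stop the scan, 0 to keep scanning.
 */
struct grp_stats {
        unsigned int sum_nr_running;
        unsigned int idle_cpus;
};

static inline int ilb_fast_stop_scan(const struct grp_stats *local,
                                     const struct grp_stats *candidate)
{
        /* only lower the bar when the local group has spare capacity */
        if (!local->idle_cpus)
                return 0;

        return candidate->sum_nr_running > local->sum_nr_running + 1 &&
               !candidate->idle_cpus;
}

When this returns true, the scan stops and that group is treated as the
pull candidate, rather than continuing to look for the busiest group.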
Comparing ILB_UTIL and ILB_FAST: the former inhibits the sched
group scan when the system is busy, while the latter settles for a
reasonably busy group when the system is not busy. They are therefore
complementary to each other and work independently.
patch 1/7 and patch 2/7 are preparation for ILB_UTIL.
patch 3/7 is a preparation for both ILB_UTIL and ILB_FAST.
patch 4/7 is part of ILB_UTIL. It calculates the scan depth
of sched groups which will be used by
update_sd_lb_stats(). The depth is calculated by the
periodic load balance.
patch 5/7 introduces the ILB_UTIL.
patch 6/7 introduces the ILB_FAST.
patch 7/7 is a debug patch to print more sched statistics, inspired
by Prateek's test report.
In the previous version, Prateek found some regressions[2].
These were probably caused by:
1. Cross-NUMA access to sched_domain_shared. So this version removes
   sched_domain_shared for the NUMA domains.
2. newidle balance did not try as hard to scan for the busiest
   group. This version still keeps the linear scan function. If
   the regression is still there, we can try to leverage the result
   of SIS_UTIL, because SIS_UTIL uses a quadratic function that
   could help scan the domain harder when the system is not
   overloaded (a toy illustration of that shape follows below).
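
To illustrate the shape (arbitrary constants; this is not the SIS_UTIL
code, just a toy function): a quadratic cut-off keeps the scan depth
close to the full group count at low and moderate utilization and only
shrinks it quickly near saturation, unlike the linear mapping used by
ILB_UTIL above.

/*
 * Toy illustration of a quadratic depth curve (arbitrary constants,
 * not the SIS_UTIL code): depth ~ nr_groups * (1 - (util_pct/100)^2).
 * Compared with a linear cut-off, this keeps the scan nearly complete
 * at low/moderate utilization and shrinks it fast near saturation.
 */
static inline int quadratic_scan_depth(int nr_groups, unsigned int util_pct)
{
        unsigned int x = util_pct > 100 ? 100 : util_pct;
        int depth = nr_groups - (int)((long)nr_groups * x * x / 10000);

        return depth > 0 ? depth : 1;
}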
Changes since the previous version:
1. For all levels except NUMA, connect a sched_domain_shared
   instance. This makes the newidle balance optimization more
   generic, not only for the LLC domain. (Peter, Gautham)
2. Introduce ILB_FAST, which terminates the sched group scan
   early if it finds a suitable group rather than the busiest
   one (Tim).
Peter has suggested reusing the statistics of the sched group
if multiple CPUs trigger newidle balance concurrently[3]. I created
a prototype[4] based on this direction. According to the test, there
are some regressions. The bottlenecks are a spin_trylock() and the
memory load from the 'cached' shared region. It is still under
investigation, so I did not include that change in this patch set.
Any comments would be appreciated.
[1] https://lore.kernel.org/lkml/cover.1686554037.git.yu.c.chen@intel.com/
[2] https://lore.kernel.org/lkml/7e31ad34-ce2c-f64b-a852-f88f8a5749a6@amd.com/
[3] https://lore.kernel.org/lkml/20230621111721.GA2053369@hirez.programming.kicks-ass.net/
[4] https://github.com/chen-yu-surf/linux/commit/a6b33df883b972d6aaab5fceeddb11c34cc59059.patch
Chen Yu (7):
sched/topology: Assign sd_share for all non NUMA sched domains
sched/topology: Introduce nr_groups in sched_domain to indicate the
number of groups
sched/fair: Save a snapshot of sched domain total_load and
total_capacity
sched/fair: Calculate the scan depth for idle balance based on system
utilization
sched/fair: Adjust the busiest group scanning depth in idle load
balance
sched/fair: Pull from a relatively busy group during newidle balance
sched/stats: Track the scan number of groups during load balance
include/linux/sched/topology.h | 5 ++
kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 4 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 14 ++--
5 files changed, 135 insertions(+), 7 deletions(-)
--
2.25.1
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.

Hi there, what's the status of this series?

I'm seeing this same symptom of burning cycles in update_sd_lb_stats()
on an AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is
about 50% idle and update_sd_lb_stats() sits as the first entry in
perf top with about 3.62% of CPU cycles.

Thanks,
Matt
Hi Matt,

On 2024-07-16 at 15:16:45 +0100, Matt Fleming wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
>
> Hi there, what's the status of this series?
>

Thanks for your interest in this patch series. The RFC patch series was
sent out to seek directions and to see if this issue is worth fixing.
Since you have encountered this issue as well and it seems to be a
generic issue, I'll rebase this patch series, retest it on top of the
latest kernel and then send out a new version.

> I'm seeing this same symptom of burning cycles in update_sd_lb_stats()
> on an AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is
> about 50% idle and update_sd_lb_stats() sits as the first entry in
> perf top with about 3.62% of CPU cycles.

May I know what benchmark (test scenario) you are testing? I'd like to
replicate this test on my machine as well.

thanks,
Chenyu

>
> Thanks,
> Matt
On Wed, Jul 17, 2024 at 4:53 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> Thanks for your interest in this patch series. The RFC patch series was
> sent out to seek directions and to see if this issue is worth fixing.
> Since you have encountered this issue as well and it seems to be a
> generic issue, I'll rebase this patch series, retest it on top of the
> latest kernel and then send out a new version.

Great, thanks!

> > I'm seeing this same symptom of burning cycles in update_sd_lb_stats()
> > on an AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is
> > about 50% idle and update_sd_lb_stats() sits as the first entry in
> > perf top with about 3.62% of CPU cycles.
>
> May I know what benchmark (test scenario) you are testing? I'd like to
> replicate this test on my machine as well.

Actually this isn't a benchmark -- this was observed on Cloudflare's
production machines. I'm happy to try out your series and report back.

Thanks,
Matt
On 7/27/23 8:03 PM, Chen Yu wrote:
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.
>
[...]
>
> Any comments would be appreciated.

Hi Chen. It is a nice patch series in the effort to reduce the newidle
cost. This gives the idea of making use of calculations done in
load_balance to be used among different idle types.

It was interesting to see how this would work on Power Systems. The
reason being we have a large core count and the LLC size is small, i.e.
at the small core level (llc_weight=4). This would mean quite frequent
access to sd_share at different levels, which would reside on the
first_cpu of the sched domain, which might result in more cache-misses.
But perf stats didn't show the same.

Another concern is the larger number of sched groups at DIE level, which
might take a hit if the balancing takes longer for the system to
stabilize.

tl;dr

Tested with micro-benchmarks on a system with 96 cores with SMT=8, a
total of 768 CPUs. There is some amount of regression with hackbench and
schbench. I haven't looked into why; any pointers to check would be
helpful.

Did a test with a more real-case workload that we have called daytrader.
It is a DB workload which gives total transactions done per second. That
doesn't show any regression.

It's true that all benchmarks will not be happy. Maybe in the below
cases, newidle may not be that costly. Do you have any specific
benchmark to be tried?

-----------------------------------------------------------------------------------------------------
                         6.5.rc4     6.5.rc4 + PATCH_V2     gain
Daytrader:               55049       55378                  0.59%

-----------------------------------------------------------------------------------------------------
hackbench(50 iterations):        6.5.rc4   6.5.rc4 + PATCH_V2(gain%)

Process 10 groups        : 0.19, 0.19(0.00)
Process 20 groups        : 0.23, 0.24(-4.35)
Process 30 groups        : 0.28, 0.30(-7.14)
Process 40 groups        : 0.38, 0.40(-5.26)
Process 50 groups        : 0.43, 0.45(-4.65)
Process 60 groups        : 0.51, 0.51(0.00)
thread 10 Time           : 0.21, 0.22(-4.76)
thread 20 Time           : 0.27, 0.32(-18.52)
Process(Pipe) 10 Time    : 0.17, 0.17(0.00)
Process(Pipe) 20 Time    : 0.23, 0.23(0.00)
Process(Pipe) 30 Time    : 0.28, 0.28(0.00)
Process(Pipe) 40 Time    : 0.33, 0.32(3.03)
Process(Pipe) 50 Time    : 0.38, 0.36(5.26)
Process(Pipe) 60 Time    : 0.40, 0.39(2.50)
thread(Pipe) 10 Time     : 0.14, 0.14(0.00)
thread(Pipe) 20 Time     : 0.20, 0.19(5.00)

Observation: lower is better. socket based runs show regression quite a bit,
pipe shows slight improvement.

-----------------------------------------------------------------------------------------------------
Unixbench(10 iterations):        6.5.rc4   6.5.rc4 + PATCH_V2(gain%)

1 X Execl Throughput               : 4280.15, 4398.30(2.76)
4 X Execl Throughput               : 8171.60, 8061.60(-1.35)
1 X Pipe-based Context Switching   : 172455.50, 174586.60(1.24)
4 X Pipe-based Context Switching   : 633708.35, 664659.85(4.88)
1 X Process Creation               : 6891.20, 7056.85(2.40)
4 X Process Creation               : 8826.20, 8996.25(1.93)
1 X Shell Scripts (1 concurrent)   : 9272.05, 9456.10(1.98)
4 X Shell Scripts (1 concurrent)   : 27919.60, 25319.75(-9.31)
1 X Shell Scripts (8 concurrent)   : 4462.70, 4392.75(-1.57)
4 X Shell Scripts (8 concurrent)   : 11852.30, 10820.70(-8.70)

Observation: higher is better. Results are somewhat mixed.

-----------------------------------------------------------------------------------------------------
schbench(10 iterations)          6.5.rc4   6.5.rc4 + PATCH_V2(gain%)

1 Threads
  50.0th: 8.00, 7.00(12.50)
  75.0th: 8.00, 7.60(5.00)
  90.0th: 8.80, 8.00(9.09)
  95.0th: 10.20, 8.20(19.61)
  99.0th: 13.60, 11.00(19.12)
  99.5th: 14.00, 12.80(8.57)
  99.9th: 15.80, 35.00(-121.52)
2 Threads
  50.0th: 8.40, 8.20(2.38)
  75.0th: 9.00, 8.60(4.44)
  90.0th: 10.20, 9.60(5.88)
  95.0th: 11.20, 10.20(8.93)
  99.0th: 14.40, 11.40(20.83)
  99.5th: 14.80, 12.80(13.51)
  99.9th: 17.60, 14.80(15.91)
4 Threads
  50.0th: 10.60, 10.40(1.89)
  75.0th: 12.20, 11.60(4.92)
  90.0th: 13.60, 12.60(7.35)
  95.0th: 14.40, 13.00(9.72)
  99.0th: 16.40, 15.60(4.88)
  99.5th: 16.80, 16.60(1.19)
  99.9th: 22.00, 29.00(-31.82)
8 Threads
  50.0th: 12.00, 11.80(1.67)
  75.0th: 14.40, 14.40(0.00)
  90.0th: 17.00, 18.00(-5.88)
  95.0th: 19.20, 19.80(-3.13)
  99.0th: 23.00, 24.20(-5.22)
  99.5th: 26.80, 29.20(-8.96)
  99.9th: 68.00, 97.20(-42.94)
16 Threads
  50.0th: 18.00, 18.20(-1.11)
  75.0th: 23.20, 23.60(-1.72)
  90.0th: 28.00, 27.40(2.14)
  95.0th: 31.20, 30.40(2.56)
  99.0th: 38.60, 38.20(1.04)
  99.5th: 50.60, 50.40(0.40)
  99.9th: 122.80, 108.00(12.05)
32 Threads
  50.0th: 30.00, 30.20(-0.67)
  75.0th: 42.20, 42.60(-0.95)
  90.0th: 52.60, 55.40(-5.32)
  95.0th: 58.60, 63.00(-7.51)
  99.0th: 69.60, 78.20(-12.36)
  99.5th: 79.20, 103.80(-31.06)
  99.9th: 171.80, 209.60(-22.00)

Observation: lower is better. tail latencies seem to go up. schbench also
has run to run variations.

-----------------------------------------------------------------------------------------------------
stress-ng(20 iterations)         6.5.rc4   6.5.rc4 + PATCH_V2(gain%)
( 100000 cpu-ops)

--cpu=768 Time           : 1.58, 1.53(3.16)
--cpu=384 Time           : 1.66, 1.63(1.81)
--cpu=192 Time           : 2.67, 2.77(-3.75)
--cpu=96 Time            : 3.70, 3.69(0.27)
--cpu=48 Time            : 5.73, 5.69(0.70)
--cpu=24 Time            : 7.27, 7.26(0.14)
--cpu=12 Time            : 14.25, 14.24(0.07)
--cpu=6 Time             : 28.42, 28.40(0.07)
--cpu=3 Time             : 56.81, 56.68(0.23)
--cpu=768 -util=10 Time  : 3.69, 3.70(-0.27)
--cpu=768 -util=20 Time  : 5.67, 5.70(-0.53)
--cpu=768 -util=30 Time  : 7.08, 7.12(-0.56)
--cpu=768 -util=40 Time  : 8.23, 8.27(-0.49)
--cpu=768 -util=50 Time  : 9.22, 9.26(-0.43)
--cpu=768 -util=60 Time  : 10.09, 10.15(-0.59)
--cpu=768 -util=70 Time  : 10.93, 10.98(-0.46)
--cpu=768 -util=80 Time  : 11.79, 11.79(0.00)
--cpu=768 -util=90 Time  : 12.63, 12.60(0.24)

Observation: lower is better. Almost no difference.
Hi Shrikanth,

On 2023-08-25 at 13:18:56 +0530, Shrikanth Hegde wrote:
>
> On 7/27/23 8:03 PM, Chen Yu wrote:
>
> Hi Chen. It is a nice patch series in the effort to reduce the newidle
> cost. This gives the idea of making use of calculations done in
> load_balance to be used among different idle types.
>

Thanks for taking a look at this patch set.

> It was interesting to see how this would work on Power Systems. The
> reason being we have a large core count and the LLC size is small, i.e.
> at the small core level (llc_weight=4). This would mean quite frequent
> access to sd_share at different levels, which would reside on the
> first_cpu of the sched domain, which might result in more cache-misses.
> But perf stats didn't show the same.
>

Do you mean 1 large domain (Die domain?) has many LLC sched domains as its
children, and accessing the large domain's sd_share field would cross
different LLCs and the latency is high? Yes, this could be a problem and it
depends on the hardware how fast different LLCs snoop the data with each
other.
On the other hand, the periodic load balance is the writer of sd_share,
and the interval is based on the cpu_weight of that domain. So the write
might be less frequent on large domains, and most access to sd_share would
be the read issued by newidle balance, which is less costly.

> Another concern is the larger number of sched groups at DIE level, which
> might take a hit if the balancing takes longer for the system to
> stabilize.

Do you mean, if newidle balance does not pull tasks hard enough, the
imbalance between groups would last longer? Yes, Prateek has mentioned this
point, ILB_UTIL has this problem, and I'll think more about it. We want to
find a way in newidle balance to do less scanning, but still pull tasks as
hard as before.

>
> tl;dr
>
> Tested with micro-benchmarks on a system with 96 cores with SMT=8, a
> total of 768 CPUs.

May I know the sched domain hierarchy of this platform?
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/*
cat /proc/schedstat | grep cpu0 -A 4    (4 domains?)

> There is some amount of regression with hackbench and schbench. I haven't
> looked into why; any pointers to check would be helpful.

May I know what is the command to run hackbench and schbench below? For
example the fd number, package size and the loop number of hackbench, and
the number of message threads and worker threads of schbench, etc. I assume
you are using the old schbench? The latest schbench tracks other metrics
besides tail latency.

> Did a test with a more real-case workload that we have called daytrader.
> It is a DB workload which gives total transactions done per second. That
> doesn't show any regression.
>
> It's true that all benchmarks will not be happy. Maybe in the below
> cases, newidle may not be that costly. Do you have any specific
> benchmark to be tried?
>

Previously I tested schbench/hackbench/netperf/tbench/sqlite, and I'm also
planning to try an OLTP.

[...]

I'll try to run the same tests of hackbench/schbench on my machine, to
see if I could find any clue for the regression.

thanks,
Chenyu
On 8/30/23 8:56 PM, Chen Yu wrote:
> Hi Shrikanth,
Hi Chen, sorry for the slightly delayed response.
Note: the patch as-is fails to apply cleanly, as BASE_SLICE is not a
feature in the latest tip/sched/core.
>
> On 2023-08-25 at 13:18:56 +0530, Shrikanth Hegde wrote:
>>
>> On 7/27/23 8:03 PM, Chen Yu wrote:
>>
>> Hi Chen. It is a nice patch series in effort to reduce the newidle cost.
>> This gives the idea of making use of calculations done in load_balance to used
>> among different idle types.
>>
>
> Thanks for taking a look at this patch set.
>
>> It was interesting to see how this would work on Power Systems. The reason being we have
>> large core count and LLC size is small. i.e at small core level (llc_weight=4). This would
>> mean quite frequest access sd_share at different level which would reside on the first_cpu of
>> the sched domain, which might result in more cache-misses. But perf stats didnt show the same.
>>
>
> Do you mean 1 large domain(Die domain?) has many LLC sched domains as its children,
> and accessing the large domain's sd_share field would cross different LLCs and the
> latency is high? Yes, this could be a problem and it depends on the hardware that how
> fast differet LLCs snoop the data with each other.
Yes
> On the other hand, the periodic load balance is the writer of sd_share, and the
> interval is based on the cpu_weight of that domain. So the write might be less frequent
> on large domains, and most access to sd_share would be the read issued by newidle balance,
> which is less costly.
>
>> Another concern on more number of sched groups at DIE level, which might take a hit if
>> the balancing takes longer for the system to stabilize.
>
> Do you mean, if newidle balance does not pull tasks hard enough, the imbalance between groups
> would last longer? Yes, Prateek has mentioned this point, the ILB_UTIL has this problem, I'll
> think more about it. We want to find a way in newidle balance to do less scan, but still pulls
> tasks as hard as before.
>
>>
>> tl;dr
>>
>> Tested with micro-benchmarks on system with 96 Cores with SMT=8. Total of 768 CPU's. There is some amount
>
> May I know the sched domain hierarchy of this platform?
> grep . /sys/kernel/debug/sched/domains/cpu0/domain*/*
> cat /proc/schedstat | grep cpu0 -A 4 (4 domains?)
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain4/name:NUMA
domain-0: span=0,2,4,6 level=SMT
groups: 0:{ span=0 }, 2:{ span=2 }, 4:{ span=4 }, 6:{ span=6 }
domain-1: span=0-7,24-39,48-55,72-87 level=MC
groups: 0:{ span=0,2,4,6 cap=4096 }, 1:{ span=1,3,5,7 cap=4096 }, 24:{ span=24,26,28,30 cap=4096 }, 25:{ span=25,27,29,31 cap=4096 }, 32:{ span=32,34,36,38 cap=4096 }, 33:{ span=33,35,37,39 cap=4096 }, 48:{ span=48,50,52,54 cap=4096 }, 49:{ span=49,51,53,55 cap=4096 }, 72:{ span=72,74,76,78 cap=4096 }, 73:{ span=73,75,77,79 cap=4096 }, 80:{ span=80,82,84,86 cap=4096 }, 81:{ span=81,83,85,87 cap=4096 }
domain-2: span=0-95 level=DIE
groups: 0:{ span=0-7,24-39,48-55,72-87 cap=49152 }, 8:{ span=8-23,40-47,56-71,88-95 cap=49152 }
domain-3: span=0-191 level=NUMA
groups: 0:{ span=0-95 cap=98304 }, 96:{ span=96-191 cap=98304 }
domain-4: span=0-767 level=NUMA
groups: 0:{ span=0-191 cap=196608 }, 192:{ span=192-383 cap=196608 }, 384:{ span=384-575 cap=196608 }, 576:{ span=576-767 cap=196608 }
Our LLC is at the SMT domain. In an MC domain there can be up to 16 such LLCs.
That is for Dedicated Logical Partitions (LPAR).
On Shared Processor Logical Partitions (SPLPAR), it is observed that the MC
domain doesn't make sense. After the change proposed below, the DIE domain
would have SMT as its groups, and the max number of LLCs in a DIE can go up
to 30:
https://lore.kernel.org/lkml/20230830105244.62477-5-srikar@linux.vnet.ibm.com/#r
>
>> of regression with hackbench and schbench. haven't looked into why. Any pointers to check would be helpful.
>
> May I know what is the command to run hackbench and schbench below? For example
> the fd number, package size and the loop number of hackbench, and
> number of message thread and worker thread of schbench, etc. I assume
> you are using the old schbench? As the latest schbench would track other metrics
> besides tail latency.
>
>
Yes, old schbench, and hackbench is from LTP.
I can try to test the next version.
>> Did a test with more real case workload that we have called daytrader. Its is DB workload which gives total
>> transcations done per second. That doesn't show any regression.
>>
>> Its true that all benchmarks will not be happy.
>> Maybe in below cases, newidle may not be that costly. Do you have any specific benchmark to be tried?
>>
>
> Previously I tested schbench/hackbench/netperf/tbench/sqlite, and also I'm planning
> to try an OLTP.
>
>> -----------------------------------------------------------------------------------------------------
>> 6.5.rc4 6.5.rc4 + PATCH_V2 gain
>> Daytrader: 55049 55378 0.59%
>>
>> -----------------------------------------------------------------------------------------------------
>> hackbench(50 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>>
>> Process 10 groups : 0.19, 0.19(0.00)
>> Process 20 groups : 0.23, 0.24(-4.35)
>> Process 30 groups : 0.28, 0.30(-7.14)
>> Process 40 groups : 0.38, 0.40(-5.26)
>> Process 50 groups : 0.43, 0.45(-4.65)
>> Process 60 groups : 0.51, 0.51(0.00)
>> thread 10 Time : 0.21, 0.22(-4.76)
>> thread 20 Time : 0.27, 0.32(-18.52)
>> Process(Pipe) 10 Time : 0.17, 0.17(0.00)
>> Process(Pipe) 20 Time : 0.23, 0.23(0.00)
>> Process(Pipe) 30 Time : 0.28, 0.28(0.00)
>> Process(Pipe) 40 Time : 0.33, 0.32(3.03)
>> Process(Pipe) 50 Time : 0.38, 0.36(5.26)
>> Process(Pipe) 60 Time : 0.40, 0.39(2.50)
>> thread(Pipe) 10 Time : 0.14, 0.14(0.00)
>> thread(Pipe) 20 Time : 0.20, 0.19(5.00)
>>
>> Observation: lower is better. socket based runs show regression quite a bit,
>> pipe shows slight improvement.
>>
>>
>> -----------------------------------------------------------------------------------------------------
>> Unixbench(10 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>> 1 X Execl Throughput : 4280.15, 4398.30(2.76)
>> 4 X Execl Throughput : 8171.60, 8061.60(-1.35)
>> 1 X Pipe-based Context Switching : 172455.50, 174586.60(1.24)
>> 4 X Pipe-based Context Switching : 633708.35, 664659.85(4.88)
>> 1 X Process Creation : 6891.20, 7056.85(2.40)
>> 4 X Process Creation : 8826.20, 8996.25(1.93)
>> 1 X Shell Scripts (1 concurrent) : 9272.05, 9456.10(1.98)
>> 4 X Shell Scripts (1 concurrent) : 27919.60, 25319.75(-9.31)
>> 1 X Shell Scripts (8 concurrent) : 4462.70, 4392.75(-1.57)
>> 4 X Shell Scripts (8 concurrent) : 11852.30, 10820.70(-8.70)
>>
>> Observation: higher is better. Results are somewhat mixed.
>>
>>
>> -----------------------------------------------------------------------------------------------------
>> schbench(10 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>> 1 Threads
>> 50.0th: 8.00, 7.00(12.50)
>> 75.0th: 8.00, 7.60(5.00)
>> 90.0th: 8.80, 8.00(9.09)
>> 95.0th: 10.20, 8.20(19.61)
>> 99.0th: 13.60, 11.00(19.12)
>> 99.5th: 14.00, 12.80(8.57)
>> 99.9th: 15.80, 35.00(-121.52)
>> 2 Threads
>> 50.0th: 8.40, 8.20(2.38)
>> 75.0th: 9.00, 8.60(4.44)
>> 90.0th: 10.20, 9.60(5.88)
>> 95.0th: 11.20, 10.20(8.93)
>> 99.0th: 14.40, 11.40(20.83)
>> 99.5th: 14.80, 12.80(13.51)
>> 99.9th: 17.60, 14.80(15.91)
>> 4 Threads
>> 50.0th: 10.60, 10.40(1.89)
>> 75.0th: 12.20, 11.60(4.92)
>> 90.0th: 13.60, 12.60(7.35)
>> 95.0th: 14.40, 13.00(9.72)
>> 99.0th: 16.40, 15.60(4.88)
>> 99.5th: 16.80, 16.60(1.19)
>> 99.9th: 22.00, 29.00(-31.82)
>> 8 Threads
>> 50.0th: 12.00, 11.80(1.67)
>> 75.0th: 14.40, 14.40(0.00)
>> 90.0th: 17.00, 18.00(-5.88)
>> 95.0th: 19.20, 19.80(-3.13)
>> 99.0th: 23.00, 24.20(-5.22)
>> 99.5th: 26.80, 29.20(-8.96)
>> 99.9th: 68.00, 97.20(-42.94)
>> 16 Threads
>> 50.0th: 18.00, 18.20(-1.11)
>> 75.0th: 23.20, 23.60(-1.72)
>> 90.0th: 28.00, 27.40(2.14)
>> 95.0th: 31.20, 30.40(2.56)
>> 99.0th: 38.60, 38.20(1.04)
>> 99.5th: 50.60, 50.40(0.40)
>> 99.9th: 122.80, 108.00(12.05)
>> 32 Threads
>> 50.0th: 30.00, 30.20(-0.67)
>> 75.0th: 42.20, 42.60(-0.95)
>> 90.0th: 52.60, 55.40(-5.32)
>> 95.0th: 58.60, 63.00(-7.51)
>> 99.0th: 69.60, 78.20(-12.36)
>> 99.5th: 79.20, 103.80(-31.06)
>> 99.9th: 171.80, 209.60(-22.00)
>>
>> Observation: lower is better. tail latencies seem to go up. schbench also has run to run variations.
>>
>> -----------------------------------------------------------------------------------------------------
>> stress-ng(20 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>> ( 100000 cpu-ops)
>>
>> --cpu=768 Time : 1.58, 1.53(3.16)
>> --cpu=384 Time : 1.66, 1.63(1.81)
>> --cpu=192 Time : 2.67, 2.77(-3.75)
>> --cpu=96 Time : 3.70, 3.69(0.27)
>> --cpu=48 Time : 5.73, 5.69(0.70)
>> --cpu=24 Time : 7.27, 7.26(0.14)
>> --cpu=12 Time : 14.25, 14.24(0.07)
>> --cpu=6 Time : 28.42, 28.40(0.07)
>> --cpu=3 Time : 56.81, 56.68(0.23)
>> --cpu=768 -util=10 Time : 3.69, 3.70(-0.27)
>> --cpu=768 -util=20 Time : 5.67, 5.70(-0.53)
>> --cpu=768 -util=30 Time : 7.08, 7.12(-0.56)
>> --cpu=768 -util=40 Time : 8.23, 8.27(-0.49)
>> --cpu=768 -util=50 Time : 9.22, 9.26(-0.43)
>> --cpu=768 -util=60 Time : 10.09, 10.15(-0.59)
>> --cpu=768 -util=70 Time : 10.93, 10.98(-0.46)
>> --cpu=768 -util=80 Time : 11.79, 11.79(0.00)
>> --cpu=768 -util=90 Time : 12.63, 12.60(0.24)
>>
>>
>> Observation: lower is better. Almost no difference.
>
> I'll try to run the same tests of hackbench/schbench on my machine, to
> see if I could find any clue for the regression.
>
>
> thanks,
> Chenyu
On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.
>
> For example, when running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
>
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> To mitigate this cost, the optimization is inspired by the question
> raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?

So doesn't this basically boil down to recognising that new-idle might
not be the same as regular load-balancing -- we need any task, fast,
rather than we need to make equal load.

David's shared runqueue patches did the same, they re-imagined this very
path.

Now, David's thing went side-ways because of some regression that wasn't
further investigated.

But it occurs to me this might be the same thing that Prateek chased
down here:

  https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com

Hmm ?

Supposing that is indeed the case, I think it makes more sense to
proceed with that approach. That is, completely redo the sub-numa new
idle balance.
Hi Peter,

On 2024-07-17 at 14:17:45 +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
> >
> > For example, when running sqlite on Intel Sapphire Rapids, which has
> > 2 x 56C/112T = 224 CPUs:
> >
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > To mitigate this cost, the optimization is inspired by the question
> > raised by Tim:
> > Do we always have to find the busiest group and pull from it? Would
> > a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>

Yes, exactly.

> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
>   https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
>

Thanks for the patch link. I took a look, and if I understand correctly,
Prateek's patches fix three issues related to TIF_POLLING_NRFLAG. The
following two issues might cause aggressive newidle balance:

1. The normal idle load balance does not have a chance to be triggered
   when exiting the idle loop. Since the normal idle load balance does
   not work, we have to count on the newidle balance to do more work.
2. The newly idle load balance is incorrectly triggered when exiting
   from idle due to send_ipi(), even when there is no task about to
   sleep.

Issue 2 will increase the frequency of invoking the newly idle balance,
but issue 1 will not. Issue 1 mainly impacts the success ratio of each
newidle balance, but might not increase the frequency of triggering a
newidle balance - that should mainly depend on the task runtime
duration. Please correct me if I'm wrong.

All three of Prateek's patches fix existing newidle balance issues; I'll
apply his patch set and re-test.

> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>

I did not quite follow this. Prateek's patch set does not redo the
sub-NUMA new idle balance, I suppose? Or do you mean further work based
on Prateek's patch set?

thanks,
Chenyu
Hello Peter,
On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
>> Hi,
>>
>> This is the second version of the newidle balance optimization[1].
>> It aims to reduce the cost of newidle balance which is found to
>> occupy noticeable CPU cycles on some high-core count systems.
>>
>> For example, when running sqlite on Intel Sapphire Rapids, which has
>> 2 x 56C/112T = 224 CPUs:
>>
>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>
>> To mitigate this cost, the optimization is inspired by the question
>> raised by Tim:
>> Do we always have to find the busiest group and pull from it? Would
>> a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>
> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
In case of SHARED_RUNQ, I suspected the frequent wakeup-sleep pattern of
hackbench at lower utilization raised some contention somewhere, but a
perf profile with IBS showed nothing specific and I left it there.
I revisited this again today and found this interesting data for perf
bench sched messaging running with one group pinned to one LLC domain on
my system:
- NO_SHARED_RUNQ
$ time ./perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 3.972 [sec] (*)
real 0m3.985s
user 0m6.203s (*)
sys 1m20.087s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
Overhead Command Shared Object Symbol
+ 51.43% sched-messaging libc.so.6 [.] read
+ 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
+ 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
0.03% sched-messaging libc.so.6 [.] __poll
0.00% sched-messaging perf [.] sender
- SHARED_RUNQ
$ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 48.171 [sec] (*)
real 0m48.186s
user 0m5.409s (*)
sys 0m41.185s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
Overhead Command Shared Object Symbol
+ 47.49% sched-messaging libc.so.6 [.] read
+ 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
+ 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
+ 1.08% snapd snapd [.] 0x000000000006caa3
+ 1.02% cron libc.so.6 [.] clock_nanosleep@GLIBC_2.2.5
+ 0.86% containerd containerd [.] runtime.futex.abi0
+ 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
(*) The runtime has bloated massively but both "user" and "sys" time
are down and the "offcpu-time" count goes up with SHARED_RUNQ.
There seems to be a corner case that is not accounted for, but I'm not
sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
since that is what I initially tested the series on, but I can see the
same behavior when I rebased the changes on the current v6.10-rc5 based
tip:sched/core.
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
Without the nohz_csd_func fix and the SM_IDLE fast-path (Patches 1 and 2),
the scheduler currently depends on newidle_balance() to pull tasks to
an idle CPU. Vincent had pointed this out on the first RFC, which
tried to do what SM_IDLE does but for the fair class alone:
https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
It shouldn't be too frequent, but it could be the reason why
newidle_balance() might jump up in traces, especially if it decides to
scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
perhaps the PKG/NUMA in the case Chenyu was investigating initially).
>
> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>
>
--
Thanks and Regards,
Prateek
Hi Prateek,

On 2024-07-18 at 14:58:30 +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
[...]
> Without the nohz_csd_func fix and the SM_IDLE fast-path (Patches 1 and 2),
> the scheduler currently depends on newidle_balance() to pull tasks to
> an idle CPU. Vincent had pointed this out on the first RFC, which
> tried to do what SM_IDLE does but for the fair class alone:
>
> https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
>
> It shouldn't be too frequent, but it could be the reason why
> newidle_balance() might jump up in traces, especially if it decides to
> scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
> perhaps the PKG/NUMA in the case Chenyu was investigating initially).
>

Yes, this is my understanding too, I'll apply your patches and have a
re-test.

thanks,
Chenyu