Hi all,

So after much delay this is hopefully the final version of the EEVDF patches.
They've been sitting in my git tree forever it seems, and people have been
testing it and sending fixes.

I've spent the last two days testing and fixing cfs-bandwidth, and as far
as I know that was the very last issue holding it back.

These patches apply on top of queue.git sched/dl-server, which I plan on merging
in tip/sched/core once -rc1 drops.

I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.


Aside from a ton of bug fixes -- thanks all! -- new in this version is:

- split up the huge delay-dequeue patch
- tested/fixed cfs-bandwidth
- PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
- SCHED_BATCH is equivalent to RESPECT_SLICE
- propagate min_slice up cgroups
- CLOCK_THREAD_DVFS_ID

Hi Peter,

On 7/27/24 11:27, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
>

Thanks for the updated series. FTR, I ran this through our Pixel 6 and did not
see significant differences between the baseline (6.8 kernel with EEVDF + fixes)
and the baseline with the updated eevdf-complete series applied on top. The
power use and frame metrics seem mostly the same between the two. I'll run a
few more tests just in case.

On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>

Hi Peter,

TL;DR:
We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
change we accept. Some of these changes are merges from Linus's tree.
We can see a very clear change in behavior with this patch series.
Based on what we are seeing, we'd expect this change in behavior to
cause pretty serious power regression (7-25%) depending on what the
actual bug is and the use case.

Intro:
We run these tests 20 times for every build (a bunch of changes). All
the data below is from the 20+ builds before this series and 20 builds
after this series (inclusive). So, all the "before numbers" are from
(20 x 20) 400+ runs and all the after numbers are from another 400+
runs.

Test:
We create a synthetic "tiny" thread that runs for 3ms and sleeps for
10ms at Fmin. We let it run like this for several seconds to make sure
the util is low and all the "new thread" boost stuff isn't kicking in.
So, at this point, the CPU frequency is at Fmin. Then we let this
thread run continuously without sleeping and measure (using ftrace)
the time it takes for the CPU to get to Fmax.

We do this separately (fresh run) on the Pixel 6 with the cpu affinity
set to each cluster and once without any cpu affinity (thread starts
at little).

Data:
All the values below are in milliseconds.

When the thread is not affined to any CPU: So thread starts on little,
ramps up to fmax, migrates to middle, ramps up to fmax, migrates to
big, ramps up to fmax.
+-----------------+--------+-------+
| Data            | Before | After |
|-----------------+--------+-------|
| 5th percentile  | 169    | 151   |
|-----------------+--------+-------|
| Median          | 221    | 177   |
|-----------------+--------+-------|
| Mean            | 221    | 177   |
|-----------------+--------+-------|
| 95th percentile | 249    | 200   |
+-----------------+--------+-------+

When thread is affined to the little cluster:
The average time to reach Fmax is 104 ms without this series and 66 ms
after this series. We didn't collect the individual per run data. We
can if you really need it. We also noticed that the little cluster
wouldn't go to Fmin (300 MHz) after this series even when the CPUs are
mostly idle. It was instead hovering at 738 MHz (the Fmax is ~1800
MHz).
When thread is affined to the middle cluster:
+-----------------+--------+-------+
| Data            | Before | After |
|-----------------+--------+-------|
| 5th percentile  | 99     | 84    |
|-----------------+--------+-------|
| Median          | 111    | 104   |
|-----------------+--------+-------|
| Mean            | 111    | 104   |
|-----------------+--------+-------|
| 95th percentile | 120    | 119   |
+-----------------+--------+-------+

When thread is affined to the big cluster:
+-----------------+--------+-------+
| Data            | Before | After |
|-----------------+--------+-------|
| 5th percentile  | 138    | 96    |
|-----------------+--------+-------|
| Median          | 147    | 144   |
|-----------------+--------+-------|
| Mean            | 145    | 134   |
|-----------------+--------+-------|
| 95th percentile | 151    | 150   |
+-----------------+--------+-------+

As you can see, the ramp up time has decreased noticeably. Also, as
you can tell from the 5th percentile numbers, the standard deviation
has also increased a lot too, causing a wider spread of the ramp up
time (more noticeable in the middle and big clusters). So in general
this looks like it's going to increase the usage of the middle and big
CPU clusters and also going to shift the CPU frequency residency to
frequencies that are 5 to 25% higher.

We already checked the rate_limit_us value and it is the same for both
the before/after cases and it's set to 7.5 ms (jiffies is 4ms in our
case). Also, based on my limited understanding the DELAYED_DEQUEUE
stuff is only relevant if there are multiple contending threads in a
CPU. In this case it's just 1 continuously running thread with a
kworker that runs sporadically less than 1% of the time.

So, without a deeper understanding of this patch series, it's behaving
as if the PELT signal is accumulating faster than expected. Which is a
bit surprising to me because AFAIK (which is not much) the EEVDF
series isn't supposed to change the PELT behavior.

If you want to get a visual idea of what the system is doing, here are
some perfetto links that visualize the traces. Hopefully you have
access permissions to these. You can use the W, S, A, D keys to pan
and zoom around the timeline.

Big Before:
https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
Big After:
https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9

Thanks,
Saravana, Sam and David
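For anyone who wants to try something similar, the "tiny" thread described
above could be approximated with a sketch along these lines. The 3 ms run /
10 ms sleep duty cycle and the later continuous spin are taken from the test
description; the optional CPU argument, loop counts and helper names are my
own assumptions, not the actual test code:

/*
 * Rough sketch of the "tiny" workload described above (not the actual
 * test code): busy-run ~3ms, sleep ~10ms for a while, then spin
 * continuously. Optionally pin to a CPU given on the command line.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void busy_ns(long long dur)
{
	long long start = now_ns();

	while (now_ns() - start < dur)
		;	/* burn cycles */
}

int main(int argc, char **argv)
{
	struct timespec sleep_ts = { 0, 10 * 1000 * 1000 };	/* 10ms */
	int i;

	if (argc > 1) {		/* optional: pin to one CPU/cluster */
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(atoi(argv[1]), &set);
		if (sched_setaffinity(0, sizeof(set), &set))
			perror("sched_setaffinity");
	}

	/* phase 1: ~3ms run / ~10ms sleep, long enough for util to settle */
	for (i = 0; i < 500; i++) {	/* roughly 6.5 seconds */
		busy_ns(3 * 1000 * 1000);
		nanosleep(&sleep_ts, NULL);
	}

	/* phase 2: run continuously; time-to-Fmax is then read from ftrace */
	for (;;)
		busy_ns(1000 * 1000);

	return 0;
}
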
(+ Mike, Luis)
Hello Saravana, Sam, David,
On 11/6/2024 6:37 AM, Saravana Kannan wrote:
> On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree for ever it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>>
>
> Hi Peter,
>
> TL;DR:
> We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
> change we accept. Some of these changes are merges from Linus's tree.
> We can see a very clear change in behavior with this patch series.
> Based on what we are seeing, we'd expect this change in behavior to
> cause pretty serious power regression (7-25%) depending on what the
> actual bug is and the use case.
Do the regressions persist with NO_DELAY_DEQUEUE? You can disable the
DELAY_DEQUEUE feature added in EEVDF Complete via debugfs by doing a:
# echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
Since delayed entities are still on the runqueue, they can affect PELT
calculation. Vincent and Dietmar have both noted this and Peter posted
https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
in response but it was pulled out since Luis reported observing -ve
values for h_nr_delayed on his setup. A lot has been fixed around
delayed dequeue since and I wonder if now would be the right time to
re-attempt h_nr_delayed tracking.
There is also the fact that delayed entities don't update the tg
load_avg, since the delayed path calls update_load_avg() without the
UPDATE_TG flag set. This can again cause some changes in PELT
calculation, since the averages are used to estimate the entity
shares when running with cgroups.
>
> Intro:
> We run these tests 20 times for every build (a bunch of changes). All
> the data below is from the 20+ builds before this series and 20 builds
> after this series (inclusive). So, all the "before numbers" are from
> (20 x 20) 400+ runs and all the after numbers are from another 400+
> runs.
>
> Test:
> We create a synthetic "tiny" thread that runs for 3ms and sleeps for
> 10ms at Fmin. We let it run like this for several seconds to make sure
> the util is low and all the "new thread" boost stuff isn't kicking in.
> So, at this point, the CPU frequency is at Fmin. Then we let this
> thread run continuously without sleeping and measure (using ftrace)
> the time it takes for the CPU to get to Fmax.
>
> We do this separately (fresh run) on the Pixel 6 with the cpu affinity
> set to each cluster and once without any cpu affinity (thread starts
> at little).
>
> Data:
> All the values below are in milliseconds.
>
> When the thread is not affined to any CPU: So thread starts on little,
> ramps up to fmax, migrates to middle, ramps up to fmax, migrates to
> big, ramps up to fmax.
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 169    | 151   |
> |-----------------+--------+-------|
> | Median          | 221    | 177   |
> |-----------------+--------+-------|
> | Mean            | 221    | 177   |
> |-----------------+--------+-------|
> | 95th percentile | 249    | 200   |
> +-----------------+--------+-------+
>
> When thread is affined to the little cluster:
> The average time to reach Fmax is 104 ms without this series and 66 ms
> after this series. We didn't collect the individual per run data. We
> can if you really need it. We also noticed that the little cluster
> wouldn't go to Fmin (300 MHz) after this series even when the CPUs are
> mostly idle. It was instead hovering at 738 MHz (the Fmax is ~1800
> MHz).
>
> When thread is affined to the middle cluster:
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 99     | 84    |
> |-----------------+--------+-------|
> | Median          | 111    | 104   |
> |-----------------+--------+-------|
> | Mean            | 111    | 104   |
> |-----------------+--------+-------|
> | 95th percentile | 120    | 119   |
> +-----------------+--------+-------+
>
> When thread is affined to the big cluster:
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 138    | 96    |
> |-----------------+--------+-------|
> | Median          | 147    | 144   |
> |-----------------+--------+-------|
> | Mean            | 145    | 134   |
> |-----------------+--------+-------|
> | 95th percentile | 151    | 150   |
> +-----------------+--------+-------+
>
> As you can see, the ramp up time has decreased noticeably. Also, as
> you can tell from the 5th percentile numbers, the standard deviation
> has also increased a lot too, causing a wider spread of the ramp up
> time (more noticeable in the middle and big clusters). So in general
> this looks like it's going to increase the usage of the middle and big
> CPU clusters and also going to shift the CPU frequency residency to
> frequencies that are 5 to 25% higher.
>
> We already checked the rate_limit_us value and it is the same for both
> the before/after cases and it's set to 7.5 ms (jiffies is 4ms in our
> case). Also, based on my limited understanding the DELAYED_DEQUEUE
> stuff is only relevant if there are multiple contending threads in a
> CPU. In this case it's just 1 continuously running thread with a
> kworker that runs sporadically less than 1% of the time.
There is an ongoing investigation on delayed entities possibly not
migrating if they are woken up before they are fully dequeued. Since you
mention there is only one task, this should not matter but could you
also try out Mike's suggestion from
https://lore.kernel.org/lkml/1bffa5f2ca0fec8a00f84ffab86dc6e8408af31c.camel@gmx.de/
and see if it makes a difference on your test suite?
--
Thanks and Regards,
Prateek
>
> So, without a deeper understanding of this patch series, it's behaving
> as if the PELT signal is accumulating faster than expected. Which is a
> bit surprising to me because AFAIK (which is not much) the EEVDF
> series isn't supposed to change the PELT behavior.
>
> If you want to get a visual idea of what the system is doing, here are
> some perfetto links that visualize the traces. Hopefully you have
> access permissions to these. You can use the W, S, A, D keys to pan
> and zoom around the timeline.
>
> Big Before:
> https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
> Big After:
> https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9
P.S. I only gave it a quick glance, but I do see the frequency ramping up with
larger deltas and reaching Fmax much more quickly in the "Big After" case.
>
> Thanks,
> Saravana, Sam and David
On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:

> Since delayed entities are still on the runqueue, they can affect PELT
> calculation. Vincent and Dietmar have both noted this and Peter posted
> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> in response but it was pulled out since Luis reported observing -ve
> values for h_nr_delayed on his setup. A lot has been fixed around
> delayed dequeue since and I wonder if now would be the right time to
> re-attempt h_nr_delayed tracking.

Yeah, it's something I meant to get back to. I think the patch as posted
was actually right and it didn't work for Luis because of some other,
since-fixed issue.

But I might be misremembering things. I'll get to it eventually :/
Hi,

On 11/6/24 11:09, Peter Zijlstra wrote:
> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
>
>> Since delayed entities are still on the runqueue, they can affect PELT
>> calculation. Vincent and Dietmar have both noted this and Peter posted
>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>> in response but it was pulled out since Luis reported observing -ve
>> values for h_nr_delayed on his setup. A lot has been fixed around
>> delayed dequeue since and I wonder if now would be the right time to
>> re-attempt h_nr_delayed tracking.
>
> Yeah, it's something I meant to get back to. I think the patch as posted
> was actually right and it didn't work for Luis because of some other,
> since fixed issue.
>
> But I might be misremembering things. I'll get to it eventually :/

Sorry for the late reply, I got sidetracked on something else.

There have been a few power regressions (based on our Pixel6-based testing) due
to the delayed-dequeue series.

The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
handling. That has since been fixed by "[PATCH 10/24] sched/uclamp: Handle delayed dequeue". [1]

The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.

So power use was brought down by the above fix, but some issues still remained, like the
accounting issues with h_nr_running and not taking sched_delayed tasks into account.

Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]

Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
accounting that slowly drove frequencies up again.

If I recall correctly, Peter has pulled that particular patch from the tree, but we should
definitely revisit it with a proper fix for the imbalance. Suggestion in [5].

[1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
[2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
[3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
[4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
[5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
>
> Hi,
>
> On 11/6/24 11:09, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> >
> >> Since delayed entities are still on the runqueue, they can affect PELT
> >> calculation. Vincent and Dietmar have both noted this and Peter posted
> >> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >> in response but it was pulled out since Luis reported observing -ve
> >> values for h_nr_delayed on his setup. A lot has been fixed around
> >> delayed dequeue since and I wonder if now would be the right time to
> >> re-attempt h_nr_delayed tracking.
> >
> > Yeah, it's something I meant to get back to. I think the patch as posted
> > was actually right and it didn't work for Luis because of some other,
> > since fixed issue.
> >
> > But I might be misremembering things. I'll get to it eventually :/
>
> Sorry for the late reply, I got sidetracked on something else.
>
> There have been a few power regressions (based on our Pixel6-based testing) due
> to the delayed-dequeue series.
>
> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
>
> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
>
> So power use was brought down by the above fix, but some issues still remained, like the
> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
>
> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
>
> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> accounting that slowly drove frequencies up again.
>
> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
>
> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/

Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
recollect the data to see if that's the cause. We'll get back to this
thread once we have some data.

-Saravana
On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
>
> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
> >
> > Hi,
> >
> > On 11/6/24 11:09, Peter Zijlstra wrote:
> > > On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> > >
> > >> Since delayed entities are still on the runqueue, they can affect PELT
> > >> calculation. Vincent and Dietmar have both noted this and Peter posted
> > >> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> > >> in response but it was pulled out since Luis reported observing -ve
> > >> values for h_nr_delayed on his setup. A lot has been fixed around
> > >> delayed dequeue since and I wonder if now would be the right time to
> > >> re-attempt h_nr_delayed tracking.
> > >
> > > Yeah, it's something I meant to get back to. I think the patch as posted
> > > was actually right and it didn't work for Luis because of some other,
> > > since fixed issue.
> > >
> > > But I might be misremembering things. I'll get to it eventually :/
> >
> > Sorry for the late reply, I got sidetracked on something else.
> >
> > There have been a few power regressions (based on our Pixel6-based testing) due
> > to the delayed-dequeue series.
> >
> > The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> > handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
> >
> > The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> > imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
> >
> > So power use was brought down by the above fix, but some issues still remained, like the
> > accounting issues with h_nr_running and not taking sched_delayed tasks into account.
> >
> > Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
> >
> > Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> > mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> > accounting that slowly drove frequencies up again.
> >
> > If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> > definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
> >
> > [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> > [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> > [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> > [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> > [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>
> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
> recollect the data to see if that's the cause. We'll get back to this
> thread once we have some data.
>
> -Saravana

The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.

Same test example from before, when thread is affined to the big cluster:
+-----------------+---------+----------+
| Data            | Enabled | Disabled |
|-----------------+---------+----------|
| 5th percentile  | 96      | 143      |
|-----------------+---------+----------|
| Median          | 144     | 147      |
|-----------------+---------+----------|
| Mean            | 134     | 147      |
|-----------------+---------+----------|
| 95th percentile | 150     | 150      |
+-----------------+---------+----------+

What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
be enabled by default and/or is there a fix coming for 6.12?

Thanks,
Sam
Hello Sam,
On 11/9/2024 4:47 AM, Samuel Wu wrote:
> On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
>>
>> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> On 11/6/24 11:09, Peter Zijlstra wrote:
>>>> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
>>>>
>>>>> Since delayed entities are still on the runqueue, they can affect PELT
>>>>> calculation. Vincent and Dietmar have both noted this and Peter posted
>>>>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>>>>> in response but it was pulled out since Luis reported observing -ve
>>>>> values for h_nr_delayed on his setup. A lot has been fixed around
>>>>> delayed dequeue since and I wonder if now would be the right time to
>>>>> re-attempt h_nr_delayed tracking.
>>>>
>>>> Yeah, it's something I meant to get back to. I think the patch as posted
>>>> was actually right and it didn't work for Luis because of some other,
>>>> since fixed issue.
>>>>
>>>> But I might be misremembering things. I'll get to it eventually :/
>>>
>>> Sorry for the late reply, I got sidetracked on something else.
>>>
>>> There have been a few power regressions (based on our Pixel6-based testing) due
>>> to the delayed-dequeue series.
>>>
>>> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
>>> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
>>>
>>> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
>>> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
>>>
>>> So power use was brought down by the above fix, but some issues still remained, like the
>>> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
>>>
>>> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
>>>
>>> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
>>> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
>>> accounting that slowly drove frequencies up again.
>>>
>>> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
>>> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
>>>
>>> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
>>> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
>>> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>>> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>>> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>>
>> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
>> recollect the data to see if that's the cause. We'll get back to this
>> thread once we have some data.
>>
>> -Saravana
>
> The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.
>
> Same test example from before, when thread is affined to the big cluster:
> +-----------------+---------+----------+
> | Data            | Enabled | Disabled |
> |-----------------+---------+----------|
> | 5th percentile  | 96      | 143      |
> |-----------------+---------+----------|
> | Median          | 144     | 147      |
> |-----------------+---------+----------|
> | Mean            | 134     | 147      |
> |-----------------+---------+----------|
> | 95th percentile | 150     | 150      |
> +-----------------+---------+----------+
>
> What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
> be enabled by default and/or is there a fix coming for 6.12?
DELAY_DEQUEUE should be enabled by default from v6.12, but there are a
few fixes for it in flight. Could you try running with the changes
from [1] and [2] and see if you can reproduce the behavior, and if
you can, is it equally bad?
Both changes apply cleanly for me on top of current
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit fe9beaaa802d ("sched: No PREEMPT_RT=y for all{yes,mod}config")
when applied in that order.
[1] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
[2] https://lore.kernel.org/lkml/750542452c4f852831e601e1b8de40df4b108d9a.camel@gmx.de/
>
> Thanks,
> Sam
--
Thanks and Regards,
Prateek
On Sun, Nov 10, 2024 at 8:08 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Sam,
>
> On 11/9/2024 4:47 AM, Samuel Wu wrote:
> > On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
> >>
> >> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 11/6/24 11:09, Peter Zijlstra wrote:
> >>>> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> >>>>
> >>>>> Since delayed entities are still on the runqueue, they can affect PELT
> >>>>> calculation. Vincent and Dietmar have both noted this and Peter posted
> >>>>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >>>>> in response but it was pulled out since Luis reported observing -ve
> >>>>> values for h_nr_delayed on his setup. A lot has been fixed around
> >>>>> delayed dequeue since and I wonder if now would be the right time to
> >>>>> re-attempt h_nr_delayed tracking.
> >>>>
> >>>> Yeah, it's something I meant to get back to. I think the patch as posted
> >>>> was actually right and it didn't work for Luis because of some other,
> >>>> since fixed issue.
> >>>>
> >>>> But I might be misremembering things. I'll get to it eventually :/
> >>>
> >>> Sorry for the late reply, I got sidetracked on something else.
> >>>
> >>> There have been a few power regressions (based on our Pixel6-based testing) due
> >>> to the delayed-dequeue series.
> >>>
> >>> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> >>> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
> >>>
> >>> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> >>> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
> >>>
> >>> So power use was brought down by the above fix, but some issues still remained, like the
> >>> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
> >>>
> >>> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
> >>>
> >>> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> >>> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> >>> accounting that slowly drove frequencies up again.
> >>>
> >>> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> >>> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
> >>>
> >>> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> >>> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> >>> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >>> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> >>> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> >>
> >> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
> >> recollect the data to see if that's the cause. We'll get back to this
> >> thread once we have some data.
> >>
> >> -Saravana
> >
> > The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.
> >
> > Same test example from before, when thread is affined to the big cluster:
> > +-----------------+---------+----------+
> > | Data            | Enabled | Disabled |
> > |-----------------+---------+----------|
> > | 5th percentile  | 96      | 143      |
> > |-----------------+---------+----------|
> > | Median          | 144     | 147      |
> > |-----------------+---------+----------|
> > | Mean            | 134     | 147      |
> > |-----------------+---------+----------|
> > | 95th percentile | 150     | 150      |
> > +-----------------+---------+----------+
> >
> > What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
> > be enabled by default and/or is there a fix coming for 6.12?
>
> DELAY_DEQUEUE should be enabled by default from v6.12 but there are a
> few fixes for the same in-flight. Could try running with the changes
> from [1] and [2] and see if you could reproduce the behavior and if
> you can, is it equally bad?
>
> Both changes apply cleanly for me on top of current
>
> git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
>
> at commit fe9beaaa802d ("sched: No PREEMPT_RT=y for all{yes,mod}config")
> when applied in that order.
>
> [1] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> [2] https://lore.kernel.org/lkml/750542452c4f852831e601e1b8de40df4b108d9a.camel@gmx.de/
Have these changes landed in 6.12? Or will they land in 6.13?
We tested 6.12 and the issue we reported is still present. What should
we do for any products we want to ship on 6.12? Disable Delayed
Dequeue or backport any fixes to 6.12 LTS?
Peter/Vincent, do you plan on backporting the future fixes to the 6.12
LTS kernel? Anything else we can do to help with making sure this is
fixed on the LTS kernel?
-Saravana
Peter Zijlstra <peterz@infradead.org> writes:

> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID

I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:

Sometimes the system doesn't manage to print an oops, this one is the best I got:

[ 596.146142] ------------[ cut here ]------------
[ 596.146161] se->sched_delayed
[ 596.146166] WARNING: CPU: 1 PID: 0 at kernel/sched/fair.c:13131 __set_next_task_fair.part.0+0x350/0x400
[ 596.146179] Modules linked in: [..]
[ 596.146288] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.11.0-rc7-next-20240909 #18
[ 596.146294] Hardware name: IBM 3931 A01 704 (LPAR)
[ 596.146298] Krnl PSW : 0404e00180000000 001a9c2b5eea4ea4 (__set_next_task_fair.part.0+0x354/0x400)
[ 596.146307] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 596.146314] Krnl GPRS: 001c000300000027 001c000300000023 0000000000000011 0000000000000004
[ 596.146319] 0000000000000001 001a9c2b5f1fb118 000000036ef94dd5 0000001b77ca6ea8
[ 596.146323] 001c000000000000 001a9c2b5eec6fc0 0000001b77ca6000 00000000b7334800
[ 596.146328] 0000000000000000 001a9c2b5eefad70 001a9c2b5eea4ea0 001a9bab5ee8f9f8
[ 596.146340] Krnl Code: 001a9c2b5eea4e94: c0200121bbe6 larl %r2,001a9c2b612dc660
[ 596.146340] 001a9c2b5eea4e9a: c0e5fff9e9d3 brasl %r14,001a9c2b5ede2240
[ 596.146340] #001a9c2b5eea4ea0: af000000 mc 0,0
[ 596.146340] >001a9c2b5eea4ea4: a7f4fe83 brc 15,001a9c2b5eea4baa
[ 596.146340] 001a9c2b5eea4ea8: c0e50038ba2c brasl %r14,001a9c2b5f5bc300
[ 596.146558] CPU: 1 UID: 0 PID: 18582 Comm: prctl-sched-cor Tainted: G W 6.11.0-rc7-next-20240909 #18
[ 596.146564] Tainted: [W]=WARN
[ 596.146567] Hardware name: IBM 3931 A01 704 (LPAR)
[ 596.146570] Krnl PSW : 0404e00180000000 001a9c2b5eec2de4 (dequeue_entity+0xe64/0x11f0)
[ 596.146578] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 596.146584] Krnl GPRS: 001c000300000027 001c000300000023 000000000000001a 0000000000000004
[ 596.146589] 0000000000000001 001a9c2b5f1fb118 001a9c2b61be7144 0000000016e6692a
[ 596.146593] 0000000000000001 00000000b7334951 0000000158494800 00000000b7334900
[ 596.146597] 000000000000489e 0000000000000009 001a9c2b5eec2de0 001a9bab75dff760
[ 596.146607] Krnl Code: 001a9c2b5eec2dd4: c0200120cdf6 larl %r2,001a9c2b612dc9c0
[ 596.146607] 001a9c2b5eec2dda: c0e5fff8fa33 brasl %r14,001a9c2b5ede2240
[ 596.146607] #001a9c2b5eec2de0: af000000 mc 0,0
[ 596.146607] >001a9c2b5eec2de4: c004fffff90a brcl 0,001a9c2b5eec1ff8
[ 596.146607] 001a9c2b5eec2dea: a7f4fbbe brc 15,001a9c2b5eec2566
[ 596.146607] 001a9c2b5eec2dee: a7d10001 tmll %r13,1
[ 596.146607] 001a9c2b5eec2df2: a774fb1c brc 7,001a9c2b5eec242a
[ 596.146607] 001a9c2b5eec2df6: a7f4f95f brc 15,001a9c2b5eec20b4
[ 596.146637] Call Trace:
[ 596.146640] [<001a9c2b5eec2de4>] dequeue_entity+0xe64/0x11f0
[ 596.146645] ([<001a9c2b5eec2de0>] dequeue_entity+0xe60/0x11f0)
[ 596.146650] [<001a9c2b5eec34b0>] dequeue_entities+0x340/0xe10
[ 596.146655] [<001a9c2b5eec4208>] dequeue_task_fair+0xb8/0x5a0
[ 596.146660] [<001a9c2b6115ab68>] __schedule+0xb58/0x14f0
[ 596.146666] [<001a9c2b6115b59c>] schedule+0x9c/0x240
[ 596.146670] [<001a9c2b5edf5190>] do_wait+0x160/0x440
[ 596.146676] [<001a9c2b5edf5936>] kernel_waitid+0xd6/0x110
[ 596.146680] [<001a9c2b5edf5b4e>] __do_sys_waitid+0x1de/0x1f0
[ 596.146685] [<001a9c2b5edf5c36>] __s390x_sys_waitid+0xd6/0x120
[ 596.146690] [<001a9c2b5ed0cbd6>] do_syscall+0x2f6/0x430
[ 596.146695] [<001a9c2b611543a4>] __do_syscall+0xa4/0x170
[ 596.146700] [<001a9c2b6117046c>] system_call+0x74/0x98
[ 596.146705] Last Breaking-Event-Address:
[ 596.146707] [<001a9c2b5ede2418>] __warn_printk+0x1d8/0x1e0

This happens when running the strace test suite. The system normally has
128 CPUs. With this configuration the crash doesn't happen, but when
disabling all but four CPUs and running 'make check -j16' in the strace
test suite the crash is almost always reproducible.

Regards
Sven
Sven Schnelle <svens@linux.ibm.com> writes:
> Peter Zijlstra <peterz@infradead.org> writes:
>
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree for ever it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>
> I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
>
> Sometimes the system doesn't manage to print a oops, this one is the best i got:
>
> [..]
> This happens when running the strace test suite. The system normaly has
> 128 CPUs. With this configuration the crash doesn't happen, but when
> disabling all but four CPUs and running 'make check -j16' in the strace
> test suite the crash is almost always reproducable.
I failed to add the log from git bisect. Unfortunately I had to skip
some commits because the kernel didn't compile:
git bisect start
# status: waiting for both good and bad commits
# bad: [100cc857359b5d731407d1038f7e76cd0e871d94] Add linux-next specific files for 20240909
git bisect bad 100cc857359b5d731407d1038f7e76cd0e871d94
# status: waiting for good commit(s), bad commit known
# good: [da3ea35007d0af457a0afc87e84fddaebc4e0b63] Linux 6.11-rc7
git bisect good da3ea35007d0af457a0afc87e84fddaebc4e0b63
# good: [df20078b9706977cc3308740b56993cf27665f90] Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect good df20078b9706977cc3308740b56993cf27665f90
# good: [609f9e1b6242e7158ce96f9124372601997ce56c] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc.git
git bisect good 609f9e1b6242e7158ce96f9124372601997ce56c
# skip: [664c3413e9a6c345a6c926841358314be9da8309] Merge branch 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
git bisect skip 664c3413e9a6c345a6c926841358314be9da8309
# good: [16531118ba63dd9bcd65203d04a9c9d6f6800547] iio: bmi323: peripheral in lowest power state on suspend
git bisect good 16531118ba63dd9bcd65203d04a9c9d6f6800547
# bad: [d9c7ac7f8bfb16f431daa7c77bdfe2b163361ead] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git
git bisect bad d9c7ac7f8bfb16f431daa7c77bdfe2b163361ead
# bad: [05536babd768b38d84ad168450f48634a013603d] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect bad 05536babd768b38d84ad168450f48634a013603d
# good: [dabef94a179957db117db344b924e5d5c4074e5f] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good dabef94a179957db117db344b924e5d5c4074e5f
# bad: [d4886a325947ecae6867fc858657062211aae3b9] Merge branch into tip/master: 'locking/core'
git bisect bad d4886a325947ecae6867fc858657062211aae3b9
# bad: [51c095bee5c77590d43519f03179342e910d333c] Merge branch into tip/master: 'core/core'
git bisect bad 51c095bee5c77590d43519f03179342e910d333c
# bad: [fc1892becd5672f52329a75c73117b60ac7841b7] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
git bisect bad fc1892becd5672f52329a75c73117b60ac7841b7
# good: [ae04f69de0bef93c7086cf2983dbc8e8fd624ebe] sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()
git bisect good ae04f69de0bef93c7086cf2983dbc8e8fd624ebe
# good: [abc158c82ae555078aa5dd2d8407c3df0f868904] sched: Prepare generic code for delayed dequeue
git bisect good abc158c82ae555078aa5dd2d8407c3df0f868904
# skip: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
git bisect skip 781773e3b68031bd001c0c18aa72e8470c225ebd
# skip: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
git bisect skip e1459a50ba31831efdfc35278023d959e4ba775b
# skip: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
git bisect skip a1c446611e31ca5363d4db51e398271da1dce0af
# good: [e28b5f8bda01720b5ce8456b48cf4b963f9a80a1] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
git bisect good e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
# skip: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
git bisect skip f12e148892ede8d9ee82bcd3e469e6d01fc077ac
# skip: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
git bisect skip 152e11f6df293e816a6a37c69757033cdc72667d
# skip: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
git bisect skip 2e0199df252a536a03f4cb0810324dff523d1e79
# bad: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
git bisect bad 54a58a78779169f9c92a51facf6de7ce94962328
# only skipped commits left to test
# possible first bad commit: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
# possible first bad commit: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
# possible first bad commit: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
# possible first bad commit: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
# possible first bad commit: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
# possible first bad commit: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
# possible first bad commit: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
On Tue, Sep 10, 2024 at 02:21:05PM +0200, Sven Schnelle wrote:
> Sven Schnelle <svens@linux.ibm.com> writes:
>
> > Peter Zijlstra <peterz@infradead.org> writes:
> >
> >> Hi all,
> >>
> >> So after much delay this is hopefully the final version of the EEVDF patches.
> >> They've been sitting in my git tree for ever it seems, and people have been
> >> testing it and sending fixes.
> >>
> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> >> as I know that was the very last issue holding it back.
> >>
> >> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> >> in tip/sched/core once -rc1 drops.
> >>
> >> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
> >>
> >>
> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
> >>
> >> - split up the huge delay-dequeue patch
> >> - tested/fixed cfs-bandwidth
> >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> >> - SCHED_BATCH is equivalent to RESPECT_SLICE
> >> - propagate min_slice up cgroups
> >> - CLOCK_THREAD_DVFS_ID
> >
> > I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
> >
> > Sometimes the system doesn't manage to print a oops, this one is the best i got:
> >
> > [..]
> > This happens when running the strace test suite. The system normaly has
> > 128 CPUs. With this configuration the crash doesn't happen, but when
> > disabling all but four CPUs and running 'make check -j16' in the strace
> > test suite the crash is almost always reproducable.
I noted: Comm: prctl-sched-cor, which is testing core scheduling, right?
Only today I've merged a fix for that:
c662e2b1e8cf ("sched: Fix sched_delayed vs sched_core")
Could you double check if merging tip/sched/core into your next tree
helps anything at all?
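(For readers who don't recognize the comm: a core-scheduling test like that one
typically just creates a cookie via the PR_SCHED_CORE prctl and runs work under
it. A rough sketch of that call is below; the fallback defines reflect my
reading of the uapi header, and the whole thing is illustrative, not the actual
selftest.)

/*
 * Minimal sketch of the core-scheduling prctl: give the current thread
 * group a unique core_sched cookie. Constants come from linux/prctl.h;
 * the fallback defines are only for older userspace headers.
 */
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE		62
#define PR_SCHED_CORE_CREATE	1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD_GROUP
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
#endif

int main(void)
{
	/* create a unique core-sched cookie for this whole thread group */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(),
		  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
		perror("PR_SCHED_CORE_CREATE");
		return 1;
	}

	pause();	/* keep the cookie alive for inspection */
	return 0;
}
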
Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, Sep 10, 2024 at 02:21:05PM +0200, Sven Schnelle wrote:
>> Sven Schnelle <svens@linux.ibm.com> writes:
>>
>> > Peter Zijlstra <peterz@infradead.org> writes:
>> >
>> >> Hi all,
>> >>
>> >> So after much delay this is hopefully the final version of the EEVDF patches.
>> >> They've been sitting in my git tree for ever it seems, and people have been
>> >> testing it and sending fixes.
>> >>
>> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> >> as I know that was the very last issue holding it back.
>> >>
>> >> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> >> in tip/sched/core once -rc1 drops.
>> >>
>> >> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>> >>
>> >>
>> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>> >>
>> >> - split up the huge delay-dequeue patch
>> >> - tested/fixed cfs-bandwidth
>> >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> >> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> >> - propagate min_slice up cgroups
>> >> - CLOCK_THREAD_DVFS_ID
>> >
>> > I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
>> >
>> > Sometimes the system doesn't manage to print a oops, this one is the best i got:
>> >
>> > [..]
>> > This happens when running the strace test suite. The system normaly has
>> > 128 CPUs. With this configuration the crash doesn't happen, but when
>> > disabling all but four CPUs and running 'make check -j16' in the strace
>> > test suite the crash is almost always reproducable.
>
> I noted: Comm: prctl-sched-cor, which is testing core scheduling, right?
>
> Only today I;ve merged a fix for that:
>
> c662e2b1e8cf ("sched: Fix sched_delayed vs sched_core")
>
> Could you double check if merging tip/sched/core into your next tree
> helps anything at all?
Yes, that fixes the issue. Thanks!
This series has caused an explosion of different kernel crashes on our syzbot
instance that fuzzes linux-next. I guess such kernel behavior indicates some
massive underlying memory corruption (?)

Some of the crash titles we've seen (we didn't release them -- there were too
many, 70+):

KASAN: stack-out-of-bounds Write in insn_decode
kernel panic: stack is corrupted in vprintk_store
kernel panic: stack is corrupted in _printk
BUG: spinlock recursion in __schedule
WARNING in __put_task_struct
BUG: unable to handle kernel NULL pointer dereference in asm_exc_page_fault
WARNING in rng_dev_read
BUG: scheduling while atomic in prb_final_commit
kernel BUG in dequeue_rt_stack
BUG: scheduling while atomic in rcu_is_watching
BUG: spinlock recursion in copy_process
KASAN: slab-use-after-free Read in sched_core_enqueue
kernel panic: stack is corrupted in refill_stock
kernel panic: stack is corrupted in prb_reserve
WARNING: bad unlock balance in timekeeping_get_ns
KASAN: slab-use-after-free Read in set_next_task_fair

I wonder if the actual problem is already known and possibly there are even
some fix patches? If not, and if it may be of any help, we can try to come up
with some contained instructions to reproduce these issues with syzkaller.

--
Aleksandr
On Sat, 27 Jul 2024 at 13:02, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.

While trying to test what would be the impact of delayed dequeue on
load_avg, I noticed something strange with the running slice. I have a
simple test with 2 always-running threads on 1 CPU, and each thread
runs around 100ms continuously before switching to the other one,
whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
between 2 context switches.

I'm using your sched/core branch. Is it the correct one?

>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
>
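For reference, the test Vincent describes can be approximated with two spinning
threads pinned to the same CPU, with the switching period then read from the
sched_switch tracepoint (as Prateek does further down the thread). A sketch
follows; the CPU number and all names are assumptions, not Vincent's actual
tool:

/*
 * Sketch of the test described above: two always-running threads pinned
 * to one CPU. With a 3ms base slice one would expect the scheduler to
 * switch between them every few milliseconds, observable via the
 * sched_switch tracepoint.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define TEST_CPU 0	/* arbitrary choice */

static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		;	/* always runnable, never sleeps */
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	cpu_set_t set;
	int i;

	CPU_ZERO(&set);
	CPU_SET(TEST_CPU, &set);

	for (i = 0; i < 2; i++) {
		if (pthread_create(&t[i], NULL, spin, NULL)) {
			perror("pthread_create");
			return 1;
		}
		/* pin both spinners to the same CPU so they contend */
		pthread_setaffinity_np(t[i], sizeof(set), &set);
	}

	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}
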
On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
>
> While trying to test what would be the impact of delayed dequeue on
> load_avg, I noticed something strange with the running slice. I have a
> simple test with 2 always running threads on 1 CPU and the each thread
> runs around 100ms continuously before switching to the other one
> whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> between 2 context swicthes
>
> I'm using your sched/core branch. Is it the correct one ?

Hm, building that branch, I see the expected tick granularity (4ms).

-Mike
On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
> >
> > While trying to test what would be the impact of delayed dequeue on
> > load_avg, I noticed something strange with the running slice. I have a
> > simple test with 2 always running threads on 1 CPU and the each thread
> > runs around 100ms continuously before switching to the other one
> > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> > between 2 context swicthes
> >
> > I'm using your sched/core branch. Is it the correct one ?
>
> Hm, building that branch, I see the expected tick granularity (4ms).

On my side tip/sched/core switches every 4ms, but Peter's sched/core,
which has the delayed-dequeue work queued on top of tip/sched/core if I
don't get it wrong, switches every 100ms.

>
> -Mike
>
Hello Vincent, Mike,
On 8/14/2024 10:29 PM, Vincent Guittot wrote:
> On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
>>
>> On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
>>>
>>> While trying to test what would be the impact of delayed dequeue on
>>> load_avg, I noticed something strange with the running slice. I have a
>>> simple test with 2 always running threads on 1 CPU and the each thread
>>> runs around 100ms continuously before switching to the other one
>>> whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
>>> between 2 context swicthes
>>>
>>> I'm using your sched/core branch. Is it the correct one ?
>>
>> Hm, building that branch, I see the expected tick granularity (4ms).
>
> On my side tip/sched/core switches every 4ms but Peter's sched/core,
> which is delayed queued on top of tip/sched/core if I don't get it
> wrong, switches every 100ms.
I could not observe this behavior when running two busy loops pinned to
one CPU on my end. I'm running with a base_slice_ns of 3ms, and the
sched_feats related to EEVDF complete look as follows:
PLACE_LAG
PLACE_DEADLINE_INITIAL
PLACE_REL_DEADLINE
RUN_TO_PARITY
PREEMPT_SHORT
NO_NEXT_BUDDY
CACHE_HOT_BUDDY
DELAY_DEQUEUE
DELAY_ZERO
WAKEUP_PREEMPTION
...
Also I'm running with CONFIG_HZ=250 (4ms tick granularity)
CONFIG_HZ_250=y
CONFIG_HZ=250
Enabling the sched_switch tracepoint, I see the following:
...
loop-4061 109.710379: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.714377: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.718375: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.722374: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.726379: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.730377: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.734367: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.738365: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.742364: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.746361: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.750359: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.754357: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.758355: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.762353: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.766351: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.770349: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.774347: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.778345: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.782343: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.786346: sched_switch: ... prev_pid=4060 ... prev_state=R ==> next_comm=kworker/1:1 next_pid=1616 next_prio=120
kworker/1:1-1616 109.786412: sched_switch: prev_comm=kworker/1:1 prev_pid=1616 ... prev_state=I ==> ... next_pid=4061 ...
loop-4061 109.794337: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.798335: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.802335: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.806331: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.810329: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.814327: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.818325: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.822323: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.826321: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.830321: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.834317: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
...
(Trimmed traces are for busy loops with pids 4060 and 4061)
I see the expected tick granularity, similar to Mike. Since Peter's tree
is prone to force-updates, I'm on
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
at commit 4cc290c20a98 "sched/eevdf: Dequeue in switched_from_fair()"
which was committed at "2024-08-14 08:15:39 +0200".
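In case it helps anyone reproduce this, the busy loops were nothing special; something along these lines works (pinning to one CPU with taskset and watching sched_switch with trace-cmd is just one way to do it, so treat the exact commands as a sketch):
taskset -c 1 sh -c 'while :; do :; done' & PID1=$!
taskset -c 1 sh -c 'while :; do :; done' & PID2=$!
trace-cmd record -e sched_switch sleep 5
trace-cmd report | grep -E "$PID1|$PID2"    # switches should line up with the tick (~4ms at HZ=250)
kill "$PID1" "$PID2"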
>
>>
>> -Mike
>>
--
Thanks and Regards,
Prateek
On Wed, 2024-08-14 at 18:59 +0200, Vincent Guittot wrote: > On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote: > > > > On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote: > > > > > > While trying to test what would be the impact of delayed dequeue on > > > load_avg, I noticed something strange with the running slice. I have a > > > simple test with 2 always running threads on 1 CPU and the each thread > > > runs around 100ms continuously before switching to the other one > > > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system) > > > between 2 context swicthes > > > > > > I'm using your sched/core branch. Is it the correct one ? > > > > Hm, building that branch, I see the expected tick granularity (4ms). > > On my side tip/sched/core switches every 4ms but Peter's sched/core, > which is delayed queued on top of tip/sched/core if I don't get it > wrong, switches every 100ms. FWIW, I checked my local master-rt tree as well, which has Peter's latest eevdf series wedged in (plus 4cc290c20a98 now).. it also worked as expected. -Mike
On Wed, 14 Aug 2024 at 19:18, Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2024-08-14 at 18:59 +0200, Vincent Guittot wrote:
> > On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
> > > >
> > > > While trying to test what would be the impact of delayed dequeue on
> > > > load_avg, I noticed something strange with the running slice. I have a
> > > > simple test with 2 always running threads on 1 CPU and the each thread
> > > > runs around 100ms continuously before switching to the other one
> > > > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> > > > between 2 context swicthes
> > > >
> > > > I'm using your sched/core branch. Is it the correct one ?
> > >
> > > Hm, building that branch, I see the expected tick granularity (4ms).
> >
> > On my side tip/sched/core switches every 4ms but Peter's sched/core,
> > which is delayed queued on top of tip/sched/core if I don't get it
> > wrong, switches every 100ms.
>
> FWIW, I checked my local master-rt tree as well, which has Peter's
> latest eevdf series wedged in (plus 4cc290c20a98 now).. it also worked
> as expected.
After looking at deadline and slice, the issue is that my tool was trying
to change the slice (an old trial from a previous version), which got
clamped to 100ms. We can forget this, sorry for the noise.
>
> -Mike
>
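For anyone else tripping over the same thing: the 100ms clamp Vincent hit is consistent with the custom-slice interface in this series, where sched_attr::sched_runtime carries the requested slice for a fair task and out-of-range values get clamped. A rough userspace sketch of setting a 3ms slice follows; the struct is defined locally to mirror the kernel UAPI layout, and the exact clamp bounds are an assumption on my part, so treat this as illustrative only.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the kernel UAPI layout; defined here because glibc does not export it. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* for SCHED_OTHER: requested slice, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;				/* SCHED_OTHER */
	attr.sched_runtime = 3ULL * 1000 * 1000;	/* ask for a 3ms slice */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))	/* pid 0 == current task */
		perror("sched_setattr");
	return 0;
}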
Hi Peter, On 27/07/2024 11:27, Peter Zijlstra wrote: > Hi all, > > So after much delay this is hopefully the final version of the EEVDF patches. > They've been sitting in my git tree for ever it seems, and people have been > testing it and sending fixes. > > I've spend the last two days testing and fixing cfs-bandwidth, and as far > as I know that was the very last issue holding it back. > > These patches apply on top of queue.git sched/dl-server, which I plan on merging > in tip/sched/core once -rc1 drops. > > I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2. > > > Aside from a ton of bug fixes -- thanks all! -- new in this version is: > > - split up the huge delay-dequeue patch > - tested/fixed cfs-bandwidth > - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating > - SCHED_BATCH is equivalent to RESPECT_SLICE > - propagate min_slice up cgroups > - CLOCK_THREAD_DVFS_ID > The latest tip/sched/core at commit aef6987d89544d63a47753cf3741cabff0b5574c crashes very early on on my Juno r2 board (arm64). The trace is here: [ 0.049599] ------------[ cut here ]------------ [ 0.054279] kernel BUG at kernel/sched/deadline.c:63! [ 0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 0.066285] Modules linked in: [ 0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc1-g55404cef33db #1070 [ 0.077855] Hardware name: ARM Juno development board (r2) (DT) [ 0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 0.090919] pc : enqueue_dl_entity+0x53c/0x540 [ 0.095434] lr : dl_server_start+0xb8/0x10c [ 0.099679] sp : ffffffc081ca3c30 [ 0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27: 0000000000000002 [ 0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24: 0000000000000001 [ 0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21: 0000000000000000 [ 0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18: ffffffc080d0ad00 [ 0.132015] x17: 000000000000002f x16: 0000000000000000 x15: ffffffc081ca8000 [ 0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12: 0000000000000001 [ 0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 : 0000000000000030 [ 0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 : 000000002cf93d25 [ 0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 : 0000000000000001 [ 0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 : ffffff897ef21040 [ 0.175483] Call trace: [ 0.177958] enqueue_dl_entity+0x53c/0x540 [ 0.182117] dl_server_start+0xb8/0x10c [ 0.186010] enqueue_task_fair+0x5c8/0x6ac [ 0.190165] enqueue_task+0x54/0x1e8 [ 0.193793] wake_up_new_task+0x250/0x39c [ 0.197862] kernel_clone+0x140/0x2f0 [ 0.201578] user_mode_thread+0x4c/0x58 [ 0.205468] rest_init+0x24/0xd8 [ 0.208743] start_kernel+0x2bc/0x2fc [ 0.212460] __primary_switched+0x80/0x88 [ 0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000) [ 0.222711] ---[ end trace 0000000000000000 ]--- [ 0.227391] Kernel panic - not syncing: Attempted to kill the idle task! [ 0.234187] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- I'm not an expert in DL server so I have no idea where the problem could be. If you know where to look off the top of your head then much better. If not, I'll do some bi-section later. Hongyan
On 20/08/2024 17:43, Hongyan Xia wrote: > Hi Peter, > > On 27/07/2024 11:27, Peter Zijlstra wrote: >> Hi all, >> >> So after much delay this is hopefully the final version of the EEVDF >> patches. >> They've been sitting in my git tree for ever it seems, and people have >> been >> testing it and sending fixes. >> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far >> as I know that was the very last issue holding it back. >> >> These patches apply on top of queue.git sched/dl-server, which I plan >> on merging >> in tip/sched/core once -rc1 drops. >> >> I'm hoping to then merge all this (+- the DVFS clock patch) right >> before -rc2. >> >> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is: >> >> - split up the huge delay-dequeue patch >> - tested/fixed cfs-bandwidth >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating >> - SCHED_BATCH is equivalent to RESPECT_SLICE >> - propagate min_slice up cgroups >> - CLOCK_THREAD_DVFS_ID >> > > The latest tip/sched/core at commit > > aef6987d89544d63a47753cf3741cabff0b5574c > > crashes very early on on my Juno r2 board (arm64). The trace is here: > > [ 0.049599] ------------[ cut here ]------------ > [ 0.054279] kernel BUG at kernel/sched/deadline.c:63! > [ 0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT > SMP > [ 0.066285] Modules linked in: > [ 0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted > 6.11.0-rc1-g55404cef33db #1070 > [ 0.077855] Hardware name: ARM Juno development board (r2) (DT) > [ 0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS > BTYPE=--) > [ 0.090919] pc : enqueue_dl_entity+0x53c/0x540 > [ 0.095434] lr : dl_server_start+0xb8/0x10c > [ 0.099679] sp : ffffffc081ca3c30 > [ 0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27: > 0000000000000002 > [ 0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24: > 0000000000000001 > [ 0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21: > 0000000000000000 > [ 0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18: > ffffffc080d0ad00 > [ 0.132015] x17: 000000000000002f x16: 0000000000000000 x15: > ffffffc081ca8000 > [ 0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12: > 0000000000000001 > [ 0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 : > 0000000000000030 > [ 0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 : > 000000002cf93d25 > [ 0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 : > 0000000000000001 > [ 0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 : > ffffff897ef21040 > [ 0.175483] Call trace: > [ 0.177958] enqueue_dl_entity+0x53c/0x540 > [ 0.182117] dl_server_start+0xb8/0x10c > [ 0.186010] enqueue_task_fair+0x5c8/0x6ac > [ 0.190165] enqueue_task+0x54/0x1e8 > [ 0.193793] wake_up_new_task+0x250/0x39c > [ 0.197862] kernel_clone+0x140/0x2f0 > [ 0.201578] user_mode_thread+0x4c/0x58 > [ 0.205468] rest_init+0x24/0xd8 > [ 0.208743] start_kernel+0x2bc/0x2fc > [ 0.212460] __primary_switched+0x80/0x88 > [ 0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000) > [ 0.222711] ---[ end trace 0000000000000000 ]--- > [ 0.227391] Kernel panic - not syncing: Attempted to kill the idle task! > [ 0.234187] ---[ end Kernel panic - not syncing: Attempted to kill > the idle task! ]--- > > I'm not an expert in DL server so I have no idea where the problem could > be. If you know where to look off the top of your head then much better. > If not, I'll do some bi-section later. 
>
Okay, in case the trace I provided isn't clear enough, I traced the crash to
a call chain like this:

dl_server_start()
enqueue_dl_entity()
update_stats_enqueue_dl()
update_stats_enqueue_sleeper_dl()
__schedstats_from_dl_se()
dl_task_of() <---------- crash

If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if this is
the reason why other people are not seeing this. This is probably not EEVDF
but DL refactoring related.
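My reading of that path (paraphrased from the deadline code, so take the exact shape with a grain of salt) is that dl_task_of() assumes the dl entity is embedded in a task_struct, which the fair server's dl_se is not:

/* Paraphrased sketch, not the exact source: */
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
	BUG_ON(dl_se->dl_server);	/* &rq->fair_server has dl_server set */
	return container_of(dl_se, struct task_struct, dl);
}

So reaching this via __schedstats_from_dl_se() on &rq->fair_server would line up with the BUG at deadline.c:63.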
On Wed, Aug 21, 2024 at 10:46:07AM +0100, Hongyan Xia wrote: > Okay, in case the trace I provided isn't clear enough, I traced the crash to > a call chain like this: > > dl_server_start() > enqueue_dl_entity() > update_stats_enqueue_dl() > update_stats_enqueue_sleeper_dl() > __schedstats_from_dl_se() > dl_task_of() <---------- crash > > If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if this is > the reason why other people are not seeing this. This is probably not EEVDF > but DL refactoring related. Thanks for the report -- I'll see if I can spot something. Since you initially fingered these eevdf patches, could you confirm or deny that changing: kernel/sched/features.h:SCHED_FEAT(DELAY_DEQUEUE, true) to false, makes any difference in the previously failing case?
On 22/08/2024 16:55, Peter Zijlstra wrote: > On Wed, Aug 21, 2024 at 10:46:07AM +0100, Hongyan Xia wrote: >> Okay, in case the trace I provided isn't clear enough, I traced the crash to >> a call chain like this: >> >> dl_server_start() >> enqueue_dl_entity() >> update_stats_enqueue_dl() >> update_stats_enqueue_sleeper_dl() >> __schedstats_from_dl_se() >> dl_task_of() <---------- crash >> >> If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if this is >> the reason why other people are not seeing this. This is probably not EEVDF >> but DL refactoring related. > > Thanks for the report -- I'll see if I can spot something. Since you > initially fingered these eevdf patches, could you confirm or deny that > changing: > > kernel/sched/features.h:SCHED_FEAT(DELAY_DEQUEUE, true) > > to false, makes any difference in the previously failing case? Sadly the issue persists. I'm seeing exactly the same backtrace on my Juno board.
On Wed, 2024-08-21 at 10:46 +0100, Hongyan Xia wrote: > > If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if > this is the reason why other people are not seeing this. This is > probably not EEVDF but DL refactoring related. FWIW, tip.today boots/runs gripe free on dinky rpi4b (SCHEDSTATS=y). -Mike
On 27/07/24 12:27, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
So what I've been testing at your queue/sched/core, HEAD at:
4cc290c20a98 ("sched/eevdf: Dequeue in switched_from_fair()")
survives my (simplistic) CFS bandwidth testing, so FWIW:
Tested-by: Valentin Schneider <vschneid@redhat.com>
And for patches 01-20:
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
with one caveat that I agree with Vincent on maybe doing something about
the load PELT signal of delayed dequeued entities.
Hi all, On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote: > Hi all, > > So after much delay this is hopefully the final version of the EEVDF patches. > They've been sitting in my git tree for ever it seems, and people have been > testing it and sending fixes. > > I've spend the last two days testing and fixing cfs-bandwidth, and as far > as I know that was the very last issue holding it back. > > These patches apply on top of queue.git sched/dl-server, which I plan on merging > in tip/sched/core once -rc1 drops. > > I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2. > > > Aside from a ton of bug fixes -- thanks all! -- new in this version is: > > - split up the huge delay-dequeue patch > - tested/fixed cfs-bandwidth > - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating > - SCHED_BATCH is equivalent to RESPECT_SLICE > - propagate min_slice up cgroups > - CLOCK_THREAD_DVFS_ID We found the following 7 commits from this patch set to crash in enqueue_dl_entity(): 54a58a787791 sched/fair: Implement DELAY_ZERO 152e11f6df29 sched/fair: Implement delayed dequeue e1459a50ba31 sched: Teach dequeue_task() about special task states a1c446611e31 sched,freezer: Mark TASK_FROZEN special 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue Resulting in the following crash dump (this is running v6.12.1): [ 14.652856] sched: DL replenish lagged too much [ 16.572706] ------------[ cut here ]------------ [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c [ 16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_trigg ered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2me m rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15 [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024 [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c [ 16.580661] lr : dl_server_start+0x44/0x12c [ 16.581028] sp : ffff80008002bc00 [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000 [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009 [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0 [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000 [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698 [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000 [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009 [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440 [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0 [ 16.587540] Call trace: [ 16.587754] enqueue_dl_entity+0x46c/0x55c [ 16.588113] 
dl_server_start+0x44/0x12c [ 16.588449] enqueue_task_fair+0x124/0x49c [ 16.588807] enqueue_task+0x3c/0xe0 [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208 [ 16.589511] try_to_wake_up+0x1d0/0x61c [ 16.589847] wake_up_process+0x18/0x24 [ 16.590175] kick_pool+0x84/0x150 [ 16.590467] __queue_work+0x2f4/0x544 [ 16.590788] delayed_work_timer_fn+0x1c/0x28 [ 16.591161] call_timer_fn+0x34/0x1ac [ 16.591481] __run_timer_base+0x20c/0x314 [ 16.591832] run_timer_softirq+0x3c/0x78 [ 16.592176] handle_softirqs+0x124/0x35c [ 16.592520] __do_softirq+0x14/0x20 [ 16.592827] ____do_softirq+0x10/0x1c [ 16.593148] call_on_irq_stack+0x24/0x4c [ 16.593490] do_softirq_own_stack+0x1c/0x2c [ 16.593857] irq_exit_rcu+0x8c/0xc0 [ 16.594163] el0_interrupt+0x48/0xbc [ 16.594477] __el0_irq_handler_common+0x18/0x24 [ 16.594874] el0t_64_irq_handler+0x10/0x1c [ 16.595232] el0t_64_irq+0x190/0x194 [ 16.595545] ---[ end trace 0000000000000000 ]--- [ 16.595950] ------------[ cut here ]------------ It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1]. We are running the exact same scheduler stress test both on Intel NUCs as well as RADXA ROCK 5B board farms. While so far we have not seen this on amd64 it crashes consistently/reproducible on aarch64. We haven't had time to do a non-proprietary reproduction case as of yet but I wanted to report our current findings asking for any feedback/suggestions from the community. Thanks! Cheers Marcel [1] https://hastebin.skyra.pw/hoqesigaye.yaml
On 11/28/24 10:32, Marcel Ziswiler wrote: > Hi all, > > On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote: >> Hi all, >> >> So after much delay this is hopefully the final version of the EEVDF patches. >> They've been sitting in my git tree for ever it seems, and people have been >> testing it and sending fixes. >> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far >> as I know that was the very last issue holding it back. >> >> These patches apply on top of queue.git sched/dl-server, which I plan on merging >> in tip/sched/core once -rc1 drops. >> >> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2. >> >> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is: >> >> - split up the huge delay-dequeue patch >> - tested/fixed cfs-bandwidth >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating >> - SCHED_BATCH is equivalent to RESPECT_SLICE >> - propagate min_slice up cgroups >> - CLOCK_THREAD_DVFS_ID > > We found the following 7 commits from this patch set to crash in enqueue_dl_entity(): > > 54a58a787791 sched/fair: Implement DELAY_ZERO > 152e11f6df29 sched/fair: Implement delayed dequeue > e1459a50ba31 sched: Teach dequeue_task() about special task states > a1c446611e31 sched,freezer: Mark TASK_FROZEN special > 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED > f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue > 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue > > Resulting in the following crash dump (this is running v6.12.1): > > [ 14.652856] sched: DL replenish lagged too much > [ 16.572706] ------------[ cut here ]------------ > [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c > [ 16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan > phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_trigg > ered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm > hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2me > m rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec > drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils > drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax > [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15 > [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024 > [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c > [ 16.580661] lr : dl_server_start+0x44/0x12c > [ 16.581028] sp : ffff80008002bc00 > [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000 > [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009 > [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0 > [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000 > [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698 > [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000 > [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009 > [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440 > [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 > [ 16.586918] x2 : 0000000000000001 
x1 : 0000000000000001 x0 : ffff33c798e112f0 > [ 16.587540] Call trace: > [ 16.587754] enqueue_dl_entity+0x46c/0x55c > [ 16.588113] dl_server_start+0x44/0x12c > [ 16.588449] enqueue_task_fair+0x124/0x49c > [ 16.588807] enqueue_task+0x3c/0xe0 > [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208 > [ 16.589511] try_to_wake_up+0x1d0/0x61c > [ 16.589847] wake_up_process+0x18/0x24 > [ 16.590175] kick_pool+0x84/0x150 > [ 16.590467] __queue_work+0x2f4/0x544 > [ 16.590788] delayed_work_timer_fn+0x1c/0x28 > [ 16.591161] call_timer_fn+0x34/0x1ac > [ 16.591481] __run_timer_base+0x20c/0x314 > [ 16.591832] run_timer_softirq+0x3c/0x78 > [ 16.592176] handle_softirqs+0x124/0x35c > [ 16.592520] __do_softirq+0x14/0x20 > [ 16.592827] ____do_softirq+0x10/0x1c > [ 16.593148] call_on_irq_stack+0x24/0x4c > [ 16.593490] do_softirq_own_stack+0x1c/0x2c > [ 16.593857] irq_exit_rcu+0x8c/0xc0 > [ 16.594163] el0_interrupt+0x48/0xbc > [ 16.594477] __el0_irq_handler_common+0x18/0x24 > [ 16.594874] el0t_64_irq_handler+0x10/0x1c > [ 16.595232] el0t_64_irq+0x190/0x194 > [ 16.595545] ---[ end trace 0000000000000000 ]--- > [ 16.595950] ------------[ cut here ]------------ Random piece of data, but I also had some difficulty making things boot on Android when trying Vincent's nr_running accounting series due to a very similar crash/stack trace. Though what I saw went a bit further and actually crashed within task_contending, called from enqueue_dl_entity. Possibly crashed in one of the inlined functions. Even though the kernel was 6.8 and it was backports, it seems awfully similar to the above. > > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1]. > > We are running the exact same scheduler stress test both on Intel NUCs as well as RADXA ROCK 5B board farms. > While so far we have not seen this on amd64 it crashes consistently/reproducible on aarch64. > > We haven't had time to do a non-proprietary reproduction case as of yet but I wanted to report our current > findings asking for any feedback/suggestions from the community. > > Thanks! > > Cheers > > Marcel > > [1] https://hastebin.skyra.pw/hoqesigaye.yaml >
On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote: > Resulting in the following crash dump (this is running v6.12.1): > > [ 14.652856] sched: DL replenish lagged too much > [ 16.572706] ------------[ cut here ]------------ > [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c > [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15 > [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024 > [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c > [ 16.580661] lr : dl_server_start+0x44/0x12c > [ 16.581028] sp : ffff80008002bc00 > [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000 > [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009 > [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0 > [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000 > [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698 > [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000 > [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009 > [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440 > [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 > [ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0 > [ 16.587540] Call trace: > [ 16.587754] enqueue_dl_entity+0x46c/0x55c > [ 16.588113] dl_server_start+0x44/0x12c > [ 16.588449] enqueue_task_fair+0x124/0x49c > [ 16.588807] enqueue_task+0x3c/0xe0 > [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208 > [ 16.589511] try_to_wake_up+0x1d0/0x61c > [ 16.589847] wake_up_process+0x18/0x24 > [ 16.590175] kick_pool+0x84/0x150 > [ 16.590467] __queue_work+0x2f4/0x544 > [ 16.590788] delayed_work_timer_fn+0x1c/0x28 > [ 16.591161] call_timer_fn+0x34/0x1ac > [ 16.591481] __run_timer_base+0x20c/0x314 > [ 16.591832] run_timer_softirq+0x3c/0x78 > [ 16.592176] handle_softirqs+0x124/0x35c > [ 16.592520] __do_softirq+0x14/0x20 > [ 16.592827] ____do_softirq+0x10/0x1c > [ 16.593148] call_on_irq_stack+0x24/0x4c > [ 16.593490] do_softirq_own_stack+0x1c/0x2c > [ 16.593857] irq_exit_rcu+0x8c/0xc0 > [ 16.594163] el0_interrupt+0x48/0xbc > [ 16.594477] __el0_irq_handler_common+0x18/0x24 > [ 16.594874] el0t_64_irq_handler+0x10/0x1c > [ 16.595232] el0t_64_irq+0x190/0x194 > [ 16.595545] ---[ end trace 0000000000000000 ]--- > [ 16.595950] ------------[ cut here ]------------ > > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1]. Right, I've had a number of these reports, but so far we've not yet managed to figure out how it's all happening. > We are running the exact same scheduler stress test both on Intel NUCs > as well as RADXA ROCK 5B board farms. While so far we have not seen > this on amd64 it crashes consistently/reproducible on aarch64. Oooh, that's something. So far the few reports have not been (easily) reproducible. If this is readily reproducible on arm64 that would help a lot. Juri, do you have access to an arm64 test box? A very long shot: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com doesn't help, does it?
On Thu, 2024-11-28 at 11:58 +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote: > > > Resulting in the following crash dump (this is running v6.12.1): [snip] > > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available > > [1]. > > Right, I've had a number of these reports, but so far we've not yet > managed to figure out how it's all happening. > > > We are running the exact same scheduler stress test both on Intel NUCs > > as well as RADXA ROCK 5B board farms. While so far we have not seen > > this on amd64 it crashes consistently/reproducible on aarch64. > > Oooh, that's something. So far the few reports have not been (easily) > reproducible. If this is readily reproducible on arm64 that would > help a lot. Juri, do you have access to an arm64 test box? As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share anything which helps in resolving this. > A very long shot: > > https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com > > doesn't help, does it? No, still the same with this on top of today's -next. Thanks! Cheers Marcel
On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
> > Oooh, that's something. So far the few reports have not been (easily)
> > reproducible. If this is readily reproducible on arm64 that would
> > help a lot. Juri, do you have access to an arm64 test box?
>
> As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
> anything which helps in resolving this.
I was hoping you could perhaps share a binary with Juri privately or
with RHT (same difference etc), such that he can poke at it too.
Anyway, if you don't mind a bit of back and forth, would you mind adding
the below patch to your kernel and doing:
(all assuming your kernel has ftrace enabled)
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
echo 1 > /proc/sys/kernel/traceoff_on_warning
running your test to failure and then dumping the trace into a file
like:
cat /sys/kernel/debug/tracing/trace > ~/trace
Then compress the file (bzip2 or whatever is popular these days) and
send it my way along with a dmesg dump (private is fine -- these things
tend to be large-ish).
Hopefully, this will give us a little clue as to where the double
enqueue happens.
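For convenience, the whole sequence in one place (assuming debugfs is mounted at /sys/kernel/debug and the debug patch below is applied):

echo 1 > /sys/kernel/debug/tracing/options/stacktrace
echo 1 > /proc/sys/kernel/traceoff_on_warning
# ... run the stress test until the warning fires ...
cat /sys/kernel/debug/tracing/trace | xz > trace.xz   # bzip2/zstd also fine
dmesg | xz > dmesg.xz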
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..b9cd9b40a19f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
scoped_guard (rq_lock, rq) {
struct rq_flags *rf = &scope.rf;
+ if (dl_se == &rq->fair_server) {
+ trace_printk("timer fair server %d throttled %d\n",
+ cpu_of(rq), dl_se->dl_throttled);
+ }
+
if (!dl_se->dl_throttled || !dl_se->dl_runtime)
return HRTIMER_NORESTART;
@@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq_lock(rq, &rf);
}
+ if (dl_se == &rq->fair_server)
+ trace_printk("inactive fair server %d\n", cpu_of(rq));
+
sched_clock_tick();
update_rq_clock(rq);
@@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("enqueue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
@@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("dequeue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
if (RB_EMPTY_NODE(&dl_se->rb_node))
return;
Sorry for my late reply, I was traveling back from Manchester to Switzerland but I am all settled down again.
On Fri, 2024-11-29 at 10:08 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
>
> > > Oooh, that's something. So far the few reports have not been (easily)
> > > reproducible. If this is readily reproducible on arm64 that would
> > > help a lot. Juri, do you have access to an arm64 test box?
> >
> > As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
> > anything which helps in resolving this.
>
> I was hoping you could perhaps share a binary with Juri privately or
> with RHT (same difference etc), such that he can poke at it too.
Sure, there is nothing secret about it, it is just that we have not gotten around to open-sourcing all parts of it
just yet.
The UEFI aarch64 embedded Linux image I am using may be found here [1]. Plus matching bmap file should you
fancy using that [2]. And the SSH key may help when interacting with the system (e.g. that is how I trigger the
failure as the console is quite busy with tracing) [3]. However, that was built by CI and does not contain a
kernel with the below patch applied yet. I manually dumped the kernel config and compiled v6.12.1 with your patch
applied and deployed it (to /lib/modules, /usr/lib/kernel, etc.) in the below case where I provide the dump.
> Anyway, if you don't mind a bit of back and forth,
Sure.
> would you mind adding
> the below patch to your kernel and doing:
>
> (all assuming your kernel has ftrace enabled)
>
> echo 1 > /sys/kernel/debug/tracing/options/stacktrace
> echo 1 > /proc/sys/kernel/traceoff_on_warning
>
> running your test to failure and then dumping the trace into a file
> like:
>
> cat /sys/kernel/debug/tracing/trace > ~/trace
Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
buffer any longer. So I did the following instead on the serial console terminal:
tail -f /sys/kernel/debug/tracing/trace
Not sure whether there is any better way to go about this. Plus, even though we run the serial console at 1.5
megabaud, I am not fully sure whether it was able to keep up logging what you are looking for.
> Then compress the file (bzip2 or whatever is popular these days)
xz or zstd (;-p)
> and
> send it my way along with a dmesg dump (private is fine -- these things
> tend to be large-ish).
As mentioned before, there is nothing secret about it. Please find it here [4].
> Hopefully, this will give us a little clue as to where the double
> enqueue happens.
Yes, and do not hesitate to ask for any additional information, etc. -- we are happy to help. Thanks!
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d9d5a702f1a6..b9cd9b40a19f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> scoped_guard (rq_lock, rq) {
> struct rq_flags *rf = &scope.rf;
>
> + if (dl_se == &rq->fair_server) {
> + trace_printk("timer fair server %d throttled %d\n",
> + cpu_of(rq), dl_se->dl_throttled);
> + }
> +
> if (!dl_se->dl_throttled || !dl_se->dl_runtime)
> return HRTIMER_NORESTART;
>
> @@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> rq_lock(rq, &rf);
> }
>
> + if (dl_se == &rq->fair_server)
> + trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
> sched_clock_tick();
> update_rq_clock(rq);
>
> @@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("enqueue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("dequeue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> if (RB_EMPTY_NODE(&dl_se->rb_node))
> return;
[1] https://drive.codethink.co.uk/s/N8CQipaNNN45gYM
[2] https://drive.codethink.co.uk/s/mpcPawXpCjPL8D3
[3] https://drive.codethink.co.uk/s/8RjHNTQQRpYgaLc
[4] https://drive.codethink.co.uk/s/MWtzWjLDtdD3E5i
Cheers
Marcel
On Mon, 02 Dec 2024 19:46:21 +0100 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace > buffer any longer. So I did the following instead on the serial console terminal: > > tail -f /sys/kernel/debug/tracing/trace > > Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5 > megabaud I am not fully sure whether it was able to keep up logging what you are looking for. If the memory of the machine is persistent (it is on several of my machines) you can use the persistent ring buffer. Add to the kernel command line (or enable bootconfig that attaches a command line to the kernel if you can't change the parameters): reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace The above will create a "boot_map" instance with tracing off on boot and trace_printk() going to it. Start tracing: trace-cmd start -B boot_map -p nop If or replace "-p nop" with any events or tracers you want, including function tracing", then after a crash. trace-cmd show -B boot_map If the memory is persistent and you don't use KASLR (may want to also add nokaslr if arm64 supports KASLR it and you use it), you should get the trace right up to the crash. See Documentation/trace/debugging.rst for more details. -- Steve
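Condensed into one place, the persistent-ring-buffer recipe above looks roughly like this (the reserve_mem sizes are the example values given; double-check the exact syntax against Documentation/trace/debugging.rst):

# Kernel command line (or via bootconfig), nokaslr only if KASLR is in use:
#   reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace nokaslr
trace-cmd start -B boot_map -p nop     # or enable specific events instead of "-p nop"
# ... reproduce the crash, then reboot ...
trace-cmd show -B boot_map > boot_trace.txt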
Sorry for the delay, I got laid low by snot monsters :/
On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
> Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> buffer any longer. So I did the following instead on the serial console terminal:
>
> tail -f /sys/kernel/debug/tracing/trace
>
> Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5
> megabaud I am not fully sure whether it was able to keep up logging what you are looking for.
Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
might be of help. And yes, dumping trace buffers over 1m5 serial lines
is tedious -- been there done that, got a t-shirt and all that.
Still, let me see if perhaps making that WARN in enqueue_dl_entity()
return makes the whole thing less fatal.
I've included the traceoff_on_warning and ftrace_dump in the code, so
all you really need to still do is enable the stacktrace option.
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
> Yes, and do not hesitate to ask for any additional information et. al. we are happy to help. Thanks!
Could I bother you to try again with the below patch?
There are two new hunks vs the previous one, the hunk in
enqueue_dl_entity() (the very last bit) will stop tracing and dump the
buffers when that condition is hit in addition to then aborting the
double enqueue, hopefully leaving the system in a slightly better state.
The other new hunk is the one for dl_server_stop() (second hunk), while
going over the code last week, I found that this might be a possible
hole leading to the observed double enqueue, so fingers crossed.
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 33b4646f8b24..bd1df7612482 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
scoped_guard (rq_lock, rq) {
struct rq_flags *rf = &scope.rf;
+ if (dl_se == &rq->fair_server) {
+ trace_printk("timer fair server %d throttled %d\n",
+ cpu_of(rq), dl_se->dl_throttled);
+ }
+
if (!dl_se->dl_throttled || !dl_se->dl_runtime)
return HRTIMER_NORESTART;
@@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
void dl_server_stop(struct sched_dl_entity *dl_se)
{
+ if (current->dl_server == dl_se) {
+ struct rq *rq = rq_of_dl_se(dl_se);
+ trace_printk("stop fair server %d\n", cpu_of(rq));
+ current->dl_server = NULL;
+ }
+
if (!dl_se->dl_runtime)
return;
@@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq_lock(rq, &rf);
}
+ if (dl_se == &rq->fair_server)
+ trace_printk("inactive fair server %d\n", cpu_of(rq));
+
sched_clock_tick();
update_rq_clock(rq);
@@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("enqueue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
@@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("dequeue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
if (RB_EMPTY_NODE(&dl_se->rb_node))
return;
@@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
static void
enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
- WARN_ON_ONCE(on_dl_rq(dl_se));
+ if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
+ tracing_off();
+ ftrace_dump(DUMP_ALL);
+ return;
+ }
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
On Mon, 2024-12-09 at 10:49 +0100, Peter Zijlstra wrote:
>
> Sorry for the delay, I got laid low by snot monsters :/
>
> On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
>
> > Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the
> > trace
> > buffer any longer. So I did the following instead on the serial console terminal:
> >
> > tail -f /sys/kernel/debug/tracing/trace
> >
> > Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5
> > megabaud I am not fully sure whether it was able to keep up logging what you are looking for.
>
> Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
> might be of help. And yes, dumping trace buffers over 1m5 serial lines
> is tedious -- been there done that, got a t-shirt and all that.
>
> Still, let me see if perhaps making that WARN in enqueue_dl_entity()
> return makes the whole thing less fatal.
>
> I've included the traceoff_on_warning and ftrace_dump in the code, so
> all you really need to still do is enable the stacktrace option.
>
> echo 1 > /sys/kernel/debug/tracing/options/stacktrace
>
> > Yes, and do not hesitate to ask for any additional information et. al. we are happy to help. Thanks!
>
> Could I bother you to try again with the below patch?
Sure, here you go.
https://drive.codethink.co.uk/s/HniZCtccDBMHpAK
> There are two new hunks vs the previous one, the hunk in
> enqueue_dl_entity() (the very last bit) will stop tracing and dump the
> buffers when that condition is hit in addition to then aborting the
> double enqueue, hopefully leaving the system is a slightly better state.
>
> The other new hunk is the one for dl_server_stop() (second hunk), while
> going over the code last week, I found that this might be a possible
> hole leading to the observed double enqueue, so fingers crossed.
>
> ---
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 33b4646f8b24..bd1df7612482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> scoped_guard (rq_lock, rq) {
> struct rq_flags *rf = &scope.rf;
>
> + if (dl_se == &rq->fair_server) {
> + trace_printk("timer fair server %d throttled %d\n",
> + cpu_of(rq), dl_se->dl_throttled);
> + }
> +
> if (!dl_se->dl_throttled || !dl_se->dl_runtime)
> return HRTIMER_NORESTART;
>
> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>
> void dl_server_stop(struct sched_dl_entity *dl_se)
> {
> + if (current->dl_server == dl_se) {
> + struct rq *rq = rq_of_dl_se(dl_se);
> + trace_printk("stop fair server %d\n", cpu_of(rq));
> + current->dl_server = NULL;
> + }
> +
> if (!dl_se->dl_runtime)
> return;
>
> @@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> rq_lock(rq, &rf);
> }
>
> + if (dl_se == &rq->fair_server)
> + trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
> sched_clock_tick();
> update_rq_clock(rq);
>
> @@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("enqueue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("dequeue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> if (RB_EMPTY_NODE(&dl_se->rb_node))
> return;
> @@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> static void
> enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> {
> - WARN_ON_ONCE(on_dl_rq(dl_se));
> + if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
> + tracing_off();
> + ftrace_dump(DUMP_ALL);
> + return;
> + }
>
> update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);