Even after mm/vmstat:online teardown, the shepherd may still queue work
for the dying CPU until it is removed from the online mask. While this is
quite rare, it means that after unbind_workers() unbinds a per-cpu
kworker, vmstat_update for the dying CPU can run on an unrelated CPU
before the atomic AP states are entered.

With CONFIG_DEBUG_PREEMPT=y, this results in the following BUG with the
backtrace below.
BUG: using smp_processor_id() in preemptible [00000000] code: \
kworker/7:3/1702
caller is refresh_cpu_vm_stats+0x235/0x5f0
CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
Tainted: [N]=TEST
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
refresh_cpu_vm_stats+0x235/0x5f0
vmstat_update+0x17/0xa0
process_one_work+0x869/0x1aa0
worker_thread+0x5e5/0x1100
kthread+0x29e/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
symmetrically enable it on startup.
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
---
v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
---
mm/vmstat.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..0889b75cef14 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
        if (!node_state(cpu_to_node(cpu), N_CPU)) {
                node_set_state(cpu_to_node(cpu), N_CPU);
        }
+       enable_delayed_work(&per_cpu(vmstat_work, cpu));
 
        return 0;
 }
 
 static int vmstat_cpu_down_prep(unsigned int cpu)
 {
-       cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+       disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
        return 0;
 }
--
2.43.0
On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> Even after mm/vmstat:online teardown, shepherd may still queue work for
> the dying cpu until the cpu is removed from online mask. While it's
> quite rare, this means that after unbind_workers() unbinds a per-cpu
> kworker, it potentially runs vmstat_update for the dying CPU on an
> irrelevant cpu before entering atomic AP states.
> When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
> backtrace.
>
> BUG: using smp_processor_id() in preemptible [00000000] code: \
> kworker/7:3/1702
> [snip]
>
> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
> symmetrically enable it on startup.
>
> Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Hi,
I observed a warning both in qemu and on real hardware, which I bisected to this commit:
[ 0.087733] ------------[ cut here ]------------
[ 0.087733] workqueue: work disable count underflowed
[ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
This is:
static void work_offqd_enable(struct work_offq_data *offqd)
{
        if (likely(offqd->disable > 0))
                offqd->disable--;
        else
                WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
}
So, based on this code, an enable is presumably only valid if the work
was previously disabled, yet this call happens unconditionally on
startup, when the work has never been disabled? I'm not hugely familiar
with the delayed workqueue implementation details.
[ 0.087733] Modules linked in:
[ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
[ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 0.087733] RIP: 0010:enable_work+0xb5/0xc0
[ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
[ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
[ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
[ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
[ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
[ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
[ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
[ 0.087733] FS: 0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
[ 0.087733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
[ 0.087733] PKRU: 55555554
[ 0.087733] Call Trace:
[ 0.087733] <TASK>
[ 0.087733] ? enable_work+0xb5/0xc0
[ 0.087733] ? __warn.cold+0x93/0xf2
[ 0.087733] ? enable_work+0xb5/0xc0
[ 0.087733] ? report_bug+0xff/0x140
[ 0.087733] ? handle_bug+0x54/0x90
[ 0.087733] ? exc_invalid_op+0x17/0x70
[ 0.087733] ? asm_exc_invalid_op+0x1a/0x20
[ 0.087733] ? __pfx_vmstat_cpu_online+0x10/0x10
[ 0.087733] ? enable_work+0xb5/0xc0
[ 0.087733] vmstat_cpu_online+0x5c/0x70
[ 0.087733] cpuhp_invoke_callback+0x133/0x440
[ 0.087733] cpuhp_thread_fun+0x95/0x150
[ 0.087733] smpboot_thread_fn+0xd5/0x1d0
[ 0.087734] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 0.087735] kthread+0xc8/0xf0
[ 0.087737] ? __pfx_kthread+0x10/0x10
[ 0.087738] ret_from_fork+0x2c/0x50
[ 0.087739] ? __pfx_kthread+0x10/0x10
[ 0.087740] ret_from_fork_asm+0x1a/0x30
[ 0.087742] </TASK>
[ 0.087742] ---[ end trace 0000000000000000 ]---
> ---
> v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
> ---
> mm/vmstat.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..0889b75cef14 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
> if (!node_state(cpu_to_node(cpu), N_CPU)) {
> node_set_state(cpu_to_node(cpu), N_CPU);
> }
> + enable_delayed_work(&per_cpu(vmstat_work, cpu));
Probably needs to be 'if disabled' here, as this is invoked on normal
startup, when the work won't have been disabled?

Had a brief look at the code and couldn't see how that could be done,
however... and one would need to be careful about races... Tricky!
>
> return 0;
> }
>
> static int vmstat_cpu_down_prep(unsigned int cpu)
> {
> - cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> return 0;
> }
>
> --
> 2.43.0
>
>
Let me know if you need any more details, .config etc.
I noticed this warning on a real box too (in both cases running akpm's
mm-unstable branch), FWIW.
On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> > [snip]
>
> I observed a warning in my qemu and real hardware, which I bisected to this commit:
>
> [snip]
>
> Let me know if you need any more details, .config etc.
>
> I noticed this warning on a real box too (in both cases running akpm's
> mm-unstable branch), FWIW.
Thank you for the report. I was able to reproduce the warning, and now
wonder how I missed it. My oversight, apologies.
In my current view, the simplest solution would be to make sure each
CPU's vmstat_work stays disabled until vmstat_cpu_online() runs for that
CPU, even during boot-up. The following patch suppresses the warning:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0889b75cef14..19ceed5d34bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
 {
        int cpu;
 
-       for_each_possible_cpu(cpu)
+       for_each_possible_cpu(cpu) {
                INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
                                vmstat_update);
 
+               /* will be enabled on vmstat_cpu_online */
+               disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+       }
+
        schedule_delayed_work(&shepherd,
                round_jiffies_relative(sysctl_stat_interval));
 }
If you think of a better solution later, please let me know. Otherwise,
I'll submit a follow-up fix patch with the above diff.
Thanks.
-Koichiro
+cc tglx, peterz for insight on CPU hot plug
On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> > [snip]
>
> Thank you for the report. I was able to reproduce the warning and now
> wonder how I missed it.. My oversight, apologies.
>
> In my current view, the simplest solution would be to make sure a local
> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> during boot-up. The following patch suppresses the warning:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..19ceed5d34bf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> {
> int cpu;
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> vmstat_update);
>
> + /* will be enabled on vmstat_cpu_online */
> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + }
> +
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> }
>
> If you think of a better solution later, please let me know. Otherwise,
> I'll submit a follow-up fix patch with the above diff.
Thanks, this resolves the problem, but are we sure that _all_ CPUs will
definitely call vmstat_cpu_online()?

I did a bit of printk debugging, and it seems this _didn't_ online CPU 0,
presumably the boot CPU, which calls this function in the first instance?
I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
explicitly, which does _not_ call the callback; though even if it did,
that would be too early, since start_shepherd_timer() is only called
_after_ it anyway.

So yeah, unless I'm missing something, I think this patch is broken.
I have added Thomas and Peter to give some insight on the CPU hotplug side.
It feels like the patch really needs an 'enable if not already enabled'
call in vmstat_cpu_online().
>
> Thanks.
>
> -Koichiro
On Mon, Jan 06, 2025 at 10:52:37AM +0000, Lorenzo Stoakes wrote:
> +cc tglx, peterz for insight on CPU hot plug
>
> On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
> > [snip]
>
> Thanks, this resolves the problem, but are we sure that _all_ CPUs will
> definitely call vmstat_cpu_online()?
>
> I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> presumably the boot CPU which calls this function in the first instance?
>
> I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
> explicitly which does _not_ call the callback, though even if it did this
> would be too early as it calls start_shepherd_timer() _after_ this anyway.
>
> So yeah, unless I'm missing something, I think this patch is broken.
You're absolutely right, thanks a lot.
I also appreciate you testing it, thanks everyone.
>
> I have added Thomas and Peter to give some insight on the CPU hotplug side.
>
> It feels like the patch really needs an 'enable if not already enabled'
> call in vmstat_cpu_online().
Right. While not fully polished yet, I've tested the following diff:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0889b75cef14..f967aa22392f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1937,7 +1937,11 @@ static const struct seq_operations vmstat_op = {
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
+struct vmstat_work {
+       struct delayed_work dwork;
+       bool enabled;
+};
+static DEFINE_PER_CPU(struct vmstat_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
 static int vmstat_late_init_done;
@@ -2015,7 +2019,7 @@ static void vmstat_update(struct work_struct *w)
                 * update worker thread.
                 */
                queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
-                               this_cpu_ptr(&vmstat_work),
+                               &this_cpu_ptr(&vmstat_work)->dwork,
                                round_jiffies_relative(sysctl_stat_interval));
        }
 }
@@ -2059,7 +2063,7 @@ void quiet_vmstat(void)
        if (system_state != SYSTEM_RUNNING)
                return;
 
-       if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
+       if (!delayed_work_pending(&this_cpu_ptr(&vmstat_work)->dwork))
                return;
 
        if (!need_update(smp_processor_id()))
@@ -2091,7 +2095,7 @@ static void vmstat_shepherd(struct work_struct *w)
        cpus_read_lock();
        /* Check processors whose vmstat worker threads have been disabled */
        for_each_online_cpu(cpu) {
-               struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+               struct delayed_work *dw = &per_cpu(vmstat_work, cpu).dwork;
 
                /*
                 * In kernel users of vmstat counters either require the precise value and
@@ -2120,11 +2124,14 @@ static void vmstat_shepherd(struct work_struct *w)
 static void __init start_shepherd_timer(void)
 {
+       struct vmstat_work *vmstat;
        int cpu;
 
-       for_each_possible_cpu(cpu)
-               INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
-                               vmstat_update);
+       for_each_possible_cpu(cpu) {
+               vmstat = per_cpu_ptr(&vmstat_work, cpu);
+               INIT_DEFERRABLE_WORK(&vmstat->dwork, vmstat_update);
+               vmstat->enabled = true;
+       }
 
        schedule_delayed_work(&shepherd,
                round_jiffies_relative(sysctl_stat_interval));
@@ -2142,20 +2149,28 @@ static void __init init_cpu_node_state(void)
 static int vmstat_cpu_online(unsigned int cpu)
 {
+       struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
+
        if (vmstat_late_init_done)
                refresh_zone_stat_thresholds();
 
        if (!node_state(cpu_to_node(cpu), N_CPU)) {
                node_set_state(cpu_to_node(cpu), N_CPU);
        }
-       enable_delayed_work(&per_cpu(vmstat_work, cpu));
+
+       if (!vmstat->enabled) {
+               enable_delayed_work(&vmstat->dwork);
+               vmstat->enabled = true;
+       }
 
        return 0;
 }
 
 static int vmstat_cpu_down_prep(unsigned int cpu)
 {
-       disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+       struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
+
+       disable_delayed_work_sync(&vmstat->dwork);
        return 0;
 }
Lorenzo, and Thomas and Peter if you're available, I'd greatly appreciate
any thoughts or feedback on this.
Thanks.
-Koichiro Den
>
> >
> > Thanks.
> >
> > -Koichiro
On Mon, Jan 06, 2025 at 10:03:57PM +0900, Koichiro Den wrote:
> On Mon, Jan 06, 2025 at 10:52:37AM +0000, Lorenzo Stoakes wrote:
> > +cc tglx, peterz for insight on CPU hot plug
> >
> > On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
> > > On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> > > > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> > > > > Even after mm/vmstat:online teardown, shepherd may still queue work for
> > > > > the dying cpu until the cpu is removed from online mask. While it's
> > > > > quite rare, this means that after unbind_workers() unbinds a per-cpu
> > > > > kworker, it potentially runs vmstat_update for the dying CPU on an
> > > > > irrelevant cpu before entering atomic AP states.
> > > > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
> > > > > backtrace.
> > > > >
> > > > > BUG: using smp_processor_id() in preemptible [00000000] code: \
> > > > > kworker/7:3/1702
> > > > > caller is refresh_cpu_vm_stats+0x235/0x5f0
> > > > > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
> > > > > Tainted: [N]=TEST
> > > > > Workqueue: mm_percpu_wq vmstat_update
> > > > > Call Trace:
> > > > > <TASK>
> > > > > dump_stack_lvl+0x8d/0xb0
> > > > > check_preemption_disabled+0xce/0xe0
> > > > > refresh_cpu_vm_stats+0x235/0x5f0
> > > > > vmstat_update+0x17/0xa0
> > > > > process_one_work+0x869/0x1aa0
> > > > > worker_thread+0x5e5/0x1100
> > > > > kthread+0x29e/0x380
> > > > > ret_from_fork+0x2d/0x70
> > > > > ret_from_fork_asm+0x1a/0x30
> > > > > </TASK>
> > > > >
> > > > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
> > > > > symmetrically enable it on startup.
> > > > >
> > > > > Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
> > > >
> > > > Hi,
> > > >
> > > > I observed a warning in my qemu and real hardware, which I bisected to this commit:
> > > >
> > > > [ 0.087733] ------------[ cut here ]------------
> > > > [ 0.087733] workqueue: work disable count underflowed
> > > > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
> > > >
> > > > This is:
> > > >
> > > > static void work_offqd_enable(struct work_offq_data *offqd)
> > > > {
> > > > if (likely(offqd->disable > 0))
> > > > offqd->disable--;
> > > > else
> > > > WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
> > > > }
> > > >
> > > > So (based on this code) presumably an enable is only required if previously
> > > > disabled, and this code is being called on startup unconditionally without
> > > > the work having been disabled previously? I'm not hugely familiar with
> > > > delayed workqueue implementation details.
> > > >
> > > > [ 0.087733] Modules linked in:
> > > > [ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
> > > > [ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > > > [ 0.087733] RIP: 0010:enable_work+0xb5/0xc0
> > > > [ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
> > > > [ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
> > > > [ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
> > > > [ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
> > > > [ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
> > > > [ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
> > > > [ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
> > > > [ 0.087733] FS: 0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
> > > > [ 0.087733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
> > > > [ 0.087733] PKRU: 55555554
> > > > [ 0.087733] Call Trace:
> > > > [ 0.087733] <TASK>
> > > > [ 0.087733] ? enable_work+0xb5/0xc0
> > > > [ 0.087733] ? __warn.cold+0x93/0xf2
> > > > [ 0.087733] ? enable_work+0xb5/0xc0
> > > > [ 0.087733] ? report_bug+0xff/0x140
> > > > [ 0.087733] ? handle_bug+0x54/0x90
> > > > [ 0.087733] ? exc_invalid_op+0x17/0x70
> > > > [ 0.087733] ? asm_exc_invalid_op+0x1a/0x20
> > > > [ 0.087733] ? __pfx_vmstat_cpu_online+0x10/0x10
> > > > [ 0.087733] ? enable_work+0xb5/0xc0
> > > > [ 0.087733] vmstat_cpu_online+0x5c/0x70
> > > > [ 0.087733] cpuhp_invoke_callback+0x133/0x440
> > > > [ 0.087733] cpuhp_thread_fun+0x95/0x150
> > > > [ 0.087733] smpboot_thread_fn+0xd5/0x1d0
> > > > [ 0.087734] ? __pfx_smpboot_thread_fn+0x10/0x10
> > > > [ 0.087735] kthread+0xc8/0xf0
> > > > [ 0.087737] ? __pfx_kthread+0x10/0x10
> > > > [ 0.087738] ret_from_fork+0x2c/0x50
> > > > [ 0.087739] ? __pfx_kthread+0x10/0x10
> > > > [ 0.087740] ret_from_fork_asm+0x1a/0x30
> > > > [ 0.087742] </TASK>
> > > > [ 0.087742] ---[ end trace 0000000000000000 ]---
> > > >
> > > >
> > > > > ---
> > > > > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
> > > > > ---
> > > > > mm/vmstat.c | 3 ++-
> > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > > > index 4d016314a56c..0889b75cef14 100644
> > > > > --- a/mm/vmstat.c
> > > > > +++ b/mm/vmstat.c
> > > > > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
> > > > > if (!node_state(cpu_to_node(cpu), N_CPU)) {
> > > > > node_set_state(cpu_to_node(cpu), N_CPU);
> > > > > }
> > > > > + enable_delayed_work(&per_cpu(vmstat_work, cpu));
> > > >
> > > > Probably needs to be 'if disabled' here, as this is invoked on normal
> > > > startup when the work won't have been disabled?
> > > >
> > > > Had a brief look at code and couldn't see how that could be done
> > > > however... and one would need to be careful about races... Tricky!
> > > >
> > > > >
> > > > > return 0;
> > > > > }
> > > > >
> > > > > static int vmstat_cpu_down_prep(unsigned int cpu)
> > > > > {
> > > > > - cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > > > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > > > return 0;
> > > > > }
> > > > >
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > >
> > > >
> > > > Let me know if you need any more details, .config etc.
> > > >
> > > > I noticed this warning on a real box too (in both cases running akpm's
> > > > mm-unstable branch), FWIW.
> > >
> > > Thank you for the report. I was able to reproduce the warning and now
> > > wonder how I missed it.. My oversight, apologies.
> > >
> > > In my current view, the simplest solution would be to make sure a local
> > > vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> > > during boot-up. The following patch suppresses the warning:
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 0889b75cef14..19ceed5d34bf 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> > > {
> > > int cpu;
> > >
> > > - for_each_possible_cpu(cpu)
> > > + for_each_possible_cpu(cpu) {
> > > INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> > > vmstat_update);
> > >
> > > + /* will be enabled on vmstat_cpu_online */
> > > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > + }
> > > +
> > > schedule_delayed_work(&shepherd,
> > > round_jiffies_relative(sysctl_stat_interval));
> > > }
> > >
> > > If you think of a better solution later, please let me know. Otherwise,
> > > I'll submit a follow-up fix patch with the above diff.
> >
> > Thanks, this resolves the problem, but are we sure that _all_ CPUs will
> > definitely call vmstat_cpu_online()?
> >
> > I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> > presumably the boot CPU which calls this function in the first instance?
> >
> > I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
> > explicitly which does _not_ call the callback, though even if it did this
> > would be too early as it calls start_shepherd_timer() _after_ this anyway.
> >
> > So yeah, unless I'm missing something, I think this patch is broken.
>
> You're absolutely right, thanks a lot.
> I also appreciate you testing it, thanks everyone.
>
> >
> > I have added Thomas and Peter to give some insight on the CPU hotplug side.
> >
> > It feels like the patch really needs an 'enable if not already enabled'
> > call in vmstat_cpu_online().
>
> Right. While not fully polished yet, I've tested the following diff:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..f967aa22392f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1937,7 +1937,11 @@ static const struct seq_operations vmstat_op = {
> #endif /* CONFIG_PROC_FS */
>
> #ifdef CONFIG_SMP
> -static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> +struct vmstat_work {
> + struct delayed_work dwork;
> + bool enabled;
> +};
> +static DEFINE_PER_CPU(struct vmstat_work, vmstat_work);
> int sysctl_stat_interval __read_mostly = HZ;
> static int vmstat_late_init_done;
>
> @@ -2015,7 +2019,7 @@ static void vmstat_update(struct work_struct *w)
> * update worker thread.
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> - this_cpu_ptr(&vmstat_work),
> + &this_cpu_ptr(&vmstat_work)->dwork,
> round_jiffies_relative(sysctl_stat_interval));
> }
> }
> @@ -2059,7 +2063,7 @@ void quiet_vmstat(void)
> if (system_state != SYSTEM_RUNNING)
> return;
>
> - if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
> + if (!delayed_work_pending(&this_cpu_ptr(&vmstat_work)->dwork))
> return;
>
> if (!need_update(smp_processor_id()))
> @@ -2091,7 +2095,7 @@ static void vmstat_shepherd(struct work_struct *w)
> cpus_read_lock();
> /* Check processors whose vmstat worker threads have been disabled */
> for_each_online_cpu(cpu) {
> - struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> + struct delayed_work *dw = &per_cpu(vmstat_work, cpu).dwork;
>
> /*
> * In kernel users of vmstat counters either require the precise value and
> @@ -2120,11 +2124,14 @@ static void vmstat_shepherd(struct work_struct *w)
>
> static void __init start_shepherd_timer(void)
> {
> + struct vmstat_work *vmstat;
> int cpu;
>
> - for_each_possible_cpu(cpu)
> - INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> - vmstat_update);
> + for_each_possible_cpu(cpu) {
> + vmstat = per_cpu_ptr(&vmstat_work, cpu);
> + INIT_DEFERRABLE_WORK(&vmstat->dwork, vmstat_update);
> + vmstat->enabled = true;
> + }
>
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> @@ -2142,20 +2149,28 @@ static void __init init_cpu_node_state(void)
>
> static int vmstat_cpu_online(unsigned int cpu)
> {
> + struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
> +
> if (vmstat_late_init_done)
> refresh_zone_stat_thresholds();
>
> if (!node_state(cpu_to_node(cpu), N_CPU)) {
> node_set_state(cpu_to_node(cpu), N_CPU);
> }
> - enable_delayed_work(&per_cpu(vmstat_work, cpu));
> +
> + if (!vmstat->enabled) {
> + enable_delayed_work(&vmstat->dwork);
> + vmstat->enabled = true;
> + }
>
> return 0;
> }
>
> static int vmstat_cpu_down_prep(unsigned int cpu)
> {
> - disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
> +
> + disable_delayed_work_sync(&vmstat->dwork);
The line `vmstat->enabled = false;` was missing here.
With that fixed, I've conducted the following tests:
* Boot-up check: confirmed that vmstat_update() runs on all cpus after boot-up.
* Cpuhp transitions: verified resilience during random CPU hotplug
transitions using the steps below:
1. enabled dynamic ftrace on vmstat_update()
2. ran the following simple silly script for a while
#!/bin/bash
while :; do
now=$(cat /sys/devices/system/cpu/cpu7/hotplug/state)
next=$(shuf --random-source=/dev/urandom -i 0-238 -n 1)
echo "${now} -> ${next}"
echo $next > /sys/devices/system/cpu/cpu7/hotplug/target 2>/dev/null
sleep 0.001
done
3. after stopping the script, ensured that the cpu was fully online again
by writing the largest value to the 'state' file.
4. verified that vmstat_update() was working on the cpu.
Let me know if you have any suggestions or concerns.
Thanks.
-Koichiro Den
> return 0;
> }
>
> Lorenzo, and Thomas and Peter if you're available, I'd greatly appreciate
> any thoughts or feedback on this.
>
> Thanks.
>
> -Koichiro Den
>
>
> >
> > >
> > > Thanks.
> > >
> > > -Koichiro
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
> +cc tglx, peterz for insight on CPU hot plug
>
> On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
>> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
>> > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
>> > > Even after mm/vmstat:online teardown, shepherd may still queue work for
>> > > the dying cpu until the cpu is removed from online mask. While it's
>> > > quite rare, this means that after unbind_workers() unbinds a per-cpu
>> > > kworker, it potentially runs vmstat_update for the dying CPU on an
>> > > irrelevant cpu before entering atomic AP states.
>> > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
>> > > backtrace.
>> > >
>> > > BUG: using smp_processor_id() in preemptible [00000000] code: \
>> > > kworker/7:3/1702
>> > > caller is refresh_cpu_vm_stats+0x235/0x5f0
>> > > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
>> > > Tainted: [N]=TEST
>> > > Workqueue: mm_percpu_wq vmstat_update
>> > > Call Trace:
>> > > <TASK>
>> > > dump_stack_lvl+0x8d/0xb0
>> > > check_preemption_disabled+0xce/0xe0
>> > > refresh_cpu_vm_stats+0x235/0x5f0
>> > > vmstat_update+0x17/0xa0
>> > > process_one_work+0x869/0x1aa0
>> > > worker_thread+0x5e5/0x1100
>> > > kthread+0x29e/0x380
>> > > ret_from_fork+0x2d/0x70
>> > > ret_from_fork_asm+0x1a/0x30
>> > > </TASK>
>> > >
>> > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
>> > > symmetrically enable it on startup.
>> > >
>> > > Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
>> >
>> > Hi,
>> >
>> > I observed a warning in my qemu and real hardware, which I bisected to this commit:
>> >
>> > [ 0.087733] ------------[ cut here ]------------
>> > [ 0.087733] workqueue: work disable count underflowed
>> > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
>> >
>> > This is:
>> >
>> > static void work_offqd_enable(struct work_offq_data *offqd)
>> > {
>> > if (likely(offqd->disable > 0))
>> > offqd->disable--;
>> > else
>> > WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
>> > }
>> >
>> > So (based on this code) presumably an enable is only required if previously
>> > disabled, and this code is being called on startup unconditionally without
>> > the work having been disabled previously? I'm not hugely familiar with
>> > delayed workqueue implementation details.
>> >
>> > [ 0.087733] Modules linked in:
>> > [ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
>> > [ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>> > [ 0.087733] RIP: 0010:enable_work+0xb5/0xc0
>> > [ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
>> > [ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
>> > [ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
>> > [ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
>> > [ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
>> > [ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
>> > [ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
>> > [ 0.087733] FS: 0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
>> > [ 0.087733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
>> > [ 0.087733] PKRU: 55555554
>> > [ 0.087733] Call Trace:
>> > [ 0.087733] <TASK>
>> > [ 0.087733] ? enable_work+0xb5/0xc0
>> > [ 0.087733] ? __warn.cold+0x93/0xf2
>> > [ 0.087733] ? enable_work+0xb5/0xc0
>> > [ 0.087733] ? report_bug+0xff/0x140
>> > [ 0.087733] ? handle_bug+0x54/0x90
>> > [ 0.087733] ? exc_invalid_op+0x17/0x70
>> > [ 0.087733] ? asm_exc_invalid_op+0x1a/0x20
>> > [ 0.087733] ? __pfx_vmstat_cpu_online+0x10/0x10
>> > [ 0.087733] ? enable_work+0xb5/0xc0
>> > [ 0.087733] vmstat_cpu_online+0x5c/0x70
>> > [ 0.087733] cpuhp_invoke_callback+0x133/0x440
>> > [ 0.087733] cpuhp_thread_fun+0x95/0x150
>> > [ 0.087733] smpboot_thread_fn+0xd5/0x1d0
>> > [ 0.087734] ? __pfx_smpboot_thread_fn+0x10/0x10
>> > [ 0.087735] kthread+0xc8/0xf0
>> > [ 0.087737] ? __pfx_kthread+0x10/0x10
>> > [ 0.087738] ret_from_fork+0x2c/0x50
>> > [ 0.087739] ? __pfx_kthread+0x10/0x10
>> > [ 0.087740] ret_from_fork_asm+0x1a/0x30
>> > [ 0.087742] </TASK>
>> > [ 0.087742] ---[ end trace 0000000000000000 ]---
>> >
>> >
>> > > ---
>> > > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
>> > > ---
>> > > mm/vmstat.c | 3 ++-
>> > > 1 file changed, 2 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > > index 4d016314a56c..0889b75cef14 100644
>> > > --- a/mm/vmstat.c
>> > > +++ b/mm/vmstat.c
>> > > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
>> > > if (!node_state(cpu_to_node(cpu), N_CPU)) {
>> > > node_set_state(cpu_to_node(cpu), N_CPU);
>> > > }
>> > > + enable_delayed_work(&per_cpu(vmstat_work, cpu));
>> >
>> > Probably needs to be 'if disabled' here, as this is invoked on normal
>> > startup when the work won't have been disabled?
>> >
>> > Had a brief look at code and couldn't see how that could be done
>> > however... and one would need to be careful about races... Tricky!
>> >
>> > >
>> > > return 0;
>> > > }
>> > >
>> > > static int vmstat_cpu_down_prep(unsigned int cpu)
>> > > {
>> > > - cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > > return 0;
>> > > }
>> > >
>> > > --
>> > > 2.43.0
>> > >
>> > >
>> >
>> > Let me know if you need any more details, .config etc.
>> >
>> > I noticed this warning on a real box too (in both cases running akpm's
>> > mm-unstable branch), FWIW.
>>
>> Thank you for the report. I was able to reproduce the warning and now
>> wonder how I missed it.. My oversight, apologies.
>>
>> In my current view, the simplest solution would be to make sure a local
>> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
>> during boot-up. The following patch suppresses the warning:
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 0889b75cef14..19ceed5d34bf 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
>> {
>> int cpu;
>>
>> - for_each_possible_cpu(cpu)
>> + for_each_possible_cpu(cpu) {
>> INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
>> vmstat_update);
>>
>> + /* will be enabled on vmstat_cpu_online */
>> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> + }
>> +
>> schedule_delayed_work(&shepherd,
>> round_jiffies_relative(sysctl_stat_interval));
>> }
>>
>> If you think of a better solution later, please let me know. Otherwise,
>> I'll submit a follow-up fix patch with the above diff.
>
> Thanks, this resolves the problem, but are we sure that _all_ CPUs will
> definitely call vmstat_cpu_online()?
>
> I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> presumably the boot CPU which calls this function in the first instance?
FWIW, with the proposed fix I can see that all CPUs are online:
grep -H . /sys/devices/system/cpu/cpu*/online
/sys/devices/system/cpu/cpu0/online:1
/sys/devices/system/cpu/cpu1/online:1
/sys/devices/system/cpu/cpu2/online:1
/sys/devices/system/cpu/cpu3/online:1
/sys/devices/system/cpu/cpu4/online:1
/sys/devices/system/cpu/cpu5/online:1
/sys/devices/system/cpu/cpu6/online:1
/sys/devices/system/cpu/cpu7/online:1
>
> I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
> explicitly which does _not_ call the callback, though even if it did this
> would be too early as it calls start_shepherd_timer() _after_ this anyway.
>
> So yeah, unless I'm missing something, I think this patch is broken.
>
> I have added Thomas and Peter to give some insight on the CPU hotplug side.
>
> It feels like the patch really needs an 'enable if not already enabled'
> call in vmstat_cpu_online().
>
>>
>> Thanks.
>>
>> -Koichiro
On Mon, Jan 06, 2025 at 12:53:36PM +0000, Charalampos Mitrodimas wrote:
> > I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> > presumably the boot CPU which calls this function in the first instance?
>
> FWIW with the proposed fix I can see that all CPUs are online,
> grep -H . /sys/devices/system/cpu/cpu*/online
> /sys/devices/system/cpu/cpu0/online:1
> /sys/devices/system/cpu/cpu1/online:1
> /sys/devices/system/cpu/cpu2/online:1
> /sys/devices/system/cpu/cpu3/online:1
> /sys/devices/system/cpu/cpu4/online:1
> /sys/devices/system/cpu/cpu5/online:1
> /sys/devices/system/cpu/cpu6/online:1
> /sys/devices/system/cpu/cpu7/online:1
>
Sorry, maybe I phrased this badly. I'm not suggesting CPUs aren't coming
online; I'm saying it doesn't look like vmstat_cpu_online() will be called
for the boot CPU, which breaks the proposed fix (the delayed work for this
CPU will simply never be enabled in this case).
Naturally, the boot CPU is _already_ online at this point, which I imagine
is why this is the case. I had wondered if the function would be invoked on
the boot CPU _anyway_, but it appears not.
Hi Koichiro,
On Sat, Jan 4, 2025 at 5:00 AM Koichiro Den <koichiro.den@canonical.com> wrote:
> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> > > Even after mm/vmstat:online teardown, shepherd may still queue work for
> > > the dying cpu until the cpu is removed from online mask. While it's
> > > quite rare, this means that after unbind_workers() unbinds a per-cpu
> > > kworker, it potentially runs vmstat_update for the dying CPU on an
> > > irrelevant cpu before entering atomic AP states.
> > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
> > > backtrace.
> > >
> > > BUG: using smp_processor_id() in preemptible [00000000] code: \
> > > kworker/7:3/1702
> > > caller is refresh_cpu_vm_stats+0x235/0x5f0
> > > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
> > > Tainted: [N]=TEST
> > > Workqueue: mm_percpu_wq vmstat_update
> > > Call Trace:
> > > <TASK>
> > > dump_stack_lvl+0x8d/0xb0
> > > check_preemption_disabled+0xce/0xe0
> > > refresh_cpu_vm_stats+0x235/0x5f0
> > > vmstat_update+0x17/0xa0
> > > process_one_work+0x869/0x1aa0
> > > worker_thread+0x5e5/0x1100
> > > kthread+0x29e/0x380
> > > ret_from_fork+0x2d/0x70
> > > ret_from_fork_asm+0x1a/0x30
> > > </TASK>
> > >
> > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
> > > symmetrically enable it on startup.
> > >
> > > Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
> >
> > I observed a warning in my qemu and real hardware, which I bisected to this commit:
> >
> > [ 0.087733] ------------[ cut here ]------------
> > [ 0.087733] workqueue: work disable count underflowed
> > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
I am seeing the same on arm32 (R-Car M2-W) and arm64 (R-Car H3 ES2.0).
> Thank you for the report. I was able to reproduce the warning and now
> wonder how I missed it.. My oversight, apologies.
>
> In my current view, the simplest solution would be to make sure a local
> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> during boot-up. The following patch suppresses the warning:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..19ceed5d34bf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> {
> int cpu;
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> vmstat_update);
>
> + /* will be enabled on vmstat_cpu_online */
> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + }
> +
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> }
>
> If you think of a better solution later, please let me know. Otherwise,
> I'll submit a follow-up fix patch with the above diff.
Thank you, that fixes the warnings for me!
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> > Hi,
> >
> > I observed a warning in my qemu and real hardware, which I bisected to this commit:
> >
> > [ 0.087733] ------------[ cut here ]------------
> > [ 0.087733] workqueue: work disable count underflowed
> > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
FWIW, I hit something similar when testing a v6.13-rc6 defconfig on arm64;
when booting secondaries I always get a splat (trimmed):
| ------------[ cut here ]------------
| workqueue: work disable count underflowed
| WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4317 enable_work+0xfc/0x108
| Modules linked in:
| CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc6 #1
| Hardware name: linux,dummy-virt (DT)
| [...]
| Call trace:
| enable_work+0xfc/0x108 (P)
| enable_delayed_work+0x10/0x1c
| vmstat_cpu_online+0x88/0xbc
| cpuhp_invoke_callback+0x10c/0x208
| cpuhp_thread_fun+0xb0/0x1a0
| smpboot_thread_fn+0x20c/0x234
| kthread+0x110/0x114
| ret_from_fork+0x10/0x20
| ---[ end trace 0000000000000000 ]---
[...]
> In my current view, the simplest solution would be to make sure a local
> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> during boot-up. The following patch suppresses the warning:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..19ceed5d34bf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> {
> int cpu;
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> vmstat_update);
>
> + /* will be enabled on vmstat_cpu_online */
> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + }
> +
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> }
FWIW, the above solves the warning for me.
Mark.
Koichiro Den <koichiro.den@canonical.com> writes:
> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
>> On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
>> > Even after mm/vmstat:online teardown, shepherd may still queue work for
>> > the dying cpu until the cpu is removed from online mask. While it's
>> > quite rare, this means that after unbind_workers() unbinds a per-cpu
>> > kworker, it potentially runs vmstat_update for the dying CPU on an
>> > irrelevant cpu before entering atomic AP states.
>> > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
>> > backtrace.
>> >
>> > BUG: using smp_processor_id() in preemptible [00000000] code: \
>> > kworker/7:3/1702
>> > caller is refresh_cpu_vm_stats+0x235/0x5f0
>> > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
>> > Tainted: [N]=TEST
>> > Workqueue: mm_percpu_wq vmstat_update
>> > Call Trace:
>> > <TASK>
>> > dump_stack_lvl+0x8d/0xb0
>> > check_preemption_disabled+0xce/0xe0
>> > refresh_cpu_vm_stats+0x235/0x5f0
>> > vmstat_update+0x17/0xa0
>> > process_one_work+0x869/0x1aa0
>> > worker_thread+0x5e5/0x1100
>> > kthread+0x29e/0x380
>> > ret_from_fork+0x2d/0x70
>> > ret_from_fork_asm+0x1a/0x30
>> > </TASK>
>> >
>> > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
>> > symmetrically enable it on startup.
>> >
>> > Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
>>
>> Hi,
>>
>> I observed a warning in my qemu and real hardware, which I bisected to this commit:
>>
>> [ 0.087733] ------------[ cut here ]------------
>> [ 0.087733] workqueue: work disable count underflowed
>> [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
I also encountered this in my QEMU Debian installation.
>>
>> This is:
>>
>> static void work_offqd_enable(struct work_offq_data *offqd)
>> {
>> if (likely(offqd->disable > 0))
>> offqd->disable--;
>> else
>> WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
>> }
>>
>> So (based on this code) presumably an enable is only required if previously
>> disabled, and this code is being called on startup unconditionally without
>> the work having been disabled previously? I'm not hugely familiar with
>> delayed workqueue implementation details.
>>
>> [ 0.087733] Modules linked in:
>> [ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
>> [ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>> [ 0.087733] RIP: 0010:enable_work+0xb5/0xc0
>> [ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
>> [ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
>> [ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
>> [ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
>> [ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
>> [ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
>> [ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
>> [ 0.087733] FS: 0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
>> [ 0.087733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
>> [ 0.087733] PKRU: 55555554
>> [ 0.087733] Call Trace:
>> [ 0.087733] <TASK>
>> [ 0.087733] ? enable_work+0xb5/0xc0
>> [ 0.087733] ? __warn.cold+0x93/0xf2
>> [ 0.087733] ? enable_work+0xb5/0xc0
>> [ 0.087733] ? report_bug+0xff/0x140
>> [ 0.087733] ? handle_bug+0x54/0x90
>> [ 0.087733] ? exc_invalid_op+0x17/0x70
>> [ 0.087733] ? asm_exc_invalid_op+0x1a/0x20
>> [ 0.087733] ? __pfx_vmstat_cpu_online+0x10/0x10
>> [ 0.087733] ? enable_work+0xb5/0xc0
>> [ 0.087733] vmstat_cpu_online+0x5c/0x70
>> [ 0.087733] cpuhp_invoke_callback+0x133/0x440
>> [ 0.087733] cpuhp_thread_fun+0x95/0x150
>> [ 0.087733] smpboot_thread_fn+0xd5/0x1d0
>> [ 0.087734] ? __pfx_smpboot_thread_fn+0x10/0x10
>> [ 0.087735] kthread+0xc8/0xf0
>> [ 0.087737] ? __pfx_kthread+0x10/0x10
>> [ 0.087738] ret_from_fork+0x2c/0x50
>> [ 0.087739] ? __pfx_kthread+0x10/0x10
>> [ 0.087740] ret_from_fork_asm+0x1a/0x30
>> [ 0.087742] </TASK>
>> [ 0.087742] ---[ end trace 0000000000000000 ]---
>>
>>
>> > ---
>> > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
>> > ---
>> > mm/vmstat.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index 4d016314a56c..0889b75cef14 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
>> > if (!node_state(cpu_to_node(cpu), N_CPU)) {
>> > node_set_state(cpu_to_node(cpu), N_CPU);
>> > }
>> > + enable_delayed_work(&per_cpu(vmstat_work, cpu));
>>
>> Probably needs to be 'if disabled' here, as this is invoked on normal
>> startup when the work won't have been disabled?
>>
>> Had a brief look at code and couldn't see how that could be done
>> however... and one would need to be careful about races... Tricky!
>>
>> >
>> > return 0;
>> > }
>> >
>> > static int vmstat_cpu_down_prep(unsigned int cpu)
>> > {
>> > - cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > return 0;
>> > }
>> >
>> > --
>> > 2.43.0
>> >
>> >
>>
>> Let me know if you need any more details, .config etc.
>>
>> I noticed this warning on a real box too (in both cases running akpm's
>> mm-unstable branch), FWIW.
>
> Thank you for the report. I was able to reproduce the warning and now
> wonder how I missed it. My oversight, apologies.
>
> In my current view, the simplest solution would be to make sure a local
> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> during boot-up. The following patch suppresses the warning:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..19ceed5d34bf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> {
> int cpu;
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> vmstat_update);
>
> + /* will be enabled on vmstat_cpu_online */
> + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> + }
> +
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> }
>
> If you think of a better solution later, please let me know. Otherwise,
> I'll submit a follow-up fix patch with the above diff.
Can't think of a better solution myself, but this fixes the issue. Thanks!
>
> Thanks.
>
> -Koichiro
C. Mitrodimas