include/linux/sched.h | 2 +- kernel/workqueue.c | 25 +++++++++++++++---------- 2 files changed, 16 insertions(+), 11 deletions(-)
The Linux kernel does not provide a way to differentiate between a kworker and a rescue kworker for user-mode. From user-mode, one can establish if a task is a kworker by testing for PF_WQ_WORKER in a specified task's flags bit mask (or bitmap) via /proc/[PID]/stat. Indeed, one can examine /proc/[PID]/stack and search for the function namely "rescuer_thread". This is only available to the root user. It can be useful to identify a rescue kworker since their CPU affinity cannot be modified and their initial CPU assignment can be safely ignored. Furthermore, a workqueue that was created with WQ_MEM_RECLAIM and WQ_SYSFS the cpumask file is not applicable to the rescue kworker. By design a rescue kworker should run anywhere. This patch series introduces PF_WQ_RESCUE_WORKER and ensures it is set and cleared appropriately and simplifies current_is_workqueue_rescuer(). Aaron Tomlin (2): workqueue: Introduce PF_WQ_RESCUE_WORKER workqueue: Simplify current_is_workqueue_rescuer() include/linux/sched.h | 2 +- kernel/workqueue.c | 25 +++++++++++++++---------- 2 files changed, 16 insertions(+), 11 deletions(-) -- 2.39.1
Hi, Just stumbled upon this series while looking into rescuers myself. :) On 29/07/23 14:53, Aaron Tomlin wrote: > The Linux kernel does not provide a way to differentiate between a > kworker and a rescue kworker for user-mode. > From user-mode, one can establish if a task is a kworker by testing for > PF_WQ_WORKER in a specified task's flags bit mask (or bitmap) via > /proc/[PID]/stat. Indeed, one can examine /proc/[PID]/stack and search > for the function namely "rescuer_thread". This is only available to the > root user. > > It can be useful to identify a rescue kworker since their CPU affinity > cannot be modified and their initial CPU assignment can be safely ignored. > Furthermore, a workqueue that was created with WQ_MEM_RECLAIM and > WQ_SYSFS the cpumask file is not applicable to the rescue kworker. > By design a rescue kworker should run anywhere. Guess this is a requirement because, if workqueue processing is stuck for some reason, getting rescuers to run on the same set of cpus workqueues have been restricted to already doesn't really have good chances of making any progress? Wonder if we still might need some sort of fail hard/warn mode in case strict isolation is in place? Or maybe we have that already? Thanks! Juri
Hello, On Mon, Dec 11, 2023 at 03:51:57PM +0100, Juri Lelli wrote: > Guess this is a requirement because, if workqueue processing is stuck > for some reason, getting rescuers to run on the same set of cpus > workqueues have been restricted to already doesn't really have good > chances of making any progress? The only problem rescuers try to solve is deadlocks caused by lack of memory, so on the cpu side, it just follows whatever worker pool it's trying to help. > Wonder if we still might need some sort of fail hard/warn mode in case > strict isolation is in place? Or maybe we have that already? For both percpu and unbound workqueues, the rescuers just follow whatever pool it's trying to help at the moment, so it shouldn't cause any surprises in terms of isolation. It just temporarily joins the already active but stuck pool. Thanks. -- tejun
Hello, Thanks for the quick reply! On 11/12/23 08:39, Tejun Heo wrote: > Hello, > > On Mon, Dec 11, 2023 at 03:51:57PM +0100, Juri Lelli wrote: > > Guess this is a requirement because, if workqueue processing is stuck > > for some reason, getting rescuers to run on the same set of cpus > > workqueues have been restricted to already doesn't really have good > > chances of making any progress? > > The only problem rescuers try to solve is deadlocks caused by lack of > memory, so on the cpu side, it just follows whatever worker pool it's trying > to help. > > > Wonder if we still might need some sort of fail hard/warn mode in case > > strict isolation is in place? Or maybe we have that already? > > For both percpu and unbound workqueues, the rescuers just follow whatever > pool it's trying to help at the moment, so it shouldn't cause any surprises > in terms of isolation. It just temporarily joins the already active but > stuck pool. Hummm, OK, but in terms of which CPU the rescuer is possibly woken up, how are we making sure that the wake up is always happening on housekeeping CPUs (assuming unbound workqueues have been restricted to those)? AFAICS, we have send_mayday -> wake_up_process(wq->rescuer->task) which is not affined to the workqueue cpumask it's called to rescue, so in theory can be woken up anywhere? Thanks, Juri
Hello, Juri. On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote: > Hummm, OK, but in terms of which CPU the rescuer is possibly woken up, > how are we making sure that the wake up is always happening on > housekeeping CPUs (assuming unbound workqueues have been restricted to > those)? > > AFAICS, we have > > send_mayday -> > wake_up_process(wq->rescuer->task) > > which is not affined to the workqueue cpumask it's called to rescue, so > in theory can be woken up anywhere? Ah, was only thinking about work item execution. Yeah, it's not following the isolation rule there and we probably should affine it as we're waking it up. Thanks. -- tejun
On 12/12/23 07:14, Tejun Heo wrote:
> Hello, Juri.
>
> On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote:
> > Hummm, OK, but in terms of which CPU the rescuer is possibly woken up,
> > how are we making sure that the wake up is always happening on
> > housekeeping CPUs (assuming unbound workqueues have been restricted to
> > those)?
> >
> > AFAICS, we have
> >
> > send_mayday ->
> > wake_up_process(wq->rescuer->task)
> >
> > which is not affined to the workqueue cpumask it's called to rescue, so
> > in theory can be woken up anywhere?
>
> Ah, was only thinking about work item execution. Yeah, it's not following
> the isolation rule there and we probably should affine it as we're waking it
> up.
Something like the following then maybe?
---
kernel/workqueue.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2989b57e154a7..ed73f7f80d57d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
link_pwq(ctx->dfl_pwq);
swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);
+ /* rescuer needs to respect wq cpumask changes */
+ if (ctx->wq->rescuer) {
+ kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
+ wake_up_process(ctx->wq->rescuer->task);
+ }
+
mutex_unlock(&ctx->wq->mutex);
}
Hello,
On Wed, Dec 13, 2023 at 09:59:42AM +0100, Juri Lelli wrote:
> Something like the following then maybe?
>
> ---
> kernel/workqueue.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2989b57e154a7..ed73f7f80d57d 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
> link_pwq(ctx->dfl_pwq);
> swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);
>
> + /* rescuer needs to respect wq cpumask changes */
> + if (ctx->wq->rescuer) {
> + kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
> + wake_up_process(ctx->wq->rescuer->task);
> + }
> +
> mutex_unlock(&ctx->wq->mutex);
> }
I'm not sure kthread_bind_mask() would be safe here. The rescuer might be
running a work item. wait_task_inactive() might fail and we don't want to
change cpumask while the rescuer is active anyway.
Maybe the easiest way to do this is making rescuer_thread() restore the wq's
cpumask right before going to sleep, and making apply_wqattrs_commit() just
wake up the rescuer.
Thanks.
--
tejun
On 13/12/23 05:35, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 13, 2023 at 09:59:42AM +0100, Juri Lelli wrote:
> > Something like the following then maybe?
> >
> > ---
> > kernel/workqueue.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 2989b57e154a7..ed73f7f80d57d 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
> > link_pwq(ctx->dfl_pwq);
> > swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);
> >
> > + /* rescuer needs to respect wq cpumask changes */
> > + if (ctx->wq->rescuer) {
> > + kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
> > + wake_up_process(ctx->wq->rescuer->task);
> > + }
> > +
> > mutex_unlock(&ctx->wq->mutex);
> > }
>
> I'm not sure kthread_bind_mask() would be safe here. The rescuer might be
> running a work item. wait_task_inactive() might fail and we don't want to
> change cpumask while the rescuer is active anyway.
>
> Maybe the easiest way to do this is making rescuer_thread() restore the wq's
> cpumask right before going to sleep, and making apply_wqattrs_commit() just
> wake up the rescuer.
Hummm, don't think we can call that either while the rescuer is actually
running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/
in the above?
Thanks,
Juri
On Wed, Dec 13, 2023 at 07:32:10PM +0100, Juri Lelli wrote: > > Maybe the easiest way to do this is making rescuer_thread() restore the wq's > > cpumask right before going to sleep, and making apply_wqattrs_commit() just > > wake up the rescuer. > > Hummm, don't think we can call that either while the rescuer is actually > running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/ > in the above? So, we have to use set_cpus_allowed_ptr() but we still don't want to change the affinity of a rescuer which is already running a task for a pool. Thanks. -- tejun
On 13/12/23 08:38, Tejun Heo wrote: > On Wed, Dec 13, 2023 at 07:32:10PM +0100, Juri Lelli wrote: > > > Maybe the easiest way to do this is making rescuer_thread() restore the wq's > > > cpumask right before going to sleep, and making apply_wqattrs_commit() just > > > wake up the rescuer. > > > > Hummm, don't think we can call that either while the rescuer is actually > > running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/ > > in the above? > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change > the affinity of a rescuer which is already running a task for a pool. But then, even today, a rescuer might keep handling work on a cpu outside its wq cpumask if the associated wq cpumask change can proceed w/o waiting for it to finish the iteration? BTW, apologies for all the questions, but I'd like to make sure I can get the implications hopefully right. :) Thanks, Juri
Hello, On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote: > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change > > the affinity of a rescuer which is already running a task for a pool. > > But then, even today, a rescuer might keep handling work on a cpu > outside its wq cpumask if the associated wq cpumask change can proceed > w/o waiting for it to finish the iteration? Yeah, that can happen and pool cpumasks naturally being subsets of the wq's cpumask that they're serving, your original approach likely isn't broken either. > BTW, apologies for all the questions, but I'd like to make sure I can > get the implications hopefully right. :) I obviously haven't thought through it very well, so thanks for the questions. So, yeah, I think we actually need to set the rescuer's cpumask when wq's cpumask changes and doing it where you were suggesting should probably work. Thanks. -- tejun
On 14/12/23 09:47, Tejun Heo wrote: > Hello, > > On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote: > > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change > > > the affinity of a rescuer which is already running a task for a pool. > > > > But then, even today, a rescuer might keep handling work on a cpu > > outside its wq cpumask if the associated wq cpumask change can proceed > > w/o waiting for it to finish the iteration? > > Yeah, that can happen and pool cpumasks naturally being subsets of the wq's > cpumask that they're serving, your original approach likely isn't broken > either. > > > BTW, apologies for all the questions, but I'd like to make sure I can > > get the implications hopefully right. :) > > I obviously haven't thought through it very well, so thanks for the > questions. So, yeah, I think we actually need to set the rescuer's cpumask > when wq's cpumask changes and doing it where you were suggesting should > probably work. OK. Going to send a proper patch asap. Thanks! Juri
Hello again,
On 15/12/23 07:50, Juri Lelli wrote:
> On 14/12/23 09:47, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote:
> > > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change
> > > > the affinity of a rescuer which is already running a task for a pool.
> > >
> > > But then, even today, a rescuer might keep handling work on a cpu
> > > outside its wq cpumask if the associated wq cpumask change can proceed
> > > w/o waiting for it to finish the iteration?
> >
> > Yeah, that can happen and pool cpumasks naturally being subsets of the wq's
> > cpumask that they're serving, your original approach likely isn't broken
> > either.
> >
> > > BTW, apologies for all the questions, but I'd like to make sure I can
> > > get the implications hopefully right. :)
> >
> > I obviously haven't thought through it very well, so thanks for the
> > questions. So, yeah, I think we actually need to set the rescuer's cpumask
> > when wq's cpumask changes and doing it where you were suggesting should
> > probably work.
>
> OK. Going to send a proper patch asap.
I actually didn't do that yet as it turns out the proposed approach
doesn't cover !WQ_SYSFS unbounded wqs. Well, I thought those should be
covered as well, since we have (initiated by echo <mask> into
/sys/devices/virtual/workqueue/cpumask)
workqueue_apply_unbound_cpumask ->
apply_wqattrs_commit
but for some reason the mask change is not reflected into rescuers
affinity.
Trying to dig deeper I went ahead and extended the recent wq_dump.py
addition with the following
---
ls/workqueue/wq_dump.py | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d0df5833f2c18..6da621989e210 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -175,3 +175,32 @@ for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(
if wq.flags & WQ_UNBOUND:
print(f' {wq.dfl_pwq.pool.id.value_():{max_pool_id_len}}', end='')
print('')
+
+print('')
+print('Workqueue -> rescuer')
+print('=====================')
+print(f'wq_unbound_cpumask={cpumask_str(wq_unbound_cpumask)}')
+print('')
+print('[ workqueue \ type unbound_cpumask rescuer pid cpumask]')
+
+for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+ print(f'{wq.name.string_().decode()[-24:]:24}', end='')
+ if wq.flags & WQ_UNBOUND:
+ if wq.flags & WQ_ORDERED:
+ print(' ordered ', end='')
+ else:
+ print(' unbound', end='')
+ if wq.unbound_attrs.affn_strict:
+ print(',S ', end='')
+ else:
+ print(' ', end='')
+ print(f' {cpumask_str(wq.unbound_attrs.cpumask):24}', end='')
+ else:
+ print(' percpu ', end='')
+ print(' ', end='')
+
+ if wq.flags & WQ_MEM_RECLAIM:
+ print(f' {wq.rescuer.task.comm.string_().decode()[-24:]:24}', end='')
+ print(f' {wq.rescuer.task.pid.value_():5}', end='')
+ print(f' {cpumask_str(wq.rescuer.task.cpus_ptr)}', end='')
+ print('')
---
which shows the following situation after an
# echo 00,00000003 > /sys/devices/virtual/workqueue/cpumask
on the system I'm testing with:
...
Workqueue -> rescuer
=====================
wq_unbound_cpumask=00000003
[ workqueue \ type unbound_cpumask rescuer pid cpumask]
events percpu
events_highpri percpu
events_long percpu
events_unbound unbound 0xffffffff 000000ff
events_freezable percpu
events_power_efficient percpu
events_freezable_power_ percpu
rcu_gp percpu kworker/R-rcu_g 4 0xffffffff 000000ff
rcu_par_gp percpu kworker/R-rcu_p 5 0xffffffff 000000ff
slub_flushwq percpu kworker/R-slub_ 6 0xffffffff 000000ff
netns ordered 0xffffffff 000000ff kworker/R-netns 7 0xffffffff 000000ff
mm_percpu_wq percpu kworker/R-mm_pe 13 0xffffffff 000000ff
cpuset_migrate_mm ordered 0xffffffff 000000ff
inet_frag_wq percpu kworker/R-inet_ 300 0xffffffff 000000ff
pm percpu
cgroup_destroy percpu
cgroup_pidlist_destroy percpu
writeback unbound 0xffffffff 000000ff kworker/R-write 308 0xffffffff 000000ff
cgwb_release percpu
cryptd percpu kworker/R-crypt 314 0xffffffff 000000ff
kintegrityd percpu kworker/R-kinte 315 0xffffffff 000000ff
kblockd percpu kworker/R-kbloc 316 0xffffffff 000000ff
kacpid percpu
kacpi_notify percpu
kacpi_hotplug ordered 0xffffffff 000000ff
kec ordered 0xffffffff 000000ff
kec_query percpu
tpm_dev_wq percpu kworker/R-tpm_d 352 0xffffffff 000000ff
usb_hub_wq percpu
md percpu kworker/R-md 353 0xffffffff 000000ff
md_misc percpu
md_bitmap unbound 0xffffffff 000000ff kworker/R-md_bi 354 0xffffffff 000000ff
edac-poller ordered 0xffffffff 000000ff kworker/R-edac- 355 0xffffffff 000000ff
...
I guess I expected wq_unbound_cpumask and unbound_cpumask for each
unbound wq to be kept in sync, so I'm evidently missing details. :)
Can you please help me here understanding what am I missing?
Thanks!
Juri
On Tue, Dec 12, 2023 at 07:14:48AM -1000, Tejun Heo wrote: > Hello, Juri. > > On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote: > > Hummm, OK, but in terms of which CPU the rescuer is possibly woken up, > > how are we making sure that the wake up is always happening on > > housekeeping CPUs (assuming unbound workqueues have been restricted to > > those)? > > > > AFAICS, we have > > > > send_mayday -> > > wake_up_process(wq->rescuer->task) > > > > which is not affined to the workqueue cpumask it's called to rescue, so > > in theory can be woken up anywhere? > > Ah, was only thinking about work item execution. Yeah, it's not following > the isolation rule there and we probably should affine it as we're waking it > up. Hi Tejun, I am confused. I thought by design we want a rescuer kthread to execute on any CPU, no? Kind regards, -- Aaron Tomlin
On Tue, Dec 12, 2023 at 07:06:48PM +0000, Aaron Tomlin wrote: > I thought by design we want a rescuer kthread to execute on any CPU, no? Well, it needs to be able to move around because it dynamically attaches to the worker pool it's rescuing and needs to take on its cpumask, but it doesn't have to be able to run on all cpus all the time. Thanks. -- tejun
Hello, On Sat, Jul 29, 2023 at 02:53:32PM +0100, Aaron Tomlin wrote: > It can be useful to identify a rescue kworker since their CPU affinity > cannot be modified and their initial CPU assignment can be safely ignored. You really shouldn't be setting affinities on kworkers manually. There's no way of knowing which kworker is going to execute which workqueue. Please use the attributes API and sysfs interface to modify per-workqueue worker attributes. If that's not sufficient and you need finer grained control, the right thing to do is using kthread_worker which gives you a dedicated kthread that you can manipulate as appropriate. Thanks. -- tejun
> You really shouldn't be setting affinities on kworkers manually. There's > no way of knowing which kworker is going to execute which workqueue. > Please use the attributes API and sysfs interface to modify per-workqueue > worker attributes. If that's not sufficient and you need finer grained > control, the right thing to do is using kthread_worker which gives you a > dedicated kthread that you can manipulate as appropriate. Hi Tejun, I completely agree. Each kworker has PF_NO_SETAFFINITY applied anyway. If I understand correctly, only an unbound kworker can have their CPU affinity modified via sysfs. The objective of this series was to easily identify a rescuer kworker from user-mode. Kind regards, -- Aaron Tomlin
Hello, On Tue, Aug 01, 2023 at 11:53:01AM +0100, Aaron Tomlin wrote: > > You really shouldn't be setting affinities on kworkers manually. There's > > no way of knowing which kworker is going to execute which workqueue. > > Please use the attributes API and sysfs interface to modify per-workqueue > > worker attributes. If that's not sufficient and you need finer grained > > control, the right thing to do is using kthread_worker which gives you a > > dedicated kthread that you can manipulate as appropriate. > > I completely agree. Each kworker has PF_NO_SETAFFINITY applied anyway. > If I understand correctly, only an unbound kworker can have their CPU > affinity modified via sysfs. The objective of this series was to easily > identify a rescuer kworker from user-mode. But why do you need to identify rescue workers? What are you trying to achieve? Thanks. -- tejun
> But why do you need to identify rescue workers? What are you trying to > achieve? Hi Tejun, I had a conversation with a colleague of mine. It can be useful to identify and account for all kernel threads. From the perspective of user-mode, the name given currently to the rescuer kworker is ambiguous. For instance, "kworker/u16:9-kcryptd/253:0" is clearly identifiable as an unbound kworker for the specified workqueue which can have their CPU affinity adjusted as you mentioned before. I think if we followed the same naming convention for a rescuer kworker then it would be more consistent. I'll send a patch so it can be discussed further. Kind regards, -- Aaron Tomlin
On Thu, Aug 03, 2023 at 09:19:14PM +0100, Aaron Tomlin wrote: > > But why do you need to identify rescue workers? What are you trying to > > achieve? > > Hi Tejun, > > I had a conversation with a colleague of mine. It can be useful to identify > and account for all kernel threads. From the perspective of user-mode, the > name given currently to the rescuer kworker is ambiguous. For instance, > "kworker/u16:9-kcryptd/253:0" is clearly identifiable as an unbound kworker > for the specified workqueue which can have their CPU affinity adjusted as Note that the name changes to the work item the worker is currently executing. It won't stay that way. Workers are shared across the workqueues, so I'm not sure "identify and account all kernel threads" is working as well as you think it is. > you mentioned before. I think if we followed the same naming convention > for a rescuer kworker then it would be more consistent. I'll send a patch > so it can be discussed further. We can certainly rename them to indicate that they are rescuers - e.g. maybe krescuer? But, at the moment, the proposed reason seems rather dubious. Thanks. -- tejun
> Note that the name changes to the work item the worker is currently > executing. It won't stay that way. Workers are shared across the > workqueues, so I'm not sure "identify and account all kernel threads" is > working as well as you think it is. Hi Tejun, Indeed. The point is that these kworker kthreads are easily identifiable. > We can certainly rename them to indicate that they are rescuers - e.g. > maybe krescuer? But, at the moment, the proposed reason seems rather > dubious. Personally, I would prefer "kworker/r-%s" and then include the specified workqueue's name e.g. "kworker/r-ext4-rsv-conver". So the rescuer task's name is more consistent with the current naming scheme. I will send a follow up patch. Kind regards, -- Aaron Tomlin
© 2016 - 2026 Red Hat, Inc.