[PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.

Sebastian Andrzej Siewior posted 1 patch 9 months, 3 weeks ago
If the user does not specify a CPU when enqueuing a work item,
WORK_CPU_UNBOUND is passed. In this case, for WQ_UNBOUND workqueues a
CPU is selected based on wq_unbound_cpumask, preferring the local CPU.
For !WQ_UNBOUND the local CPU is always selected.
On a NOHZ_FULL system with isolated CPUs, wq_unbound_cpumask is set to
the non-isolated (housekeeping) CPUs. This leads to different behaviour
if a work item is scheduled from an isolated CPU, where
	schedule_delayed_work(, 1);

will move the timer to the housekeeping CPU and then schedule the work
there (on the housekeeping CPU), while
	schedule_delayed_work(, 0);

will schedule the work item on the isolated CPU.

The documentation says WQ_UNBOUND prefers the local CPU. It can only
prefer the local CPU if it is part of wq_unbound_cpumask.

Restrict WORK_CPU_UNBOUND to wq_unbound_cpumask via
wq_select_unbound_cpu().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/workqueue.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bfe030b443e27..134d9550538aa 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2261,12 +2261,8 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	rcu_read_lock();
 retry:
 	/* pwq which will be used unless @work is executing elsewhere */
-	if (req_cpu == WORK_CPU_UNBOUND) {
-		if (wq->flags & WQ_UNBOUND)
-			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
-		else
-			cpu = raw_smp_processor_id();
-	}
+	if (req_cpu == WORK_CPU_UNBOUND)
+		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
 
 	pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
 	pool = pwq->pool;
-- 
2.47.2
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Frederic Weisbecker 9 months, 3 weeks ago
On Fri, Feb 21, 2025 at 12:20:03PM +0100, Sebastian Andrzej Siewior wrote:
> If the user did not specify a CPU while enqueuing a work item then
> WORK_CPU_UNBOUND is passed. In this case, for WQ_UNBOUND a CPU is
> selected based on wq_unbound_cpumask while the local CPU is preferred.
> For !WQ_UNBOUND the local CPU is selected.
> For NOHZ_FULL system with isolated CPU wq_unbound_cpumask is set to the
> not isolated (housekeeping) CPUs. This leads to different behaviour if a
> work item is scheduled on an isolated CPU where
> 	schedule_delayed_work(, 1);
> 
> will move the timer to the housekeeping CPU and then schedule the work
> there (on the housekeeping CPU) while
> 	schedule_delayed_work(, 0);
> 
> will schedule the work item on the isolated CPU.
> 
> The documentation says WQ_UNBOUND prefers the local CPU. It can
> prefer the local CPU if it is part of wq_unbound_cpumask.
> 
> Restrict WORK_CPU_UNBOUND to wq_unbound_cpumask via
> wq_select_unbound_cpu().
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I really would like to have this patch in. I considered
doing that a few months ago but got sort of discouraged by the
lack of properly defined semantics for schedule_work(). And that
function has too many users to check their locality assumptions.

Its header advertises queueing on the global workqueue, but the
target is system_wq and not system_unbound_wq. And then it uses
WORK_CPU_UNBOUND through queue_work().

I'm tempted to just assume that none of its users depend on the
work locality?

Thanks.

> ---
>  kernel/workqueue.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index bfe030b443e27..134d9550538aa 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2261,12 +2261,8 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
>  	rcu_read_lock();
>  retry:
>  	/* pwq which will be used unless @work is executing elsewhere */
> -	if (req_cpu == WORK_CPU_UNBOUND) {
> -		if (wq->flags & WQ_UNBOUND)
> -			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
> -		else
> -			cpu = raw_smp_processor_id();
> -	}
> +	if (req_cpu == WORK_CPU_UNBOUND)
> +		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
>  
>  	pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
>  	pool = pwq->pool;
> -- 
> 2.47.2
> 
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Tejun Heo 9 months, 3 weeks ago
Hello,

On Fri, Feb 21, 2025 at 03:49:18PM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 21, 2025 at 12:20:03PM +0100, Sebastian Andrzej Siewior wrote:
> > If the user did not specify a CPU while enqueuing a work item then
> > WORK_CPU_UNBOUND is passed. In this case, for WQ_UNBOUND a CPU is
> > selected based on wq_unbound_cpumask while the local CPU is preferred.
> > For !WQ_UNBOUND the local CPU is selected.
> > For NOHZ_FULL system with isolated CPU wq_unbound_cpumask is set to the
> > not isolated (housekeeping) CPUs. This leads to different behaviour if a
> > work item is scheduled on an isolated CPU where
> > 	schedule_delayed_work(, 1);
> > 
> > will move the timer to the housekeeping CPU and then schedule the work
> > there (on the housekeeping CPU) while
> > 	schedule_delayed_work(, 0);
> > 
> > will schedule the work item on the isolated CPU.
> > 
> > The documentation says WQ_UNBOUND prefers the local CPU. It can
> > prefer the local CPU if it is part of wq_unbound_cpumask.
> > 
> > Restrict WORK_CPU_UNBOUND to wq_unbound_cpumask via
> > wq_select_unbound_cpu().
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> I really would like to have this patch in. I have considered
> doing that a few month ago but got sort-of discouraged by the
> lack of properly defined semantics for schedule_work(). And that
> function has too many users to check their locality assumptions.
> 
> Its headers advertize to queue in global workqueue but the target
> is system_wq and not system_unbound_wq. But then it's using
> WORK_CPU_UNBOUND through queue_work().
> 
> I'm tempted to just assume that none of its users depend on the
> work locality?

That's an API guarantee and there are plenty of users who depend on
queue_work() and schedule_work() on per-cpu workqueues to actually be
per-cpu. I don't think we can pull the rug out from under them. If we
want to do this, which I think is a good idea, we should:

1. Convert per-cpu workqueue users to unbound workqueues. Most users don't
   care whether work item is executed locally or not. However, historically,
   we've been preferring per-cpu workqueues because unbound workqueues had a
   lot worse locality properties. Unbound workqueue's topology awareness is
   a lot better now, so this should be less of a problem and we should be
   able to move a lot of users over to unbound workqueues.

2. There still are cases where local execution isn't required for
   correctness but local & concurrency controlled executions yield
   performance gains. Workqueue API currently doesn't distinguish these two
   cases. We should add a new API which prefers local execution but doesn't
   require it, which can then do what's suggested in this patch.

Unfortunately, I don't see a way forward without auditing and converting the
users.

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Sebastian Andrzej Siewior 9 months, 3 weeks ago
On 2025-02-21 06:48:16 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> > I'm tempted to just assume that none of its users depend on the
> > work locality?
> 
> That's API guarantee and there are plenty of users who depend on
> queue_work() and schedule_work() on per-cpu workqueues to be actually
> per-cpu. 

You mean queue_work(), not queue_work_on()?
Even with the latter you need to ensure the CPU does not go away, and
hardly anyone does this.

>          I don't think we can pull the rug from under them. If we want to do
> this, which I think is a good idea, we should:
>
> 1. Convert per-cpu workqueue users to unbound workqueues. Most users don't
>    care whether work item is executed locally or not. However, historically,
>    we've been preferring per-cpu workqueues because unbound workqueues had a
>    lot worse locality properties. Unbound workqueue's topology awareness is
>    a lot better now, so this should be less of a problem and we should be
>    able to move a lot of users over to unbound workqueues.

You mean converting each schedule_work() to a schedule_unbound_work()
which uses system_unbound_wq instead?

I would really like to make it the default because otherwise most people
will stick to the old function and the conversion never ends.

Maybe I misunderstood you.

> 2. There still are cases where local execution isn't required for
>    correctness but local & concurrency controlled executions yield
>    performance gains. Workqueue API currently doesn't distinguish these two
>    cases. We should add a new API which prefers local execution but doesn't
>    require it, which can then do what's suggested in this patch.

I see. So we get rid of the old naming and have them something like
	schedule_bound_work()
	schedule_unbound_work()
	schedule_hopefully_local_work()

? The last one would attempt the local CPU for performance reasons
unless the CPU is not part of the workqueue's cpumask. So the difference
is that the middle one would be queued on a WQ_UNBOUND workqueue while
the latter might be queued on a different CPU but on a WQ without
WQ_UNBOUND. Both would respect the workqueue's cpumask.

> Unfortunately, I don't see a way forward without auditing and converting the
> users.

So I tried to play the "WORK_CPU_UNBOUND has 'unbound' in its name" card
and the "the comment says prefer local CPU" card.
We already have different behaviour with queue_delayed_work(,,0) vs
queue_delayed_work(,,!0), but this does not count here?
I don't insist on doing this always, just when the CPU is "isolated", as
in not part of the workqueue's cpumask. But then this path probably gets
less testing, so it might not be a good idea if something relies on it
but does not know…

> Thanks.

Sebastian
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Tejun Heo 9 months, 3 weeks ago
Hello, Sebastian.

On Wed, Feb 26, 2025 at 05:18:47PM +0100, Sebastian Andrzej Siewior wrote:
> > That's API guarantee and there are plenty of users who depend on
> > queue_work() and schedule_work() on per-cpu workqueues to be actually
> > per-cpu. 
> 
> You mean queue_work(), not queue_work_on()?
> Even with the latter you need to ensure the CPU does not go away and
> hardly someone does this.

All variants that use per-cpu workqueues.

> >          I don't think we can pull the rug from under them. If we want to do
> > this, which I think is a good idea, we should:
> >
> > 1. Convert per-cpu workqueue users to unbound workqueues. Most users don't
> >    care whether work item is executed locally or not. However, historically,
> >    we've been preferring per-cpu workqueues because unbound workqueues had a
> >    lot worse locality properties. Unbound workqueue's topology awareness is
> >    a lot better now, so this should be less of a problem and we should be
> >    able to move a lot of users over to unbound workqueues.
> 
> you mean convert each schedule_work() to schedule_unbound_work() which
> uses system_unbound_wq instead?

Yes and adding WQ_UNBOUND to custom per-cpu workqueues.

> I would really like to make it default because otherwise most people
> will stick to the old function and the "convert" is never ending.

We can rename APIs too. It's going to be a slog, but to be fair it's
going to be one no matter what.

> > 2. There still are cases where local execution isn't required for
> >    correctness but local & concurrency controlled executions yield
> >    performance gains. Workqueue API currently doesn't distinguish these two
> >    cases. We should add a new API which prefers local execution but doesn't
> >    require it, which can then do what's suggested in this patch.
> 
> I see. So we get rid of the old naming and have them something like
> 	schedule_bound_work()
> 	schedule_unbound_work()
> 	schedule_hopefully_local_work()

If we're renaming, I'd deprecate the schedule_*() interface and always use
queue_*() and maybe:

- Replace WQ_UNBOUND with its complement WQ_PERCPU.
- Add WQ_PREFER_PERCPU.
- Rename system_wq -> system_percpu_wq.
- Add system_prefer_percpu_wq.
- Rename system_unbound_wq -> system_dfl_wq.

> ? The last one would attempt the local CPU for performance reasons
> unless the CPU is not part the workqueue' cpumask. So the difference is
> that the middle one would be queued on WQ_UNBOUND while the latter might
> be queued on a different CPU but on WQ without WQ_UNBOUND. Both would
> respect workqueue' cpumask.

I'm not sure allowing queueing of a per-cpu work item on a different CPU
would be all that useful. Might as well just turn them into unbound
workqueues when necessary.

> > Unfortunately, I don't see a way forward without auditing and converting the
> > users.
> 
> So tried to pull the "in WORK_CPU_UNBOUND the has unbound" card and
> comment where it says "prefer local CPU" card.
> We have already different behaviour with queue_delayed_work(,,0) vs
> queue_delayed_work(,,!0) but this does not count here?
> I don't insist in doing this always, just if the CPU is "isolated" as in
> not part of the workmask. But then, this path gets probably less testing
> so it might be not a good idea if something relies on but does not know…

It's a correctness issue. We can't change the guarantees without
auditing and transitioning the users.

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Frederic Weisbecker 8 months, 2 weeks ago
(Adding Marco who may help us in this long adventure)

On Wed, Feb 26, 2025 at 06:44:46AM -1000, Tejun Heo wrote:
> We can rename APIs too and it's going to be a slog, which, to be fair, this
> is going to be no matter what.
> 
> > > 2. There still are cases where local execution isn't required for
> > >    correctness but local & concurrency controlled executions yield
> > >    performance gains. Workqueue API currently doesn't distinguish these two
> > >    cases. We should add a new API which prefers local execution but doesn't
> > >    require it, which can then do what's suggested in this patch.
> > 
> > I see. So we get rid of the old naming and have them something like
> > 	schedule_bound_work()
> > 	schedule_unbound_work()
> > 	schedule_hopefully_local_work()
> 
> If we're renaming, I'd deprecate the schedule_*() interface and always use
> queue_*() and maybe:
> 
> - Replace WQ_UNBOUND with its complement WQ_PERCPU.

This one scares us a bit. Currently the default for alloc_workqueue() is
WQ_PERCPU behaviour. After that change the default will be the reverse.
Even if this happens as a single treewide patch, there will be conflicts
in the merge window with new users coming up that will happen to be
unbound even when they were not intended to be.

But there is a way out of that if we are patient:

1) For one release, do a treewide change that introduces WQ_PERCPU and
   applies it to all relevant users. Keep WQ_UNBOUND around for now and
   warn if neither WQ_PERCPU nor WQ_UNBOUND has been passed (this, plus
   grep/coccinelle, will catch missed patches from other trees after the
   merge window).

2) Once that is complete, remove WQ_UNBOUND and its uses in the next
   release.

How does that sound?

> - Add WQ_PREFER_PERCPU.

This can be done afterward case by case.

> - Rename system_wq -> system_percpu_wq.
> - Add system_prefer_percpu_wq.
> - Rename system_unbound_wq -> system_dfl_wq.

Easy steps.

Thanks!
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Tejun Heo 8 months, 2 weeks ago
Hello, Frederic.

On Wed, Apr 02, 2025 at 06:40:47PM +0200, Frederic Weisbecker wrote:
> > If we're renaming, I'd deprecate the schedule_*() interface and always use
> > queue_*() and maybe:
> > 
> > - Replace WQ_UNBOUND with its complement WQ_PERCPU.
> 
> This one scares us a bit. Currently the default for alloc_workqueue() is
> WQ_PERCPU. After that the default will be the reverse. Even if this
> happens as a single patch treewide change, there will be conflicts
> in the merge window with new users coming up that will happen to be
> unbound whenever they were not intended to.
> 
> But there is a way out of that if we are patient:
> 
> 1) For one release, do a treewide change that introduces WQ_PERCPU and apply
>    it to all relevant users. Keep WQ_UNBOUND around for now and warn if neither
>    WQ_PERCPU nor WQ_UNBOUND has been passed (this and grep/coccinelle will catch
>    missed patches from other trees after the merge window).
> 
> 2) Once that complete, remove WQ_UNBOUND and its uses on the next release.
> 
> How does that sound?

Yeah, that sounds way safer.

> > - Add WQ_PREFER_PERCPU.
> 
> This can be done afterward case by case.
> 
> > - Rename system_wq -> system_percpu_wq.
> > - Add system_prefer_percpu_wq.
> > - Rename system_unbound_wq -> system_dfl_wq.
> 
> Easy steps.

Sounds great to me.

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Frederic Weisbecker 9 months, 3 weeks ago
On Fri, Feb 21, 2025 at 06:48:16AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 21, 2025 at 03:49:18PM +0100, Frederic Weisbecker wrote:
> > On Fri, Feb 21, 2025 at 12:20:03PM +0100, Sebastian Andrzej Siewior wrote:
> > > If the user did not specify a CPU while enqueuing a work item then
> > > WORK_CPU_UNBOUND is passed. In this case, for WQ_UNBOUND a CPU is
> > > selected based on wq_unbound_cpumask while the local CPU is preferred.
> > > For !WQ_UNBOUND the local CPU is selected.
> > > For NOHZ_FULL system with isolated CPU wq_unbound_cpumask is set to the
> > > not isolated (housekeeping) CPUs. This leads to different behaviour if a
> > > work item is scheduled on an isolated CPU where
> > > 	schedule_delayed_work(, 1);
> > > 
> > > will move the timer to the housekeeping CPU and then schedule the work
> > > there (on the housekeeping CPU) while
> > > 	schedule_delayed_work(, 0);
> > > 
> > > will schedule the work item on the isolated CPU.
> > > 
> > > The documentation says WQ_UNBOUND prefers the local CPU. It can
> > > prefer the local CPU if it is part of wq_unbound_cpumask.
> > > 
> > > Restrict WORK_CPU_UNBOUND to wq_unbound_cpumask via
> > > wq_select_unbound_cpu().
> > > 
> > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > I really would like to have this patch in. I have considered
> > doing that a few month ago but got sort-of discouraged by the
> > lack of properly defined semantics for schedule_work(). And that
> > function has too many users to check their locality assumptions.
> > 
> > Its headers advertize to queue in global workqueue but the target
> > is system_wq and not system_unbound_wq. But then it's using
> > WORK_CPU_UNBOUND through queue_work().
> > 
> > I'm tempted to just assume that none of its users depend on the
> > work locality?
> 
> That's API guarantee and there are plenty of users who depend on
> queue_work() and schedule_work() on per-cpu workqueues to be actually
> per-cpu. I don't think we can pull the rug from under them. If we want to do
> this, which I think is a good idea, we should:
> 
> 1. Convert per-cpu workqueue users to unbound workqueues. Most users don't
>    care whether work item is executed locally or not. However, historically,
>    we've been preferring per-cpu workqueues because unbound workqueues had a
>    lot worse locality properties. Unbound workqueue's topology awareness is
>    a lot better now, so this should be less of a problem and we should be
>    able to move a lot of users over to unbound workqueues.

But we must check those ~1951 schedule_work() users one by one to make sure they
don't rely on locality for correctness, right? :-)

> 
> 2. There still are cases where local execution isn't required for
>    correctness but local & concurrency controlled executions yield
>    performance gains. Workqueue API currently doesn't distinguish these two
>    cases. We should add a new API which prefers local execution but doesn't
>    require it, which can then do what's suggested in this patch.

That is much trickier to find out and requires knowing the subsystem
details and history.

For those that don't rely on locality for correctness, we would really
like to be able to offload them to an unbound pool, at least when
nohz_full= is set, because in that case we don't care much about
workqueue performance.

> 
> Unfortunately, I don't see a way forward without auditing and converting the
> users.

I see...

Thanks.

> 
> Thanks.
> 
> -- 
> tejun
Re: [PATCH] workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND.
Posted by Tejun Heo 9 months, 3 weeks ago
Hello, Frederic.

On Wed, Feb 26, 2025 at 04:02:19PM +0100, Frederic Weisbecker wrote:
...
> > That's API guarantee and there are plenty of users who depend on
> > queue_work() and schedule_work() on per-cpu workqueues to be actually
> > per-cpu. I don't think we can pull the rug from under them. If we want to do
> > this, which I think is a good idea, we should:
> > 
> > 1. Convert per-cpu workqueue users to unbound workqueues. Most users don't
> >    care whether work item is executed locally or not. However, historically,
> >    we've been preferring per-cpu workqueues because unbound workqueues had a
> >    lot worse locality properties. Unbound workqueue's topology awareness is
> >    a lot better now, so this should be less of a problem and we should be
> >    able to move a lot of users over to unbound workqueues.
> 
> But we must check those ~1951 schedule_work() users one by one to make sure they
> don't rely on locality for correctness, right? :-)

Yes, no matter what we do, there is no way around that.

> > 2. There still are cases where local execution isn't required for
> >    correctness but local & concurrency controlled executions yield
> >    performance gains. Workqueue API currently doesn't distinguish these two
> >    cases. We should add a new API which prefers local execution but doesn't
> >    require it, which can then do what's suggested in this patch.
> 
> That is much trickier to find out and requires to know about the subsystem
> details and history.

One good thing is that for workqueues that actually should be per-CPU
for performance, there usually is a group of people, often including the
maintainers, who are familiar with the performance situation and will
pipe up, so it's not *that* hopeless.

> For those that don't rely on locality for correctness, we would really like
> to be able to offload them to unbound pool at least when nohz_full= is filled.
> Because in that case we don't care much on workqueues performance.

Yeah, that makes sense to me.

Thanks.

-- 
tejun