[PATCH] workqueue: Reduce expensive locks for unbound workqueue

Posted by Wangyang Guo 1 week ago
For unbound workqueues, pwqs usually map to just a few pools. Most of
the time, pwqs are linked sequentially to the wq->pwqs list by CPU
index, and consecutive CPUs usually share the same workqueue
attributes (e.g. they belong to the same NUMA node). This makes pwqs
that share a pool cluster together in the pwq list.

In flush_workqueue_prep_pwqs(), only drop and re-acquire the pool
lock when the pool changes. This reduces the number of expensive lock
operations.
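
For illustration only (not part of the patch), here is a minimal
user-space sketch of the same batching pattern; the item/pool structs
and the pthread mutex are stand-ins for pwq/worker_pool, not the
kernel types:

#include <pthread.h>
#include <stddef.h>

/* Stand-ins for pwq/worker_pool; illustration only. */
struct pool {
	pthread_mutex_t lock;
	int work_color;
};

struct item {
	struct pool *pool;
};

/* Flush-style pass: the lock is switched only when the pool changes. */
static void set_colors(struct item *items, size_t n, int color)
{
	struct pool *current_pool = NULL;

	for (size_t i = 0; i < n; i++) {
		if (items[i].pool != current_pool) {
			if (current_pool)
				pthread_mutex_unlock(&current_pool->lock);
			current_pool = items[i].pool;
			pthread_mutex_lock(&current_pool->lock);
		}
		/* per-item work done while holding the pool lock */
		current_pool->work_color = color;
	}

	if (current_pool)
		pthread_mutex_unlock(&current_pool->lock);
}

When items that share a pool are adjacent, the loop takes one
lock/unlock pair per pool instead of one per item, which is what the
hunk below does for the wq->pwqs list.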

The performance data shows this change boosts FIO by 65x in some cases
when multiple concurrent threads write to xfs mount points with fsync.

FIO Benchmark Details
- FIO version: v3.35
- FIO Options: ioengine=libaio,iodepth=64,norandommap=1,rw=write,
  size=128M,bs=4k,fsync=1
- FIO Job Configs: 64 jobs in total writing to 4 mount points (ramdisks
  formatted as xfs file system); a sketch of a matching job file follows
  this list.
- Kernel Codebase: v6.12-rc5
- Test Platform: Xeon 8380 (2 sockets)
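
(Sketch only: the options above roughly correspond to the following
fio job file. The section names, mount-point paths, and the
16-jobs-per-mount-point split are assumptions, not taken from the
original setup.)

[global]
ioengine=libaio
iodepth=64
norandommap=1
rw=write
size=128M
bs=4k
fsync=1

[xfs0]
directory=/mnt/xfs0
numjobs=16

[xfs1]
directory=/mnt/xfs1
numjobs=16

[xfs2]
directory=/mnt/xfs2
numjobs=16

[xfs3]
directory=/mnt/xfs3
numjobs=16

Run as "fio <jobfile>" with each /mnt/xfsN backed by a ramdisk
formatted as xfs.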

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 kernel/workqueue.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9949ffad8df0..8b07576814a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3833,16 +3833,28 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
 {
 	bool wait = false;
 	struct pool_workqueue *pwq;
+	struct worker_pool *current_pool = NULL;
 
 	if (flush_color >= 0) {
 		WARN_ON_ONCE(atomic_read(&wq->nr_pwqs_to_flush));
 		atomic_set(&wq->nr_pwqs_to_flush, 1);
 	}
 
+	/*
+	 * For an unbound workqueue, pwqs map to only a few pools.
+	 * Most of the time, pwqs backed by the same pool are linked
+	 * sequentially to wq->pwqs by cpu index, so across most pwq
+	 * iterations the pool stays the same.  Only do lock/unlock
+	 * when the pool changes; this largely reduces the number of
+	 * expensive lock operations.
+	 */
 	for_each_pwq(pwq, wq) {
-		struct worker_pool *pool = pwq->pool;
-
-		raw_spin_lock_irq(&pool->lock);
+		if (current_pool != pwq->pool) {
+			if (likely(current_pool))
+				raw_spin_unlock_irq(&current_pool->lock);
+			current_pool = pwq->pool;
+			raw_spin_lock_irq(&current_pool->lock);
+		}
 
 		if (flush_color >= 0) {
 			WARN_ON_ONCE(pwq->flush_color != -1);
@@ -3859,9 +3871,11 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
 			pwq->work_color = work_color;
 		}
 
-		raw_spin_unlock_irq(&pool->lock);
 	}
 
+	if (current_pool)
+		raw_spin_unlock_irq(&current_pool->lock);
+
 	if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_pwqs_to_flush))
 		complete(&wq->first_flusher->done);
 
-- 
2.43.5
Re: [PATCH] workqueue: Reduce expensive locks for unbound workqueue
Posted by Tejun Heo 6 days, 23 hours ago
On Fri, Nov 15, 2024 at 01:49:36PM +0800, Wangyang Guo wrote:
> [...]

Applied to wq/for-6.13.

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Reduce expensive locks for unbound workqueue
Posted by Lai Jiangshan 1 week ago
On Fri, Nov 15, 2024 at 2:00 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> [...]

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

This is a problem introduced by 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues").

Before that commit, it was much less likely that two or more PWQs in
the same WQ would share the same pool. After it, this became a common
case.

I planned to make PWQs shared across different CPUs where possible,
but that patch [1] has a problem which is easy to fix. I will update
it if it is needed.

Thanks
Lai

[1] https://lore.kernel.org/lkml/20231227145143.2399-3-jiangshanlai@gmail.com/