[PATCH V2] workqueue: Process rescuer work items one-by-one using a positional marker

Lai Jiangshan posted 1 patch 1 week, 6 days ago
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.

Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.

Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
normal worker threads a chance to process the remaining items instead
of leaving them waiting on the rescuer's queue, and prevents one
blocking work item from stalling the rest once memory pressure is
relieved.

Introduce a positional dummy work item to avoid potentially O(N^2)
rescans of the work list.  The marker records the resume position for
the next scan, eliminating redundant traversals.
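The idea can be modeled in userspace with a plain doubly-linked list (illustrative only, not the kernel code or APIs): a dummy "cursor" node is parked in the list right after each match, so the next scan resumes from there instead of rescanning from the head.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the positional-marker ("cursor") technique.  Names
 * loosely mirror the patch; this is NOT kernel code. */

struct node {
	struct node *prev, *next;
	int pwq_id;			/* owner id; CURSOR_ID marks the cursor */
};

#define CURSOR_ID -1

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* insert @n before @pos (kernel list_add_tail(@n, @pos) semantics) */
static void list_add_tail_(struct node *n, struct node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

static void list_del_init_(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;		/* detached: points to itself */
}

static int detached(struct node *n) { return n->next == n; }

/*
 * Find, remove and return the next node owned by @id, resuming from the
 * cursor's recorded position; re-park the cursor just after the match so
 * the following call continues from there.  Returns NULL when none left.
 */
static struct node *take_next(struct node *head, struct node *cursor, int id)
{
	struct node *w = detached(cursor) ? head->next : cursor->next;

	if (!detached(cursor))
		list_del_init_(cursor);

	for (; w != head; w = w->next) {
		if (w->pwq_id == id) {
			struct node *n = w->next;

			list_del_init_(w);		/* "assign" the work */
			list_add_tail_(cursor, n);	/* remember position */
			return w;
		}
	}
	return NULL;
}
```

Each call does only the traversal between two consecutive matches, so N one-at-a-time lookups cost O(N) in total rather than O(N^2).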

Processing items one-by-one has additional benefits:
1) If the pwq no longer needs rescue (normal workers have been created
   or have become available), the rescuer can move on to other stalled
   pwqs immediately.

2) It also naturally handles newly created work items during rescue.
   For example, work items dispatched via pwq_activate_first_inactive()
   or chained queueing will be picked up in subsequent rescuer
   iterations without special handling.
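Benefit (2) can be seen in a tiny userspace model (all names illustrative, not kernel APIs): items queued while another item is executing are found by the next one-at-a-time iteration, with no special handling.

```c
#include <assert.h>

/* Toy model of benefit (2): work queued *during* processing (chained
 * queueing) is picked up by later one-at-a-time iterations.  This is
 * NOT kernel code. */

#define MAX_ITEMS 16

static int pending[MAX_ITEMS];		/* 1 = slot holds a pending item */
static int nslots, nprocessed;

static void queue_item(void)
{
	if (nslots < MAX_ITEMS)
		pending[nslots++] = 1;
}

/* process one item; item 0 chains two more while it runs */
static void process_one(int idx)
{
	pending[idx] = 0;
	nprocessed++;
	if (idx == 0) {			/* chained queueing mid-execution */
		queue_item();
		queue_item();
	}
}

/* one-at-a-time loop: rescan after every processed item */
static void rescue_all(void)
{
	for (;;) {
		int i, found = -1;

		for (i = 0; i < nslots; i++) {
			if (pending[i]) {
				found = i;
				break;
			}
		}
		if (found < 0)
			return;
		process_one(found);
	}
}
```

A single snapshot scan taken before processing would miss the two chained items; rescanning after each item catches them naturally.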

Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
Changed from v1:

Insert the marker and use it purely as an iteration marker, as Tejun
requested.
  It is hard to maintain its proper state when it is treated as a work
  item as in V1, especially maintaining it only in pool->worklist, which
  also requires an unlikely path in the worker_thread() path.

Add an unlikely path in the worker_thread() path (in assign_work()).

Add code in other places to ensure it is not treated like a regular
work item.  I'm sure all the paths have been covered now, but I still
feel a bit nervous about it for future changes.

Extract the code that assigns work in rescuer_thread() out into
assign_rescue_work().

V1: https://lore.kernel.org/lkml/20251113163426.2950-1-jiangshanlai@gmail.com/

 kernel/workqueue.c | 106 +++++++++++++++++++++++++++++----------------
 1 file changed, 69 insertions(+), 37 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 19bc7ea931d8..29f7610ee8c5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -286,6 +286,7 @@ struct pool_workqueue {
 	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
+	struct work_struct	mayday_pos_work;/* L: position on pool->worklist */
 
 	u64			stats[PWQ_NR_STATS];
 
@@ -1126,6 +1127,12 @@ static struct worker *find_worker_executing_work(struct worker_pool *pool,
 	return NULL;
 }
 
+static void mayday_pos_func(struct work_struct *work)
+{
+	/* should not be processed, only for marking position */
+	BUG();
+}
+
 /**
  * move_linked_works - move linked works to a list
  * @work: start of series of works to be scheduled
@@ -1188,6 +1195,15 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 
 	lockdep_assert_held(&pool->lock);
 
+	/* The positional work should not be processed */
+	if (unlikely(work->func == mayday_pos_func)) {
+		/* only worker_thread() can possibly take this branch */
+		if (WARN_ON_ONCE(nextp))
+			*nextp = list_next_entry(work, entry);
+		list_del_init(&work->entry);
+		return false;
+	}
+
 	/*
 	 * A single work shouldn't be executed concurrently by multiple workers.
 	 * __queue_work() ensures that @work doesn't jump to a different pool
@@ -2991,7 +3007,7 @@ static void send_mayday(struct work_struct *work)
 		return;
 
 	/* mayday mayday mayday */
-	if (list_empty(&pwq->mayday_node)) {
+	if (list_empty(&pwq->mayday_node) && list_empty(&pwq->mayday_pos_work.entry)) {
 		/*
 		 * If @pwq is for an unbound wq, its base ref may be put at
 		 * any time due to an attribute change.  Pin @pwq until the
@@ -3001,6 +3017,8 @@ static void send_mayday(struct work_struct *work)
 		list_add_tail(&pwq->mayday_node, &wq->maydays);
 		wake_up_process(wq->rescuer->task);
 		pwq->stats[PWQ_STAT_MAYDAY]++;
+
+		list_add_tail(&pwq->mayday_pos_work.entry, &work->entry);
 	}
 }
 
@@ -3439,6 +3457,37 @@ static int worker_thread(void *__worker)
 	goto woke_up;
 }
 
+static bool assign_rescue_work(struct pool_workqueue *pwq, struct worker *rescuer)
+{
+	struct worker_pool *pool = pwq->pool;
+	struct work_struct *work, *n;
+
+	/* from where to search */
+	if (list_empty(&pwq->mayday_pos_work.entry))
+		work = list_first_entry(&pool->worklist, struct work_struct, entry);
+	else {
+		work = list_next_entry(&pwq->mayday_pos_work, entry);
+		/* It might be at a new position or not need position anymore */
+		list_del_init(&pwq->mayday_pos_work.entry);
+	}
+
+	/* need rescue? */
+	if (!need_to_create_worker(pool))
+		return false;
+
+	/* try to assign a work to rescue */
+	list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
+		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
+			pwq->stats[PWQ_STAT_RESCUED]++;
+			/* mark the position for next search */
+			list_add_tail(&pwq->mayday_pos_work.entry, &n->entry);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 /**
  * rescuer_thread - the rescuer thread function
  * @__rescuer: self
@@ -3493,7 +3542,6 @@ static int rescuer_thread(void *__rescuer)
 		struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
-		struct work_struct *work, *n;
 
 		__set_current_state(TASK_RUNNING);
 		list_del_init(&pwq->mayday_node);
@@ -3504,43 +3552,9 @@ static int rescuer_thread(void *__rescuer)
 
 		raw_spin_lock_irq(&pool->lock);
 
-		/*
-		 * Slurp in all works issued via this workqueue and
-		 * process'em.
-		 */
-		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
-		list_for_each_entry_safe(work, n, &pool->worklist, entry) {
-			if (get_work_pwq(work) == pwq &&
-			    assign_work(work, rescuer, &n))
-				pwq->stats[PWQ_STAT_RESCUED]++;
-		}
-
-		if (!list_empty(&rescuer->scheduled)) {
+		while (assign_rescue_work(pwq, rescuer))
 			process_scheduled_works(rescuer);
 
-			/*
-			 * The above execution of rescued work items could
-			 * have created more to rescue through
-			 * pwq_activate_first_inactive() or chained
-			 * queueing.  Let's put @pwq back on mayday list so
-			 * that such back-to-back work items, which may be
-			 * being used to relieve memory pressure, don't
-			 * incur MAYDAY_INTERVAL delay inbetween.
-			 */
-			if (pwq->nr_active && need_to_create_worker(pool)) {
-				raw_spin_lock(&wq_mayday_lock);
-				/*
-				 * Queue iff we aren't racing destruction
-				 * and somebody else hasn't queued it already.
-				 */
-				if (wq->rescuer && list_empty(&pwq->mayday_node)) {
-					get_pwq(pwq);
-					list_add_tail(&pwq->mayday_node, &wq->maydays);
-				}
-				raw_spin_unlock(&wq_mayday_lock);
-			}
-		}
-
 		/*
 		 * Leave this pool. Notify regular workers; otherwise, we end up
 		 * with 0 concurrency and stalling the execution.
@@ -5153,6 +5167,20 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	kthread_init_work(&pwq->release_work, pwq_release_workfn);
+
+	/*
+	 * Set up the dummy positional work with a valid function and get_work_pwq().
+	 *
+	 * The positional work should only be in the pwq->pool->worklist, and
+	 * should never be queued, processed, flushed, cancelled or even examined
+	 * as a work item.
+	 *
+	 * WORK_STRUCT_PENDING and WORK_STRUCT_INACTIVE just make it less
+	 * surprising for kernel debugging tools and reviewers.
+	 */
+	INIT_WORK(&pwq->mayday_pos_work, mayday_pos_func);
+	atomic_long_set(&pwq->mayday_pos_work.data, (unsigned long)pwq |
+			WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | WORK_STRUCT_INACTIVE);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
@@ -6300,6 +6328,8 @@ static void show_pwq(struct pool_workqueue *pwq)
 
 	list_for_each_entry(work, &pool->worklist, entry) {
 		if (get_work_pwq(work) == pwq) {
+			if (work->func == mayday_pos_func)
+				continue;
 			has_pending = true;
 			break;
 		}
@@ -6311,6 +6341,8 @@ static void show_pwq(struct pool_workqueue *pwq)
 		list_for_each_entry(work, &pool->worklist, entry) {
 			if (get_work_pwq(work) != pwq)
 				continue;
+			if (work->func == mayday_pos_func)
+				continue;
 
 			pr_cont_work(comma, work, &pcws);
 			comma = !(*work_data_bits(work) & WORK_STRUCT_LINKED);
-- 
2.19.1.6.gb485710b
Re: [PATCH V2] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Tejun Heo 1 week, 4 days ago
Hello, Lai.

On Tue, Nov 18, 2025 at 05:38:31PM +0800, Lai Jiangshan wrote:
> @@ -286,6 +286,7 @@ struct pool_workqueue {
>  	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
>  	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
>  	struct list_head	mayday_node;	/* MD: node on wq->maydays */
> +	struct work_struct	mayday_pos_work;/* L: position on pool->worklist */

Maybe mayday_cursor?

> @@ -1188,6 +1195,15 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
>  
>  	lockdep_assert_held(&pool->lock);
>  
> +	/* The positional work should not be processed */
> +	if (unlikely(work->func == mayday_pos_func)) {
> +		/* only worker_thread() can possibly take this branch */
> +		if (WARN_ON_ONCE(nextp))
> +			*nextp = list_next_entry(work, entry);

I find it confusing to conditionalize the check on @nextp as the fact that
@nextp is only not NULL for worker_thread() is rather incidental. Maybe just
do this in the caller instead?

> +static bool assign_rescue_work(struct pool_workqueue *pwq, struct worker *rescuer)
> +{
> +	struct worker_pool *pool = pwq->pool;
> +	struct work_struct *work, *n;
> +
> +	/* from where to search */
> +	if (list_empty(&pwq->mayday_pos_work.entry))
> +		work = list_first_entry(&pool->worklist, struct work_struct, entry);

Should be fully winged - if () {} else {}. Also, I wonder whether the cursor
handling can be contained on this side. ie. Why does send_mayday() need to
check whether the cursor is on the list?

> +	else {
> +		work = list_next_entry(&pwq->mayday_pos_work, entry);
> +		/* It might be at a new position or not need position anymore */
> +		list_del_init(&pwq->mayday_pos_work.entry);
> +	}
> +
> +	/* need rescue? */
> +	if (!need_to_create_worker(pool))
> +		return false;
> +
> +	/* try to assign a work to rescue */
> +	list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
> +		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
> +			pwq->stats[PWQ_STAT_RESCUED]++;
> +			/* mark the position for next search */
> +			list_add_tail(&pwq->mayday_pos_work.entry, &n->entry);
> +			return true;
> +		}
> +	}

Would splitting it into two patches make it easier to follow? ie. First
patch to factor out assign_rescuer_work(), the second one to implement
one-at-a-time operation.

>  /* sync @pwq with the current state of its associated wq and link it */
> @@ -6300,6 +6328,8 @@ static void show_pwq(struct pool_workqueue *pwq)
>  
>  	list_for_each_entry(work, &pool->worklist, entry) {
>  		if (get_work_pwq(work) == pwq) {
> +			if (work->func == mayday_pos_func)
> +				continue;

Do we need to skip these? These are debug dumps anyway. Can't we just show
them?

>  			has_pending = true;
>  			break;
>  		}
> @@ -6311,6 +6341,8 @@ static void show_pwq(struct pool_workqueue *pwq)
>  		list_for_each_entry(work, &pool->worklist, entry) {
>  			if (get_work_pwq(work) != pwq)
>  				continue;
> +			if (work->func == mayday_pos_func)
> +				continue;

Ditto.

Thanks.

-- 
tejun
Re: [PATCH V2] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Lai Jiangshan 1 week, 3 days ago
Hello Tejun,

On Fri, Nov 21, 2025 at 10:14 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Lai.
>
> On Tue, Nov 18, 2025 at 05:38:31PM +0800, Lai Jiangshan wrote:
> > @@ -286,6 +286,7 @@ struct pool_workqueue {
> >       struct list_head        pending_node;   /* LN: node on wq_node_nr_active->pending_pwqs */
> >       struct list_head        pwqs_node;      /* WR: node on wq->pwqs */
> >       struct list_head        mayday_node;    /* MD: node on wq->maydays */
> > +     struct work_struct      mayday_pos_work;/* L: position on pool->worklist */
>
> Maybe mayday_cursor?
>
> > @@ -1188,6 +1195,15 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
> >
> >       lockdep_assert_held(&pool->lock);
> >
> > +     /* The positional work should not be processed */
> > +     if (unlikely(work->func == mayday_pos_func)) {
> > +             /* only worker_thread() can possibly take this branch */
> > +             if (WARN_ON_ONCE(nextp))
> > +                     *nextp = list_next_entry(work, entry);
>
> I find it confusing to conditionalize the check on @nextp as the fact that
> @nextp is only not NULL for worker_thread() is rather incidental. Maybe just
> do this in the caller instead?

I will just remove the WARN_ON_ONCE() and the comments.

The mayday_pos_func check here is intended to make it clear that,
whether the worker is a regular worker or the rescuer, it must not move
the fake mayday_pos_work item to its ->scheduled list, since it is not
inserted by insert_work().

worker_thread() needs this check and removes the cursor.

rescuer_thread() only searches after the cursor, so it will never try
to assign itself the pwq->mayday_cursor work. But having the check is
harmless and clearer.

Or maybe:
    /* only worker_thread() can possibly take this branch */
    WARN_ON_ONCE(worker->rescue_wq);
    if (nextp)
        *nextp = list_next_entry(work, entry);

>
> > +static bool assign_rescue_work(struct pool_workqueue *pwq, struct worker *rescuer)
> > +{
> > +     struct worker_pool *pool = pwq->pool;
> > +     struct work_struct *work, *n;
> > +
> > +     /* from where to search */
> > +     if (list_empty(&pwq->mayday_pos_work.entry))
> > +             work = list_first_entry(&pool->worklist, struct work_struct, entry);
>
> Should be fully winged - if () {} else {}. Also, I wonder whether the cursor
> handling can be contained on this side. ie. Why does send_mayday() need to
> check whether the cursor is on the list?


send_mayday() has already traversed the worklist, so it seems
reasonable to insert the cursor there and avoid repeating the work. The
worklist may be long, and the first work item of a pwq may be near the
end.

If the cursor is already on the list, it means the rescuer is handling
the pwq, and not queuing mayday_node again saves an extra pool
attach/detach.

That said, since this is the rescuer's slow path, I will remove the
check/insertion in send_mayday().
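For reference, the "cursor already on the list" test works because a node
initialized with INIT_LIST_HEAD()/list_del_init() points to itself, so
list_empty() on the node doubles as an "is the cursor parked somewhere?"
flag. A minimal userspace model of that idiom (not kernel code):

```c
#include <assert.h>

/* Model of the list_empty(&node->entry) idiom: a node whose pointers
 * point back to itself is "detached", so emptiness of the node itself
 * serves as an attachment flag. */
struct lnode { struct lnode *prev, *next; };

static void lnode_init(struct lnode *n) { n->prev = n->next = n; }

static void lnode_attach(struct lnode *n, struct lnode *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void lnode_detach(struct lnode *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	lnode_init(n);			/* back to the detached state */
}

static int lnode_detached(const struct lnode *n) { return n->next == n; }
```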

>
> > +     else {
> > +             work = list_next_entry(&pwq->mayday_pos_work, entry);
> > +             /* It might be at a new position or not need position anymore */
> > +             list_del_init(&pwq->mayday_pos_work.entry);
> > +     }
> > +
> > +     /* need rescue? */
> > +     if (!need_to_create_worker(pool))
> > +             return false;
> > +
> > +     /* try to assign a work to rescue */
> > +     list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
> > +             if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
> > +                     pwq->stats[PWQ_STAT_RESCUED]++;
> > +                     /* mark the position for next search */
> > +                     list_add_tail(&pwq->mayday_pos_work.entry, &n->entry);
> > +                     return true;
> > +             }
> > +     }
>
> Would splitting it into two patches make it easier to follow? ie. First
> patch to factor out assign_rescuer_work(), the second one to implement
> one-at-a-time operation.

will do.

>
> >  /* sync @pwq with the current state of its associated wq and link it */
> > @@ -6300,6 +6328,8 @@ static void show_pwq(struct pool_workqueue *pwq)
> >
> >       list_for_each_entry(work, &pool->worklist, entry) {
> >               if (get_work_pwq(work) == pwq) {
> > +                     if (work->func == mayday_pos_func)
> > +                             continue;
>
> Do we need to skip these? These are debug dumps anyway. Can't we just show
> them?

will do.

>
> >                       has_pending = true;
> >                       break;
> >               }
> > @@ -6311,6 +6341,8 @@ static void show_pwq(struct pool_workqueue *pwq)
> >               list_for_each_entry(work, &pool->worklist, entry) {
> >                       if (get_work_pwq(work) != pwq)
> >                               continue;
> > +                     if (work->func == mayday_pos_func)
> > +                             continue;
>
> Ditto.

will do.

Thanks
Lai
Re: [PATCH V2] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Lai Jiangshan 1 week, 6 days ago
Hello

On Tue, Nov 18, 2025 at 5:34 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> ---
> Changed from v1:
>
> Insert the marker and use it purely as iteration marker as Tejun
> request.
>   It is hard to maintain the proper state of it, expecially maintaining
>   it only in pool->worklist which also requires an unlikely path in the
>   worker_thread() path.

It is hard to maintain its proper state if it is treated as a work
item, as in V1, especially maintaining it only in pool->worklist, which
also requires an unlikely path in the worker_thread() path to defeat
the collision requeuing.

Thanks
Lai

>
> Add an unlikely path in the worker_thread() path(in assign_work()).
>
> Add other code in other place to make it not be treated like a work
> item.  I'm sure all the paths have been covered now, but I still feel
> a bit nevous about it for future changes.
>
> Extra the code to assign work in the rescuer_thread() out as
> assign_rescue_work().
>
> V1: https://lore.kernel.org/lkml/20251113163426.2950-1-jiangshanlai@gmail.com/
>