[PATCH] workqueue: Process rescuer work items one-by-one using a positional marker

There is a newer version of this series
Posted by Lai Jiangshan 2 months, 3 weeks ago
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Previously, the rescuer scanned for all matching work items at once and
processed them all sequentially in the single rescuer thread, so one
blocking work item could stall all the others.

Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.

Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives normal
worker threads a chance to process other items and prevents a blocking
work item from stalling the rest once memory pressure is relieved.

Introduce a positional dummy work item to avoid potentially O(N^2)
rescans of the work list.  The marker records the resume position for
the next scan, eliminating redundant traversals.

The positional marker has an additional benefit:
if it is processed by a normal worker rather than the rescuer, this
indicates that the pool has made forward progress. In that case, the pwq
does not need to be rescued until the next mayday event, allowing the
rescuer to focus on other stalled pools.

Processing items one-by-one also naturally handles newly scheduled work
items during rescue.  For example, work items dispatched via
pwq_activate_first_inactive() or chained queueing will be picked up in
subsequent rescuer iterations without special handling.

Reported-by: ying chen <yc1082463@gmail.com>
Cc: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 99 +++++++++++++++++++++++++++++++---------------
 1 file changed, 67 insertions(+), 32 deletions(-)
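
As an aside, a stand-alone userspace sketch of the cursor idea (toy list and
made-up names, not kernel code): a dummy marker node records where the next
scan should resume, so each pass restarts in O(1) instead of rescanning
pool->worklist from the head.

/*
 * Toy illustration only.  The marker plays the role of mayday_pos_work:
 * it is re-inserted after each claimed item so the next scan resumes
 * right where the previous one stopped.
 */
#include <stdio.h>

struct node {
	struct node *prev, *next;
	int payload;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_insert_before(struct node *pos, struct node *n)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct node head, marker = { .payload = 0 };
	struct node items[5];
	int i;

	list_init(&head);
	for (i = 0; i < 5; i++) {
		items[i].payload = i + 1;
		list_insert_before(&head, &items[i]);	/* append at tail */
	}
	list_insert_before(head.next, &marker);		/* park marker at front */

	/* Process one item per pass, resuming from the marker each time. */
	while (marker.next != &head) {
		struct node *work = marker.next;

		/* advance the marker past the item we are about to take */
		list_del_node(&marker);
		list_insert_before(work->next, &marker);

		list_del_node(work);
		/* in the real code the pool lock would be dropped here */
		printf("processed item %d\n", work->payload);
	}
	list_del_node(&marker);
	return 0;
}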

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 45320e27a16c..af182a19a8b1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -286,6 +286,7 @@ struct pool_workqueue {
 	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
+	struct work_struct	mayday_pos_work;/* L: position on pool->worklist */
 
 	u64			stats[PWQ_NR_STATS];
 
@@ -2982,6 +2983,50 @@ static void idle_cull_fn(struct work_struct *work)
 	reap_dying_workers(&cull_list);
 }
 
+static void mayday_pos_func(struct work_struct *work)
+{
+}
+
+/**
+ * insert_mayday_pos - Insert the positional work for mayday
+ * @pwq: The pwq that needs to be rescued
+ * @next: The work before which the positional work will be inserted
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void insert_mayday_pos(struct pool_workqueue *pwq, struct work_struct *next)
+{
+	unsigned int work_flags;
+	unsigned int work_color;
+
+	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&pwq->mayday_pos_work));
+
+	/* The mayday positional work item does not participate in nr_active. */
+	work_flags = WORK_STRUCT_INACTIVE;
+	work_color = pwq->work_color;
+	work_flags |= work_color_to_flags(work_color);
+	pwq->nr_in_flight[work_color]++;
+	insert_work(pwq, &pwq->mayday_pos_work, &next->entry, work_flags);
+}
+
+/**
+ * remove_mayday_pos - Remove the inserted positional work
+ * @pwq: The pwq that needs to be rescued
+ *
+ * WORK_STRUCT_INACTIVE ensures that the pool lock is not dropped in
+ * pwq_dec_nr_in_flight()
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void remove_mayday_pos(struct pool_workqueue *pwq)
+{
+	list_del_init(&pwq->mayday_pos_work.entry);
+	pwq_dec_nr_in_flight(pwq, *work_data_bits(&pwq->mayday_pos_work));
+	INIT_WORK(&pwq->mayday_pos_work, mayday_pos_func);
+}
+
 static void send_mayday(struct work_struct *work)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
@@ -2992,6 +3037,9 @@ static void send_mayday(struct work_struct *work)
 	if (!wq->rescuer)
 		return;
 
+	if (!work_pending(&pwq->mayday_pos_work))
+		insert_mayday_pos(pwq, work);
+
 	/* mayday mayday mayday */
 	if (list_empty(&pwq->mayday_node)) {
 		/*
@@ -3508,41 +3556,27 @@ static int rescuer_thread(void *__rescuer)
 
 		raw_spin_lock_irq(&pool->lock);
 
-		/*
-		 * Slurp in all works issued via this workqueue and
-		 * process'em.
-		 */
-		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
-		list_for_each_entry_safe(work, n, &pool->worklist, entry) {
-			if (get_work_pwq(work) == pwq &&
-			    assign_work(work, rescuer, &n))
+rescan:
+		/* If the positional work is processed by a normal worker,
+		 * the pwq doesn't need to be rescued. */
+		if (work_pending(&pwq->mayday_pos_work)) {
+			/* scan from the positional work to avoid potentially O(N^2) scanning */
+			work = &pwq->mayday_pos_work;
+			list_for_each_entry_safe_continue(work, n, &pool->worklist, entry) {
+				if (get_work_pwq(work) != pwq)
+					continue;
+				if (!assign_work(work, rescuer, &n))
+					continue;
 				pwq->stats[PWQ_STAT_RESCUED]++;
-		}
-
-		if (!list_empty(&rescuer->scheduled)) {
-			process_scheduled_works(rescuer);
-
-			/*
-			 * The above execution of rescued work items could
-			 * have created more to rescue through
-			 * pwq_activate_first_inactive() or chained
-			 * queueing.  Let's put @pwq back on mayday list so
-			 * that such back-to-back work items, which may be
-			 * being used to relieve memory pressure, don't
-			 * incur MAYDAY_INTERVAL delay inbetween.
-			 */
-			if (pwq->nr_active && need_to_create_worker(pool)) {
-				raw_spin_lock(&wq_mayday_lock);
-				/*
-				 * Queue iff we aren't racing destruction
-				 * and somebody else hasn't queued it already.
-				 */
-				if (wq->rescuer && list_empty(&pwq->mayday_node)) {
-					get_pwq(pwq);
-					list_add_tail(&pwq->mayday_node, &wq->maydays);
+				/* reset the position and handle the assigned work */
+				if (list_next_entry(&pwq->mayday_pos_work, entry) != n) {
+					remove_mayday_pos(pwq);
+					insert_mayday_pos(pwq, n);
 				}
-				raw_spin_unlock(&wq_mayday_lock);
+				process_scheduled_works(rescuer);
+				goto rescan;
 			}
+			remove_mayday_pos(pwq);
 		}
 
 		/*
@@ -5162,6 +5196,7 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	INIT_LIST_HEAD(&pwq->pending_node);
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
+	INIT_WORK(&pwq->mayday_pos_work, mayday_pos_func);
 	kthread_init_work(&pwq->release_work, pwq_release_workfn);
 }
 
-- 
2.19.1.6.gb485710b
Re: [PATCH] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Tejun Heo 2 months, 3 weeks ago
Hello, Lai.

On Fri, Nov 14, 2025 at 12:34:26AM +0800, Lai Jiangshan wrote:
> +static void insert_mayday_pos(struct pool_workqueue *pwq, struct work_struct *next)
> +{
> +	unsigned int work_flags;
> +	unsigned int work_color;
> +
> +	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&pwq->mayday_pos_work));

Maybe use test_and_set_bit() here like a normal work item?

> +static void remove_mayday_pos(struct pool_workqueue *pwq)
> +{
> +	list_del_init(&pwq->mayday_pos_work.entry);
> +	pwq_dec_nr_in_flight(pwq, *work_data_bits(&pwq->mayday_pos_work));
> +	INIT_WORK(&pwq->mayday_pos_work, mayday_pos_func);

and maybe we can init the work item once and

>  static void send_mayday(struct work_struct *work)
>  {
>  	struct pool_workqueue *pwq = get_work_pwq(work);
> @@ -2992,6 +3037,9 @@ static void send_mayday(struct work_struct *work)
>  	if (!wq->rescuer)
>  		return;
>  
> +	if (!work_pending(&pwq->mayday_pos_work))

drop this conditional too?

Also, I wonder whether it'd be simpler to think about if we just exclude the
work item from flush color management. e.g. we can just flag the work item
and then make work scanning skip them, so that they really are just markers;
then, we don't have to worry about colors or othre states at all. We just
insert the marker and use it purely as iteration marker.

> +				/* reset the position and handle the assigned work */
> +				if (list_next_entry(&pwq->mayday_pos_work, entry) != n) {
> +					remove_mayday_pos(pwq);
> +					insert_mayday_pos(pwq, n);

and this would become a simple list_move_tail().

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Lai Jiangshan 2 months, 3 weeks ago
On Fri, Nov 14, 2025 at 1:15 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Lai.
>
> On Fri, Nov 14, 2025 at 12:34:26AM +0800, Lai Jiangshan wrote:
> > +static void insert_mayday_pos(struct pool_workqueue *pwq, struct work_struct *next)
> > +{
> > +     unsigned int work_flags;
> > +     unsigned int work_color;
> > +
> > +     __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&pwq->mayday_pos_work));
>
> Maybe use test_and_set_bit() here like normal work item?

This code is copied from insert_wq_barrier().
I will change it to use test_and_set_bit().

>
> > +static void remove_mayday_pos(struct pool_workqueue *pwq)
> > +{
> > +     list_del_init(&pwq->mayday_pos_work.entry);
> > +     pwq_dec_nr_in_flight(pwq, *work_data_bits(&pwq->mayday_pos_work));
> > +     INIT_WORK(&pwq->mayday_pos_work, mayday_pos_func);
>
> and maybe we can init the work item once and


It has to clear the pending bit, and INIT_WORK() is quite convenient for
that. I will use set_work_pool_and_keep_pending(&pwq->mayday_pos_work,
WORK_OFFQ_POOL_NONE, 0) instead.

remove_mayday_pos() is effectively the owning rescuer processing/cancelling
the positional work without releasing the pool lock. Maybe the function
should be renamed.

>
> >  static void send_mayday(struct work_struct *work)
> >  {
> >       struct pool_workqueue *pwq = get_work_pwq(work);
> > @@ -2992,6 +3037,9 @@ static void send_mayday(struct work_struct *work)
> >       if (!wq->rescuer)
> >               return;
> >
> > +     if (!work_pending(&pwq->mayday_pos_work))
>
> drop this conditional too?
>
> Also, I wonder whether it'd be simpler to think about if we just exclude the
> work item from flush color management. e.g. we can just flag the work item
> and then make work scanning skip them, so that they really are just markers;
> then, we don't have to worry about colors or othre states at all. We just
> insert the marker and use it purely as iteration marker.

I think it should be processable by the normal workers, so that we don't
have to add special code to the fast path in worker_thread(), and it should
behave like a normal work item or a wq_barrier.

Being processable by the normal workers has an additional benefit:
if the marker is processed by a normal worker rather than the rescuer, it
indicates that the pool has made forward progress via normal workers,
and the pwq does not need to be rescued until the next mayday event.

Like the wq_barrier, the positional work should not contribute to nr_active,
i.e. it is INACTIVE.

Commit d812796eb390 ("workqueue: Assign a color to barrier work items")
explains why we need a color even for an INACTIVE item; I don't think it is
a good idea to leave it uncolored.
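
Roughly, the bookkeeping in question (a stand-alone toy model; names are
made up and heavily simplified): every item on the list is stamped with the
current work color and counted per color, and a flush waits for its flush
color to drain, which is why even an INACTIVE marker on the list needs an
eligible color.

#include <stdio.h>

#define NR_COLORS 16

static int nr_in_flight[NR_COLORS];	/* items still pending/running per color */
static int work_color;			/* color stamped on newly queued items */

static int queue_item(void)
{
	nr_in_flight[work_color]++;
	return work_color;		/* the item remembers its color */
}

static void finish_item(int color)
{
	nr_in_flight[color]--;
}

int main(void)
{
	int a = queue_item();
	int b = queue_item();

	/* start a flush: later items get a new color ... */
	int flush_color = work_color;
	work_color = (work_color + 1) % NR_COLORS;

	int c = queue_item();		/* not covered by this flush */

	finish_item(a);
	finish_item(b);

	/* ... and the flush completes once the old color drains */
	printf("flush color %d drained: %s\n", flush_color,
	       nr_in_flight[flush_color] == 0 ? "yes" : "no");
	finish_item(c);
	return 0;
}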

>
> > +                             /* reset the position and handle the assigned work */
> > +                             if (list_next_entry(&pwq->mayday_pos_work, entry) != n) {
> > +                                     remove_mayday_pos(pwq);
> > +                                     insert_mayday_pos(pwq, n);
>
> and this would become simple list_move_tail().
>

The intent was to update the positional work's color so that back-to-back
work items don't stall flush_workqueue().

But it did it wrong. Maybe this is better:

if (get_work_color(*work_data_bits(&pwq->mayday_pos_work)) != pwq->work_color) {
  remove_mayday_pos(pwq);
  insert_mayday_pos(pwq, n);
} else {
  list_move_tail(&pwq->mayday_pos_work.entry, &n->entry);
}

Thanks
Lai
Re: [PATCH] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Tejun Heo 2 months, 3 weeks ago
Hello,

On Fri, Nov 14, 2025 at 10:41:04AM +0800, Lai Jiangshan wrote:
> > Also, I wonder whether it'd be simpler to think about if we just exclude the
> > work item from flush color management. e.g. we can just flag the work item
> > and then make work scanning skip them, so that they really are just markers;
> > then, we don't have to worry about colors or othre states at all. We just
> > insert the marker and use it purely as iteration marker.
> 
> I think it should be processable by the normal workers, so that we don't
> have to add special code in the fast path in worker_thread(), and it should work
> like a normal work item or the wq_barrier.

I don't know. I don't think a single unlikely() test is going to be
noticeable.

> The ability of processable by the normal workers has an additional benefit:
> if it is processed by a normal worker rather than the rescuer, this
> indicates that the pool has made forward progress by normal workers
> and  the pwq does not need to be rescued until the next mayday event.

Is that correct, though? A worker thread executes any work item on the list and
the pool may be shared by any number of workqueues, so the cursor can be
cleared without any of the work items that belong to the rescuer's workqueue
being executed, no?

> It intended to update the positional work color to avoid back-to-back
> work items to
> stall the flush_workqueue().
> 
> But it did it wrong. Maybe this is better:
> 
> if (get_work_color(*work_data_bits(&pwq->mayday_pos_work)) != pwq->work_color) {
> {
>   remove_mayday_pos(pwq);
>   insert_mayday_pos(pwq, n);
> } else {
>   list_move_tail(&pwq->mayday_pos_work.entry, &n->entry);
> }

The colored flush mechanism is already complex enough and difficult to wrap
one's head around. I don't really like adding more complications there.

Thanks.

-- 
tejun
Re: [PATCH] workqueue: Process rescuer work items one-by-one using a positional marker
Posted by Lai Jiangshan 2 months, 3 weeks ago
Hello

Some extra stuff.

On Fri, Nov 14, 2025 at 12:30 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> @@ -286,6 +286,7 @@ struct pool_workqueue {
>         struct list_head        pending_node;   /* LN: node on wq_node_nr_active->pending_pwqs */
>         struct list_head        pwqs_node;      /* WR: node on wq->pwqs */
>         struct list_head        mayday_node;    /* MD: node on wq->maydays */
> +       struct work_struct      mayday_pos_work;/* L: position on pool->worklist */

pwq->mayday_pos_work, pwq->release_work and pwq->rcu can be made to share
the same memory.  It will be done in a later cleanup patch.

> +static void insert_mayday_pos(struct pool_workqueue *pwq, struct work_struct *next)
> +{
> +       unsigned int work_flags;
> +       unsigned int work_color;
> +
> +       __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&pwq->mayday_pos_work));
> +
> +       /* The mayday positional work item does not participate in nr_active. */
> +       work_flags = WORK_STRUCT_INACTIVE;
> +       work_color = pwq->work_color;

I thought for quite a while about the color for the INACTIVE marker. I think
the specific color doesn't matter as long as it is an "eligible" one, so
pwq->work_color is used here rather than inheriting the color from an
associated work item, even though I can easily get the associated one for
this special positional work item.

And I think struct wq_barrier can also use pwq->work_color, and the common
code in multiple places can be moved into insert_work(), which would result
in much simpler code.

It will be done in a later cleanup patch.

> +       work_flags |= work_color_to_flags(work_color);
> +       pwq->nr_in_flight[work_color]++;
> +       insert_work(pwq, &pwq->mayday_pos_work, &next->entry, work_flags);
> +}
> +

Thanks
Lai