If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
the rescuer thread will be woken up to process works scheduled on
@pool, resulting in sequential execution of all works. This may lead
to a situation where one work blocks others. However, the initial
rescue timeout defaults to 10 milliseconds, a threshold that is
easily exceeded in heavy-load environments.
---
Documentation/admin-guide/kernel-parameters.txt | 4 ++++
kernel/workqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 149bfa7..be3f488 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7376,6 +7376,10 @@
 			When enabled, memory and cache locality will be
 			impacted.
 
+	workqueue.mayday_initial_timeout
+			Set the initial timeout (jiffies) for the mayday timer.
+			Default is MAYDAY_INITIAL_TIMEOUT.
+
 	writecombine=	[LOONGARCH,EARLY] Control the MAT (Memory Access
 			Type) of ioremap_wc().
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 003474c..c810b61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -481,6 +481,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_work [NR_STD_WORKER_POOLS],
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 				     bh_worker_pools);
 
+static unsigned long wq_mayday_initial_timeout = MAYDAY_INITIAL_TIMEOUT;
+module_param_named(mayday_initial_timeout, wq_mayday_initial_timeout, ulong, 0644);
+
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 				     cpu_worker_pools);
@@ -3050,7 +3053,7 @@ static void maybe_create_worker(struct worker_pool *pool)
 	raw_spin_unlock_irq(&pool->lock);
 
 	/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
-	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
+	mod_timer(&pool->mayday_timer, jiffies + wq_mayday_initial_timeout);
 
 	while (true) {
 		if (create_worker(pool) || !need_to_create_worker(pool))
--
1.8.3.1
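For reference, if this patch were applied, the knob would be reachable in
the two usual ways for a built-in module parameter (a sketch of expected
usage, not something that exists in mainline; the value is in jiffies, and
the 0644 permission in the patch makes it writable at runtime):

	# on the kernel command line:
	workqueue.mayday_initial_timeout=100

	# or at runtime via sysfs:
	echo 100 > /sys/module/workqueue/parameters/mayday_initial_timeout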
Hello,

On Tue, Nov 11, 2025 at 10:52:44AM +0800, ying chen wrote:
> If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
> the rescuer thread will be woken up to process works scheduled on
> @pool, resulting in sequential execution of all works. This may lead
> to a situation where one work blocks others. However, the initial
> rescue timeout defaults to 10 milliseconds, a threshold that is
> easily exceeded in heavy-load environments.

This is not how workqueue works. Rescuer doesn't exclude other workers. If
other workers become available, they will run the workqueue concurrently.
All that initial timeout achieves is delaying the initial execution from the
rescuer.

Is this from observing real behaviors? If so, what was the test case and
how did the behavior change after the patch? It couldn't have gotten better.

Thanks.

--
tejun
On Wed, Nov 12, 2025 at 4:40 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Nov 11, 2025 at 10:52:44AM +0800, ying chen wrote:
> > If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
> > the rescuer thread will be woken up to process works scheduled on
> > @pool, resulting in sequential execution of all works. This may lead
> > to a situation where one work blocks others. However, the initial
> > rescue timeout defaults to 10 milliseconds, a threshold that is
> > easily exceeded in heavy-load environments.
>
> This is not how workqueue works. Rescuer doesn't exclude other workers. If
> other workers become available, they will run the workqueue concurrently.
> All that initial timeout achieves is delaying the initial execution from the
> rescuer.
>
> Is this from observing real behaviors? If so, what was the test case and
> how did the behavior change after the patch? It couldn't have gotten better.
>
> Thanks.
>
> --
> tejun
We encountered an XFS deadlock issue related to the following patch:

    xfs: don't use BMBT btree split workers for IO completion
    (commit c85007e2e3942da1f9361e4b5a9388ea3a8dcc5b)

However, unlike the scenario described in that patch, in our case the
rescuer thread was still woken up even when memory was sufficient, likely
due to heavy load.
Works that have already been scheduled will be executed sequentially
within the rescuer thread.
static int rescuer_thread(void *__rescuer)
{
	......
	/*
	 * Slurp in all works issued via this workqueue and
	 * process'em.
	 */
	WARN_ON_ONCE(!list_empty(scheduled));
	list_for_each_entry_safe(work, n, &pool->worklist, entry) {
		if (get_work_pwq(work) == pwq) {
			if (first)
				pool->watchdog_ts = jiffies;
			move_linked_works(work, scheduled, &n);
		}
		first = false;
	}

	if (!list_empty(scheduled)) {
		process_scheduled_works(rescuer);
	......
}
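To make the blocking concern concrete, consider a hypothetical pair of
works on one rescuer-backed (WQ_MEM_RECLAIM) workqueue; the names and
bodies are illustrative only, not from our system:

	#include <linux/workqueue.h>

	static void work_b_fn(struct work_struct *work)
	{
		/* e.g. completes I/O and frees memory */
	}
	static DECLARE_WORK(work_b, work_b_fn);

	static void work_a_fn(struct work_struct *work)
	{
		/*
		 * A waits on B, but B was queued behind A on the same
		 * workqueue. Once the rescuer slurps both into
		 * 'scheduled', it runs A first and blocks here until a
		 * regular worker can be created to run B.
		 */
		flush_work(&work_b);
	}
	static DECLARE_WORK(work_a, work_a_fn);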
Hello,
On Wed, Nov 12, 2025 at 10:01:10AM +0800, ying chen wrote:
> Works that have already been scheduled will be executed sequentially
> within the rescuer thread.
> static int rescuer_thread(void *__rescuer)
> {
> 	......
> 	/*
> 	 * Slurp in all works issued via this workqueue and
> 	 * process'em.
> 	 */
> 	WARN_ON_ONCE(!list_empty(scheduled));
> 	list_for_each_entry_safe(work, n, &pool->worklist, entry) {
> 		if (get_work_pwq(work) == pwq) {
> 			if (first)
> 				pool->watchdog_ts = jiffies;
> 			move_linked_works(work, scheduled, &n);
> 		}
> 		first = false;
> 	}
>
> 	if (!list_empty(scheduled)) {
> 		process_scheduled_works(rescuer);
> 	......
> }
Ah, I see what you mean. The slurping is to avoid potentially O(N^2)
scanning, but that's probably the wrong trade-off to make here. I think the
right solution is making it break out after finding the first matching work
item and loop outside so that it processes work items one by one.
Thanks.
--
tejun
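A minimal sketch of the break-out approach described above, reusing the
identifiers from the rescuer_thread() excerpt; this is hypothetical, and
the handling of pool->lock (which process_scheduled_works() releases and
reacquires) is elided:

	/* Take one matching work item at a time instead of slurping
	 * the whole worklist into 'scheduled'. */
	while (true) {
		struct work_struct *work, *n;
		bool found = false;

		list_for_each_entry_safe(work, n, &pool->worklist, entry) {
			if (get_work_pwq(work) == pwq) {
				pool->watchdog_ts = jiffies;
				move_linked_works(work, scheduled, &n);
				found = true;
				break;	/* stop at the first match */
			}
		}
		if (!found)
			break;
		process_scheduled_works(rescuer);
	}

Rescanning from the head of the worklist on each iteration is where the
potential O(N^2) cost returns; that is what the positional-marker variant
mentioned below avoids.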
Hello,

On Thu, Nov 13, 2025 at 12:03 AM Tejun Heo <tj@kernel.org> wrote:
>
> Ah, I see what you mean. The slurping is to avoid potentially O(N^2)
> scanning, but that's probably the wrong trade-off to make here. I think the
> right solution is making it break out after finding the first matching work
> item and loop outside so that it processes work items one by one.

I'm implementing it. A positional dummy work item is added to mark the
position for the next scan after processing the previous one. I'll send it
soon.

Still, I suggest that the workqueue user eliminate any dependencies among
the work items when using a rescuer. If the memory pressure is not relieved
and no normal workers come to help, it simply won't work.

Thanks,
Lai
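A rough sketch of how such a positional dummy item might work (hypothetical,
not Lai's actual patch; a real implementation would also need every other
walker of pool->worklist to skip the marker):

	/* A dummy work_struct anchors the scan position so that each
	 * rescan resumes where the previous one stopped, keeping the
	 * overall scan O(N). */
	struct work_struct marker = { };	/* list anchor, never executed */
	struct work_struct *work;

	list_add(&marker.entry, &pool->worklist);
	while (!list_is_last(&marker.entry, &pool->worklist)) {
		work = list_next_entry(&marker, entry);
		if (get_work_pwq(work) != pwq) {
			/* not ours: advance the marker past this work */
			list_move(&marker.entry, &work->entry);
			continue;
		}
		pool->watchdog_ts = jiffies;
		move_linked_works(work, scheduled, NULL);
		process_scheduled_works(rescuer);	/* may drop pool->lock */
	}
	list_del(&marker.entry);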
On Thu, Nov 13, 2025 at 12:03 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Nov 12, 2025 at 10:01:10AM +0800, ying chen wrote:
> > Works that have already been scheduled will be executed sequentially
> > within the rescuer thread.
> > static int rescuer_thread(void *__rescuer)
> > {
> > 	......
> > 	/*
> > 	 * Slurp in all works issued via this workqueue and
> > 	 * process'em.
> > 	 */
> > 	WARN_ON_ONCE(!list_empty(scheduled));
> > 	list_for_each_entry_safe(work, n, &pool->worklist, entry) {
> > 		if (get_work_pwq(work) == pwq) {
> > 			if (first)
> > 				pool->watchdog_ts = jiffies;
> > 			move_linked_works(work, scheduled, &n);
> > 		}
> > 		first = false;
> > 	}
> >
> > 	if (!list_empty(scheduled)) {
> > 		process_scheduled_works(rescuer);
> > 	......
> > }
>
> Ah, I see what you mean. The slurping is to avoid potentially O(N^2)
> scanning, but that's probably the wrong trade-off to make here. I think the
> right solution is making it break out after finding the first matching work
> item and loop outside so that it processes work items one by one.
Processing work items one-by-one is indeed an excellent solution.
However, wouldn't it also be necessary to provide a method for
adjusting the mayday initial timeout?
Thanks.
Hello,

On Thu, Nov 13, 2025 at 10:34:43AM +0800, ying chen wrote:
> Processing work items one-by-one is indeed an excellent solution.
> However, wouldn't it also be necessary to provide a method for
> adjusting the mayday initial timeout?

Adding an interface like that isn't difficult but I'm not sure what it
would achieve. A rescuer is there to guarantee forward progress when the
system is under memory pressure and processing work items of the workqueue
may be required to free up memory. IOW, it's there for the cases where that
workqueue not making forward progress can lead to system deadlock. As such,
this doesn't have much in the way of system performance implications (aside
from the serialization effect that you raised). If the rescuer is needed,
the system is in tatters anyway, especially in terms of latency response,
so I'm not sure what fine-tuning the rescuer response time would help with.

Thanks.

--
tejun
On Tue, Nov 11, 2025 at 10:52:44AM +0800, ying chen wrote:
> If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
> the rescuer thread will be woken up to process works scheduled on
> @pool, resulting in sequential execution of all works. This may lead
> to a situation where one work blocks others. However, the initial
> rescue timeout defaults to 10 milliseconds, a threshold that is
> easily exceeded in heavy-load environments.

Then we should tune it automatically, not blame the admin for not tuning
this parameter.
Hello, Chen

Thanks for trying to improve this.

On Tue, Nov 11, 2025 at 10:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 11, 2025 at 10:52:44AM +0800, ying chen wrote:
> > If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
> > the rescuer thread will be woken up to process works scheduled on
> > @pool, resulting in sequential execution of all works. This may lead
> > to a situation where one work blocks others. However, the initial
> > rescue timeout defaults to 10 milliseconds, a threshold that is
> > easily exceeded in heavy-load environments.

This is what it was intended for. A workqueue equipped with a rescuer is
effectively a single-threaded workqueue, where deadlock can occur when
works depend on each other. It is recommended that the user modify the
works to eliminate any dependencies, or consider using multiple workqueues.
Increasing the timeout doesn't solve the problem; it merely hides it.

Changing the workqueue implementation to support concurrency for the
rescuer would complicate the design unnecessarily. The use of a rescuer is
only for memory reclamation, where the works should be simple, focusing on
freeing memory rather than waiting on each other.

Thanks,
Lai

> Then we should tune it automatically, not blame the admin for not tuning
> this parameter.
On Tue, Nov 11, 2025 at 10:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 11, 2025 at 10:52:44AM +0800, ying chen wrote:
> > If creating a new worker takes longer than MAYDAY_INITIAL_TIMEOUT,
> > the rescuer thread will be woken up to process works scheduled on
> > @pool, resulting in sequential execution of all works. This may lead
> > to a situation where one work blocks others. However, the initial
> > rescue timeout defaults to 10 milliseconds, a threshold that is
> > easily exceeded in heavy-load environments.
>
> Then we should tune it automatically, not blame the admin for not tuning
> this parameter.

Currently, dynamic tuning of the mayday timer's initial timeout is not
supported.