While a BH work item is being canceled, the core code spins until it
determines that the item has completed. On PREEMPT_RT the spinning relies
on a lock in local_bh_disable() to avoid a live lock if the canceling
thread has a higher priority than the BH-worker and preempts it. This
lock ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and is about to
be removed.

To provide the required synchronisation, add a per-pool lock. The lock is
acquired by the bh_worker at the beginning and held while the individual
callbacks are invoked. To enforce progress in case of interruption,
__flush_work() needs to acquire the lock.
This will flush all BH-work items assigned to that pool.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 42 insertions(+), 9 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c31..94e226f637992 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -222,7 +222,9 @@ struct worker_pool {
struct workqueue_attrs *attrs; /* I: worker attributes */
struct hlist_node hash_node; /* PL: unbound_pool_hash node */
int refcnt; /* PL: refcnt for unbound pools */
-
+#ifdef CONFIG_PREEMPT_RT
+ spinlock_t cb_lock; /* BH worker cancel lock */
+#endif
/*
* Destruction of pool is RCU protected to allow dereferences
* from get_work_pool().
@@ -3078,6 +3080,31 @@ __acquires(&pool->lock)
goto restart;
}
+#ifdef CONFIG_PREEMPT_RT
+static void worker_lock_callback(struct worker_pool *pool)
+{
+ spin_lock(&pool->cb_lock);
+}
+
+static void worker_unlock_callback(struct worker_pool *pool)
+{
+ spin_unlock(&pool->cb_lock);
+}
+
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
+{
+ spin_lock(&pool->cb_lock);
+ spin_unlock(&pool->cb_lock);
+}
+
+#else
+
+static void worker_lock_callback(struct worker_pool *pool) { }
+static void worker_unlock_callback(struct worker_pool *pool) { }
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool) { }
+
+#endif
+
/**
* manage_workers - manage worker pool
* @worker: self
@@ -3557,6 +3584,7 @@ static void bh_worker(struct worker *worker)
int nr_restarts = BH_WORKER_RESTARTS;
unsigned long end = jiffies + BH_WORKER_JIFFIES;
+ worker_lock_callback(pool);
raw_spin_lock_irq(&pool->lock);
worker_leave_idle(worker);
@@ -3585,6 +3613,7 @@ static void bh_worker(struct worker *worker)
worker_enter_idle(worker);
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);
+ worker_unlock_callback(pool);
}
/*
@@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
(data & WORK_OFFQ_BH)) {
/*
* On RT, prevent a live lock when %current preempted
- * soft interrupt processing or prevents ksoftirqd from
- * running by keeping flipping BH. If the BH work item
- * runs on a different CPU then this has no effect other
- * than doing the BH disable/enable dance for nothing.
- * This is copied from
- * kernel/softirq.c::tasklet_unlock_spin_wait().
+ * soft interrupt processing by blocking on lock which
+ * is owned by the thread invoking the callback.
*/
while (!try_wait_for_completion(&barr.done)) {
if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
- local_bh_disable();
- local_bh_enable();
+ struct worker_pool *pool;
+
+ mutex_lock(&wq_pool_mutex);
+ pool = get_work_pool(work);
+ if (pool)
+ workqueue_callback_cancel_wait_running(pool);
+ mutex_unlock(&wq_pool_mutex);
} else {
cpu_relax();
}
@@ -4782,6 +4812,9 @@ static int init_worker_pool(struct worker_pool *pool)
ida_init(&pool->worker_ida);
INIT_HLIST_NODE(&pool->hash_node);
pool->refcnt = 1;
+#ifdef CONFIG_PREEMPT_RT
+ spin_lock_init(&pool->cb_lock);
+#endif
/* shouldn't fail above this point */
pool->attrs = alloc_workqueue_attrs();
--
2.51.0
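
Condensed, non-verbatim sketch of the handshake the patch adds (the helper
and field names are taken from the diff above, but the worker loop is
elided and simplified, so treat this as an illustration rather than the
exact kernel code):

/* Worker side (PREEMPT_RT): hold the per-pool cb_lock for the whole
 * batch of callback invocations.  cb_lock is a sleeping, PI-aware
 * spinlock_t on RT. */
static void bh_worker(struct worker *worker)
{
	struct worker_pool *pool = worker->pool;

	worker_lock_callback(pool);	/* spin_lock(&pool->cb_lock) on RT */
	raw_spin_lock_irq(&pool->lock);
	/* ... dequeue and invoke the pending BH work items ... */
	raw_spin_unlock_irq(&pool->lock);
	worker_unlock_callback(pool);	/* spin_unlock(&pool->cb_lock) on RT */
}

/* Canceler side: a lock/unlock pair means "wait until the worker's
 * current batch has finished".  If the canceler preempted the BH worker,
 * blocking on cb_lock PI-boosts the worker so it can make progress. */
static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
{
	spin_lock(&pool->cb_lock);
	spin_unlock(&pool->cb_lock);
}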
On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> (data & WORK_OFFQ_BH)) {
> /*
> * On RT, prevent a live lock when %current preempted
> - * soft interrupt processing or prevents ksoftirqd from
> - * running by keeping flipping BH. If the BH work item
> - * runs on a different CPU then this has no effect other
> - * than doing the BH disable/enable dance for nothing.
> - * This is copied from
> - * kernel/softirq.c::tasklet_unlock_spin_wait().
> + * soft interrupt processing by blocking on lock which
> + * is owned by the thread invoking the callback.
> */
> while (!try_wait_for_completion(&barr.done)) {
> if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> - local_bh_disable();
> - local_bh_enable();
> + struct worker_pool *pool;
> +
> + mutex_lock(&wq_pool_mutex);
> + pool = get_work_pool(work);
> + if (pool)
> + workqueue_callback_cancel_wait_running(pool);
> + mutex_unlock(&wq_pool_mutex);

The goal is to avoid using a potentially sleeping function in
__flush_work() for BH work items on PREEMPT_RT, but
mutex_lock(&wq_pool_mutex) is not appropriate in this context.

To obtain the pool of a work item, the preferred approach is to use
rcu_read_lock() together with get_work_pool(work), as is done in
start_flush_work().

> } else {
> cpu_relax();
> }
On Wed, Sep 03, 2025 at 03:30:08PM +0800, Lai Jiangshan wrote:
> On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>
> > @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> > (data & WORK_OFFQ_BH)) {
> > /*
> > * On RT, prevent a live lock when %current preempted
> > - * soft interrupt processing or prevents ksoftirqd from
> > - * running by keeping flipping BH. If the BH work item
> > - * runs on a different CPU then this has no effect other
> > - * than doing the BH disable/enable dance for nothing.
> > - * This is copied from
> > - * kernel/softirq.c::tasklet_unlock_spin_wait().
> > + * soft interrupt processing by blocking on lock which
> > + * is owned by the thread invoking the callback.
> > */
> > while (!try_wait_for_completion(&barr.done)) {
> > if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > - local_bh_disable();
> > - local_bh_enable();
> > + struct worker_pool *pool;
> > +
> > + mutex_lock(&wq_pool_mutex);
> > + pool = get_work_pool(work);
> > + if (pool)
> > + workqueue_callback_cancel_wait_running(pool);
> > + mutex_unlock(&wq_pool_mutex);
>
> The goal is to avoid using a potentially sleeping function in
> __flush_work() for BH work items on PREEMPT_RT, but
> mutex_lock(&wq_pool_mutex) is not appropriate in this context.
>
> To obtain the pool of a work item, the preferred approach is to use
> rcu_read_lock() together with get_work_pool(work), as is done in
> start_flush_work().

Yeah, Sebastian, can you please switch it to rcu_read_lock()?

Thanks.

--
tejun
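
A minimal sketch of the hunk with the change suggested above, assuming
the rest of the posted patch stays as is: the pool is looked up under
rcu_read_lock(), as start_flush_work() does, instead of taking
wq_pool_mutex. On PREEMPT_RT the pool's cb_lock is a sleeping spinlock,
which is still permitted inside a (preemptible) RCU read-side section.
This only illustrates the reviewers' suggested direction, not the
author's final patch:

	while (!try_wait_for_completion(&barr.done)) {
		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
			struct worker_pool *pool;

			/* get_work_pool() only requires the RCU read lock. */
			rcu_read_lock();
			pool = get_work_pool(work);
			if (pool)
				workqueue_callback_cancel_wait_running(pool);
			rcu_read_unlock();
		} else {
			cpu_relax();
		}
	}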
Hello

On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> While a BH work item is being canceled, the core code spins until it
> determines that the item has completed. On PREEMPT_RT the spinning relies
> on a lock in local_bh_disable() to avoid a live lock if the canceling
> thread has a higher priority than the BH-worker and preempts it. This
> lock ensures that the BH-worker makes progress by PI-boosting it.
>
> This lock in local_bh_disable() is a central per-CPU BKL and is about to
> be removed.
>
> To provide the required synchronisation, add a per-pool lock. The lock is
> acquired by the bh_worker at the beginning and held while the individual
> callbacks are invoked. To enforce progress in case of interruption,
> __flush_work() needs to acquire the lock.
> This will flush all BH-work items assigned to that pool.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c6b79b3675c31..94e226f637992 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -222,7 +222,9 @@ struct worker_pool {
> struct workqueue_attrs *attrs; /* I: worker attributes */
> struct hlist_node hash_node; /* PL: unbound_pool_hash node */
> int refcnt; /* PL: refcnt for unbound pools */
> -
> +#ifdef CONFIG_PREEMPT_RT
> + spinlock_t cb_lock; /* BH worker cancel lock */
> +#endif
> /*

Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
and rt_mutex_wait_proxy_lock()?

Or is it possible to add something like rt_spinlock_init_proxy_locked(),
rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I think they will work as an rt variant of struct completion and
they can be used for __flush_work() for BH work for preempt_rt
in the same way as wait_for_completion() is used for normal work.

Thanks
Lai
On 2025-09-02 18:12:10 [+0800], Lai Jiangshan wrote:
> Hello
Hi Lai,

> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -222,7 +222,9 @@ struct worker_pool {
> > struct workqueue_attrs *attrs; /* I: worker attributes */
> > struct hlist_node hash_node; /* PL: unbound_pool_hash node */
> > int refcnt; /* PL: refcnt for unbound pools */
> > -
> > +#ifdef CONFIG_PREEMPT_RT
> > + spinlock_t cb_lock; /* BH worker cancel lock */
> > +#endif
> > /*
>
> Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> and rt_mutex_wait_proxy_lock()?
>
> Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I don't think so. I think non-sleep context is the killer part. Those
are for PI and this works by assigning waiter's priority, going to sleep
until "it" is done. Now if you want non-sleep then you would have to
remain on the CPU and spin until the "work" is done. This spinning would
work if the other task is on a remote CPU. But if both are on the same
CPU then spinning is not working.

> I think they will work as an rt variant of struct completion and
> they can be used for __flush_work() for BH work for preempt_rt
> in the same way as wait_for_completion() is used for normal work.

I completely dislike the idea of spinning until completion for the BH
work. Yes, we have three tasklet users which are doing this as of today.
We have no users which require this for workqueue, and you can queue
work from interrupt context. Why do we even want this?

My idea was to submit this and then hide it later behind an "atomic" API
which will hopefully remain unused. As explained in the previous thread:
from RT's point of view it requires acquiring and dropping the cb_lock
each time and boosting all items on that worker until the "barrier" work
item arrives, while in general only the "one" item is boosted if needed.

> Thanks
> Lai

Sebastian
Hello, Sebastian

On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > and rt_mutex_wait_proxy_lock()?
> >
> > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
>
> I don't think so. I think non-sleep context is the killer part. Those
> are for PI and this works by assigning waiter's priority, going to sleep
> until "it" is done. Now if you want non-sleep then you would have to
> remain on the CPU and spin until the "work" is done. This spinning would
> work if the other task is on a remote CPU. But if both are on the same
> CPU then spinning is not working.
>

I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
work similarly to the rt_mutex proxy lock, which would wait until the
boosted task (in this case, the kthread running the BH work) calls
rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
version of spin_lock, where the task blocked on a spin_lock has a
special style of blocked/sleep instead of spinning on the CPU and this
is what the prefix "rt_spinlock" means.

By the way, I’m not objecting to this patch — I just want to explore
whether there might be other options.

Thanks
Lai
On 2025-09-02 22:19:26 [+0800], Lai Jiangshan wrote:
> Hello, Sebastian
Hi Lai,

> On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> > > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > > and rt_mutex_wait_proxy_lock()?
> > >
> > > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
> >
> > I don't think so. I think non-sleep context is the killer part. Those
> > are for PI and this works by assigning waiter's priority, going to sleep
> > until "it" is done. Now if you want non-sleep then you would have to
> > remain on the CPU and spin until the "work" is done. This spinning would
> > work if the other task is on a remote CPU. But if both are on the same
> > CPU then spinning is not working.
> >
> I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
> work similarly to the rt_mutex proxy lock, which would wait until the
> boosted task (in this case, the kthread running the BH work) calls
> rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
> version of spin_lock, where the task blocked on a spin_lock has a
> special style of blocked/sleep instead of spinning on the CPU and this
> is what the prefix "rt_spinlock" means.

That interface actually implements the boosting for users which don't
use it directly. By "it" I mean the proper rtmutex.

This is used by the PI/futex code where an rtmutex is created as a
substitute for the lock that is held by the user in userland. The lock
is acquired in userland without the kernel's involvement. So in case of
contention the user goes to the kernel and creates an rtmutex as a
representation of the userland lock in the kernel and assigns it to the
task that is holding the userland lock. Now you can block on the lock
and the userland task is forced to go to the kernel for unlocking.

For RCU, as far as I remember, a task within an RCU read-side section
can get preempted and in that case it adds itself to a list during
schedule() so it can be identified later on. There can be more than one
task which is preempted within an RCU section and so blocks a GP. The
boost mutex is assigned to the first task currently blocking the GP and
then to the next one if needed.
A poor substitute would be something like taking a lock during
schedule() and having a list of all those locks in case boosting is
needed so they can be acquired one by one.

> By the way, I’m not objecting to this patch — I just want to explore
> whether there might be other options.

Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
would need to assign the "acquired" lock to the bh_worker() task in
__flush_work() and then block on that lock in __flush_work(). In order
to figure out which task it is, you need some bookkeeping for it, and a
lock to synchronise adding/removing tasks on that list (bookkeeping) as
well as accessing the lock itself in case of "contention".
So given all this, that approach looks slightly more complicated. You
would avoid the need to acquire the lock in bh_worker() but you would
also substitute it with bookkeeping and its locking elsewhere. So I am
not sure it is worth it.

On !RT you can have only one running softirq at a time. On RT, with the
removal of the lock in local_bh_disable() (patch #3), there can be
multiple softirq instances in parallel on the same CPU. The primary
goal is to avoid a central bottleneck and make it possible to have one
NIC doing throughput and another NIC doing low-latency packets,
allowing the low-latency NIC to preempt the throughput NIC in the
middle of sending/receiving packets instead of waiting for NAPI to do
a handover. The lock I'm adding here provides some synchronisation.

So you see how this requirement for the three legacy users makes it
slightly more complicated, especially after the cleanup years ago…
However I hope to now come up with an atomic API, as Tejun suggested,
and push it behind Kconfig bars or so.

> Thanks
> Lai

Sebastian
On Tue, Sep 2, 2025 at 11:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> That interface actually implements the boosting for users which don't
> use it directly. By "it" I mean the proper rtmutex.
>
> This is used by the PI/futex code where an rtmutex is created as a
> substitute for the lock that is held by the user in userland. The lock
> is acquired in userland without the kernel's involvement. So in case of
> contention the user goes to the kernel and creates an rtmutex as a
> representation of the userland lock in the kernel and assigns it to the
> task that is holding the userland lock. Now you can block on the lock
> and the userland task is forced to go to the kernel for unlocking.
>
> For RCU, as far as I remember, a task within an RCU read-side section
> can get preempted and in that case it adds itself to a list during
> schedule() so it can be identified later on. There can be more than one
> task which is preempted within an RCU section and so blocks a GP. The
> boost mutex is assigned to the first task currently blocking the GP and
> then to the next one if needed.
> A poor substitute would be something like taking a lock during
> schedule() and having a list of all those locks in case boosting is
> needed so they can be acquired one by one.
>

Well explained why a “proxy” lock is needed in these two cases — thanks
a lot.

> Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
> would need to assign the "acquired" lock to the bh_worker() task in
> __flush_work() and then block on that lock in __flush_work(). In order
> to figure out which task it is, you need some bookkeeping for it, and a
> lock to synchronise adding/removing tasks on that list (bookkeeping) as
> well as accessing the lock itself in case of "contention".
> So given all this, that approach looks slightly more complicated. You
> would avoid the need to acquire the lock in bh_worker() but you would
> also substitute it with bookkeeping and its locking elsewhere. So I am
> not sure it is worth it.
>
> On !RT you can have only one running softirq at a time. On RT, with the
> removal of the lock in local_bh_disable() (patch #3), there can be
> multiple softirq instances in parallel on the same CPU. The primary
> goal is to avoid a central bottleneck and make it possible to have one
> NIC doing throughput and another NIC doing low-latency packets,
> allowing the low-latency NIC to preempt the throughput NIC in the
> middle of sending/receiving packets instead of waiting for NAPI to do
> a handover.
>

The bookkeeping isn’t necessarily complicated. For bh_worker() on
PREEMPT_RT, it could be done in worker_leave_idle(worker) with
pool->lock already held. worker->task could be repurposed to record the
task running bh_worker(), which I think would also be useful for
debugging even on !RT.

That said, since a proxy lock isn’t appropriate here, let’s just leave
it aside.
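
A hypothetical sketch of the bookkeeping described above (the helper name
and its exact call sites are assumptions, not taken from any posted
patch): record the task currently executing a BH pool's callbacks so a
canceler could locate it, e.g. to hand it a proxy lock.

	/* Hypothetical: remember which task runs the BH pool's callbacks.
	 * Would be called with pool->lock held, e.g. from the
	 * worker_leave_idle()/worker_enter_idle() paths of bh_worker(). */
	static void bh_worker_note_runner(struct worker *worker, bool running)
	{
		lockdep_assert_held(&worker->pool->lock);
		worker->task = running ? current : NULL;
	}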