While a BH work item is being canceled, the core code spins until it
determines that the item has completed. On PREEMPT_RT the spinning relies
on a lock in local_bh_disable() to avoid a live lock if the canceling
thread has a higher priority than the BH-worker and preempts it. This
lock ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and is about to
be removed.

To provide the required synchronisation, add a per-pool lock. The lock is
acquired by the bh_worker at the beginning and held while the individual
callbacks are invoked. To enforce progress in case of interruption,
__flush_work() needs to acquire the lock.
This will flush all BH-work items assigned to that pool.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 42 insertions(+), 9 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c31..94e226f637992 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -222,7 +222,9 @@ struct worker_pool {
struct workqueue_attrs *attrs; /* I: worker attributes */
struct hlist_node hash_node; /* PL: unbound_pool_hash node */
int refcnt; /* PL: refcnt for unbound pools */
-
+#ifdef CONFIG_PREEMPT_RT
+ spinlock_t cb_lock; /* BH worker cancel lock */
+#endif
/*
* Destruction of pool is RCU protected to allow dereferences
* from get_work_pool().
@@ -3078,6 +3080,31 @@ __acquires(&pool->lock)
goto restart;
}
+#ifdef CONFIG_PREEMPT_RT
+static void worker_lock_callback(struct worker_pool *pool)
+{
+ spin_lock(&pool->cb_lock);
+}
+
+static void worker_unlock_callback(struct worker_pool *pool)
+{
+ spin_unlock(&pool->cb_lock);
+}
+
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
+{
+ spin_lock(&pool->cb_lock);
+ spin_unlock(&pool->cb_lock);
+}
+
+#else
+
+static void worker_lock_callback(struct worker_pool *pool) { }
+static void worker_unlock_callback(struct worker_pool *pool) { }
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool) { }
+
+#endif
+
/**
* manage_workers - manage worker pool
* @worker: self
@@ -3557,6 +3584,7 @@ static void bh_worker(struct worker *worker)
int nr_restarts = BH_WORKER_RESTARTS;
unsigned long end = jiffies + BH_WORKER_JIFFIES;
+ worker_lock_callback(pool);
raw_spin_lock_irq(&pool->lock);
worker_leave_idle(worker);
@@ -3585,6 +3613,7 @@ static void bh_worker(struct worker *worker)
worker_enter_idle(worker);
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);
+ worker_unlock_callback(pool);
}
/*
@@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
(data & WORK_OFFQ_BH)) {
/*
* On RT, prevent a live lock when %current preempted
- * soft interrupt processing or prevents ksoftirqd from
- * running by keeping flipping BH. If the BH work item
- * runs on a different CPU then this has no effect other
- * than doing the BH disable/enable dance for nothing.
- * This is copied from
- * kernel/softirq.c::tasklet_unlock_spin_wait().
+ * soft interrupt processing by blocking on lock which
+ * is owned by the thread invoking the callback.
*/
while (!try_wait_for_completion(&barr.done)) {
if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
- local_bh_disable();
- local_bh_enable();
+ struct worker_pool *pool;
+
+ mutex_lock(&wq_pool_mutex);
+ pool = get_work_pool(work);
+ if (pool)
+ workqueue_callback_cancel_wait_running(pool);
+ mutex_unlock(&wq_pool_mutex);
} else {
cpu_relax();
}
@@ -4782,6 +4812,9 @@ static int init_worker_pool(struct worker_pool *pool)
ida_init(&pool->worker_ida);
INIT_HLIST_NODE(&pool->hash_node);
pool->refcnt = 1;
+#ifdef CONFIG_PREEMPT_RT
+ spin_lock_init(&pool->cb_lock);
+#endif
/* shouldn't fail above this point */
pool->attrs = alloc_workqueue_attrs();
--
2.51.0
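
Condensed, non-verbatim sketch of the handshake the patch adds (the helper
and field names are taken from the diff above, but the worker loop is
elided and simplified, so treat this as an illustration rather than the
exact kernel code):

/* Worker side (PREEMPT_RT): hold the per-pool cb_lock for the whole
 * batch of callback invocations.  cb_lock is a sleeping, PI-aware
 * spinlock_t on RT. */
static void bh_worker(struct worker *worker)
{
	struct worker_pool *pool = worker->pool;

	worker_lock_callback(pool);	/* spin_lock(&pool->cb_lock) on RT */
	raw_spin_lock_irq(&pool->lock);
	/* ... dequeue and invoke the pending BH work items ... */
	raw_spin_unlock_irq(&pool->lock);
	worker_unlock_callback(pool);	/* spin_unlock(&pool->cb_lock) on RT */
}

/* Canceler side: a lock/unlock pair means "wait until the worker's
 * current batch has finished".  If the canceler preempted the BH worker,
 * blocking on cb_lock PI-boosts the worker so it can make progress. */
static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
{
	spin_lock(&pool->cb_lock);
	spin_unlock(&pool->cb_lock);
}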
On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> (data & WORK_OFFQ_BH)) {
> /*
> * On RT, prevent a live lock when %current preempted
> - * soft interrupt processing or prevents ksoftirqd from
> - * running by keeping flipping BH. If the BH work item
> - * runs on a different CPU then this has no effect other
> - * than doing the BH disable/enable dance for nothing.
> - * This is copied from
> - * kernel/softirq.c::tasklet_unlock_spin_wait().
> + * soft interrupt processing by blocking on lock which
> + * is owned by the thread invoking the callback.
> */
> while (!try_wait_for_completion(&barr.done)) {
> if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> - local_bh_disable();
> - local_bh_enable();
> + struct worker_pool *pool;
> +
> + mutex_lock(&wq_pool_mutex);
> + pool = get_work_pool(work);
> + if (pool)
> + workqueue_callback_cancel_wait_running(pool);
> + mutex_unlock(&wq_pool_mutex);

The goal is to avoid using a potentially sleeping function in
__flush_work() for BH work items on PREEMPT_RT, but
mutex_lock(&wq_pool_mutex) is not appropriate in this context.

To obtain the pool of a work item, the preferred approach is to use
rcu_read_lock() together with get_work_pool(work), as is done in
start_flush_work().

> } else {
> cpu_relax();
> }
On Wed, Sep 03, 2025 at 03:30:08PM +0800, Lai Jiangshan wrote:
> On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>
> > @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> > (data & WORK_OFFQ_BH)) {
> > /*
> > * On RT, prevent a live lock when %current preempted
> > - * soft interrupt processing or prevents ksoftirqd from
> > - * running by keeping flipping BH. If the BH work item
> > - * runs on a different CPU then this has no effect other
> > - * than doing the BH disable/enable dance for nothing.
> > - * This is copied from
> > - * kernel/softirq.c::tasklet_unlock_spin_wait().
> > + * soft interrupt processing by blocking on lock which
> > + * is owned by the thread invoking the callback.
> > */
> > while (!try_wait_for_completion(&barr.done)) {
> > if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > - local_bh_disable();
> > - local_bh_enable();
> > + struct worker_pool *pool;
> > +
> > + mutex_lock(&wq_pool_mutex);
> > + pool = get_work_pool(work);
> > + if (pool)
> > + workqueue_callback_cancel_wait_running(pool);
> > + mutex_unlock(&wq_pool_mutex);
>
> The goal is to avoid using a potentially sleeping function in
> __flush_work() for BH work items on PREEMPT_RT, but
> mutex_lock(&wq_pool_mutex) is not appropriate in this context.
>
> To obtain the pool of a work item, the preferred approach is to use
> rcu_read_lock() together with get_work_pool(work), as is done in
> start_flush_work().

Yeah, Sebastian, can you please switch it to rcu_read_lock()?

Thanks.

--
tejun
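
A minimal sketch of the hunk with the change suggested above, assuming
the rest of the posted patch stays as is: the pool is looked up under
rcu_read_lock(), as start_flush_work() does, instead of taking
wq_pool_mutex. On PREEMPT_RT the pool's cb_lock is a sleeping spinlock,
which is still permitted inside a (preemptible) RCU read-side section.
This only illustrates the reviewers' suggested direction, not the
author's final patch:

	while (!try_wait_for_completion(&barr.done)) {
		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
			struct worker_pool *pool;

			/* get_work_pool() only requires the RCU read lock. */
			rcu_read_lock();
			pool = get_work_pool(work);
			if (pool)
				workqueue_callback_cancel_wait_running(pool);
			rcu_read_unlock();
		} else {
			cpu_relax();
		}
	}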
Hello

On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> While a BH work item is being canceled, the core code spins until it
> determines that the item has completed. On PREEMPT_RT the spinning relies
> on a lock in local_bh_disable() to avoid a live lock if the canceling
> thread has a higher priority than the BH-worker and preempts it. This
> lock ensures that the BH-worker makes progress by PI-boosting it.
>
> This lock in local_bh_disable() is a central per-CPU BKL and is about to
> be removed.
>
> To provide the required synchronisation, add a per-pool lock. The lock is
> acquired by the bh_worker at the beginning and held while the individual
> callbacks are invoked. To enforce progress in case of interruption,
> __flush_work() needs to acquire the lock.
> This will flush all BH-work items assigned to that pool.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c6b79b3675c31..94e226f637992 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -222,7 +222,9 @@ struct worker_pool {
> struct workqueue_attrs *attrs; /* I: worker attributes */
> struct hlist_node hash_node; /* PL: unbound_pool_hash node */
> int refcnt; /* PL: refcnt for unbound pools */
> -
> +#ifdef CONFIG_PREEMPT_RT
> + spinlock_t cb_lock; /* BH worker cancel lock */
> +#endif
> /*

Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
and rt_mutex_wait_proxy_lock()?

Or is it possible to add something like rt_spinlock_init_proxy_locked(),
rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I think they will work as an rt variant of struct completion and
they can be used for __flush_work() for BH work for preempt_rt
in the same way as wait_for_completion() is used for normal work.

Thanks
Lai
On 2025-09-02 18:12:10 [+0800], Lai Jiangshan wrote:
> Hello
Hi Lai,

> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -222,7 +222,9 @@ struct worker_pool {
> > struct workqueue_attrs *attrs; /* I: worker attributes */
> > struct hlist_node hash_node; /* PL: unbound_pool_hash node */
> > int refcnt; /* PL: refcnt for unbound pools */
> > -
> > +#ifdef CONFIG_PREEMPT_RT
> > + spinlock_t cb_lock; /* BH worker cancel lock */
> > +#endif
> > /*
>
> Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> and rt_mutex_wait_proxy_lock()?
>
> Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I don't think so. I think non-sleep context is the killer part. Those
are for PI and this works by assigning waiter's priority, going to sleep
until "it" is done. Now if you want non-sleep then you would have to
remain on the CPU and spin until the "work" is done. This spinning would
work if the other task is on a remote CPU. But if both are on the same
CPU then spinning is not working.

> I think they will work as an rt variant of struct completion and
> they can be used for __flush_work() for BH work for preempt_rt
> in the same way as wait_for_completion() is used for normal work.

I completely dislike the idea of spinning until completion for the BH
work. Yes, we have three tasklet users which are doing this as of today.
We have no users which require this for workqueue, and you can queue
work from interrupt context. Why do we even want this?

My idea was to submit this and then hide it later behind an "atomic" API
which will hopefully remain unused. As explained in the previous thread:
from RT's point of view it requires acquiring and dropping the cb_lock
each time and boosting all items on that worker until the "barrier" work
item arrives, while in general only the "one" item is boosted if needed.

> Thanks
> Lai

Sebastian
Hello, Sebastian

On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > and rt_mutex_wait_proxy_lock()?
> >
> > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
>
> I don't think so. I think non-sleep context is the killer part. Those
> are for PI and this works by assigning waiter's priority, going to sleep
> until "it" is done. Now if you want non-sleep then you would have to
> remain on the CPU and spin until the "work" is done. This spinning would
> work if the other task is on a remote CPU. But if both are on the same
> CPU then spinning is not working.
>

I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
work similarly to the rt_mutex proxy lock, which would wait until the
boosted task (in this case, the kthread running the BH work) calls
rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
version of spin_lock, where the task blocked on a spin_lock has a
special style of blocked/sleep instead of spinning on the CPU and this
is what the prefix "rt_spinlock" means.

By the way, I’m not objecting to this patch — I just want to explore
whether there might be other options.

Thanks
Lai
On 2025-09-02 22:19:26 [+0800], Lai Jiangshan wrote:
> Hello, Sebastian
Hi Lai,

> On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> > > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > > and rt_mutex_wait_proxy_lock()?
> > >
> > > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
> >
> > I don't think so. I think non-sleep context is the killer part. Those
> > are for PI and this works by assigning waiter's priority, going to sleep
> > until "it" is done. Now if you want non-sleep then you would have to
> > remain on the CPU and spin until the "work" is done. This spinning would
> > work if the other task is on a remote CPU. But if both are on the same
> > CPU then spinning is not working.
> >
> I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
> work similarly to the rt_mutex proxy lock, which would wait until the
> boosted task (in this case, the kthread running the BH work) calls
> rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
> version of spin_lock, where the task blocked on a spin_lock has a
> special style of blocked/sleep instead of spinning on the CPU and this
> is what the prefix "rt_spinlock" means.

That interface actually implements the boosting for users which don't
use it directly. By "it" I mean the proper rtmutex.

This is used by the PI/futex code where an rtmutex is created as a
substitute for the lock that is held by the user in userland. The lock
is acquired in userland without the kernel's involvement. So in case of
contention the user goes to the kernel and creates an rtmutex as a
representation of the userland lock in the kernel and assigns it to the
task that is holding the userland lock. Now you can block on the lock
and the userland task is forced to go to the kernel for unlocking.

For RCU, as far as I remember, a task within an RCU read-side section
can get preempted and in that case it adds itself to a list during
schedule() so it can be identified later on. There can be more than one
task which is preempted within an RCU section and so blocks a GP. The
boost mutex is assigned to the first task currently blocking the GP and
then to the next one if needed.
A poor substitute would be something like taking a lock during
schedule() and having a list of all those locks in case boosting is
needed so they can be acquired one by one.

> By the way, I’m not objecting to this patch — I just want to explore
> whether there might be other options.

Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
would need to assign the "acquired" lock to the bh_worker() task in
__flush_work() and then block on that lock in __flush_work(). In order
to figure out which task it is, you need some bookkeeping for it, and a
lock to synchronise adding/removing tasks on that list (bookkeeping) as
well as accessing the lock itself in case of "contention".
So given all this, that approach looks slightly more complicated. You
would avoid the need to acquire the lock in bh_worker() but you would
also substitute it with bookkeeping and its locking elsewhere. So I am
not sure it is worth it.

On !RT you can have only one running softirq at a time. On RT, with the
removal of the lock in local_bh_disable() (patch #3), there can be
multiple softirq instances in parallel on the same CPU. The primary
goal is to avoid a central bottleneck and make it possible to have one
NIC doing throughput and another NIC doing low-latency packets,
allowing the low-latency NIC to preempt the throughput NIC in the
middle of sending/receiving packets instead of waiting for NAPI to do
a handover. The lock I'm adding here provides some synchronisation.

So you see how this requirement for the three legacy users makes it
slightly more complicated, especially after the cleanup years ago…
However I hope to now come up with an atomic API, as Tejun suggested,
and push it behind Kconfig bars or so.

> Thanks
> Lai

Sebastian
On Tue, Sep 2, 2025 at 11:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> That interface actually implements the boosting for users which don't
> use it directly. By "it" I mean the proper rtmutex.
>
> This is used by the PI/futex code where an rtmutex is created as a
> substitute for the lock that is held by the user in userland. The lock
> is acquired in userland without the kernel's involvement. So in case of
> contention the user goes to the kernel and creates an rtmutex as a
> representation of the userland lock in the kernel and assigns it to the
> task that is holding the userland lock. Now you can block on the lock
> and the userland task is forced to go to the kernel for unlocking.
>
> For RCU, as far as I remember, a task within an RCU read-side section
> can get preempted and in that case it adds itself to a list during
> schedule() so it can be identified later on. There can be more than one
> task which is preempted within an RCU section and so blocks a GP. The
> boost mutex is assigned to the first task currently blocking the GP and
> then to the next one if needed.
> A poor substitute would be something like taking a lock during
> schedule() and having a list of all those locks in case boosting is
> needed so they can be acquired one by one.
>

Well explained why a “proxy” lock is needed in these two cases — thanks
a lot.

> Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
> would need to assign the "acquired" lock to the bh_worker() task in
> __flush_work() and then block on that lock in __flush_work(). In order
> to figure out which task it is, you need some bookkeeping for it, and a
> lock to synchronise adding/removing tasks on that list (bookkeeping) as
> well as accessing the lock itself in case of "contention".
> So given all this, that approach looks slightly more complicated. You
> would avoid the need to acquire the lock in bh_worker() but you would
> also substitute it with bookkeeping and its locking elsewhere. So I am
> not sure it is worth it.
>
> On !RT you can have only one running softirq at a time. On RT, with the
> removal of the lock in local_bh_disable() (patch #3), there can be
> multiple softirq instances in parallel on the same CPU. The primary
> goal is to avoid a central bottleneck and make it possible to have one
> NIC doing throughput and another NIC doing low-latency packets,
> allowing the low-latency NIC to preempt the throughput NIC in the
> middle of sending/receiving packets instead of waiting for NAPI to do
> a handover.
>

The bookkeeping isn’t necessarily complicated. For bh_worker() on
PREEMPT_RT, it could be done in worker_leave_idle(worker) with
pool->lock already held. worker->task could be repurposed to record the
task running bh_worker(), which I think would also be useful for
debugging even on !RT.

That said, since a proxy lock isn’t appropriate here, let’s just leave
it aside.
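
A hypothetical sketch of the bookkeeping described above (the helper name
and its exact call sites are assumptions, not taken from any posted
patch): record the task currently executing a BH pool's callbacks so a
canceler could locate it, e.g. to hand it a proxy lock.

	/* Hypothetical: remember which task runs the BH pool's callbacks.
	 * Would be called with pool->lock held, e.g. from the
	 * worker_leave_idle()/worker_enter_idle() paths of bh_worker(). */
	static void bh_worker_note_runner(struct worker *worker, bool running)
	{
		lockdep_assert_held(&worker->pool->lock);
		worker->task = running ? current : NULL;
	}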