The tasklet_unlock_spin_wait() via tasklet_disable_in_atomic() is
provided for a few legacy tasklet users. The interface is used from
atomic context (which is either softirq or disabled preemption) on
non-PREEMPT_RT and relies on spinning until the tasklet callback
completes.
On PREEMPT_RT the context is never atomic but the busy polling logic
remains. It is possible that the thread invoking
tasklet_unlock_spin_wait() has higher priority than the tasklet. If both
run on the same CPU, the tasklet makes no progress and the thread trying
to cancel the tasklet will live-lock the system.
To avoid the lockup, tasklet_unlock_spin_wait() uses local_bh_disable()/
enable(), which utilizes the local_lock_t for synchronisation. This lock
is a central per-CPU BKL and is about to be removed.
Acquire a lock in tasklet_action_common() which is held while the
tasklet's callback is invoked. This lock will be acquired from
tasklet_unlock_spin_wait() via tasklet_callback_cancel_wait_running().
After the tasklet has completed, tasklet_callback_sync_wait_running()
drops the lock and acquires it again. To avoid dropping and re-acquiring
the lock when there is no cancel request, a cb_waiters counter is
incremented while a cancel request is pending; the lock is only cycled
if the counter is non-zero.
Blocking on the lock will PI-boost the tasklet if needed, ensuring
progress is made.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/softirq.c | 60 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 55 insertions(+), 5 deletions(-)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 513b1945987cc..57a84758b8459 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -805,6 +805,56 @@ static bool tasklet_clear_sched(struct tasklet_struct *t)
return false;
}
+#ifdef CONFIG_PREEMPT_RT
+struct tasklet_sync_callback {
+ spinlock_t cb_lock;
+ atomic_t cb_waiters;
+};
+
+static DEFINE_PER_CPU(struct tasklet_sync_callback, tasklet_sync_callback) = {
+ .cb_lock = __SPIN_LOCK_UNLOCKED(tasklet_sync_callback.cb_lock),
+ .cb_waiters = ATOMIC_INIT(0),
+};
+
+static void tasklet_lock_callback(void)
+{
+ spin_lock(this_cpu_ptr(&tasklet_sync_callback.cb_lock));
+}
+
+static void tasklet_unlock_callback(void)
+{
+ spin_unlock(this_cpu_ptr(&tasklet_sync_callback.cb_lock));
+}
+
+static void tasklet_callback_cancel_wait_running(void)
+{
+ struct tasklet_sync_callback *sync_cb = this_cpu_ptr(&tasklet_sync_callback);
+
+ atomic_inc(&sync_cb->cb_waiters);
+ spin_lock(&sync_cb->cb_lock);
+ atomic_dec(&sync_cb->cb_waiters);
+ spin_unlock(&sync_cb->cb_lock);
+}
+
+static void tasklet_callback_sync_wait_running(void)
+{
+ struct tasklet_sync_callback *sync_cb = this_cpu_ptr(&tasklet_sync_callback);
+
+ if (atomic_read(&sync_cb->cb_waiters)) {
+ spin_unlock(&sync_cb->cb_lock);
+ spin_lock(&sync_cb->cb_lock);
+ }
+}
+
+#else
+
+static void tasklet_lock_callback(void) { }
+static void tasklet_unlock_callback(void) { }
+static void tasklet_callback_cancel_wait_running(void) { }
+static void tasklet_callback_sync_wait_running(void) { }
+
+#endif
+
static void tasklet_action_common(struct tasklet_head *tl_head,
unsigned int softirq_nr)
{
@@ -816,6 +866,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
tl_head->tail = &tl_head->head;
local_irq_enable();
+ tasklet_lock_callback();
while (list) {
struct tasklet_struct *t = list;
@@ -835,6 +886,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
}
}
tasklet_unlock(t);
+ tasklet_callback_sync_wait_running();
continue;
}
tasklet_unlock(t);
@@ -847,6 +899,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
__raise_softirq_irqoff(softirq_nr);
local_irq_enable();
}
+ tasklet_unlock_callback();
}
static __latent_entropy void tasklet_action(void)
@@ -897,12 +950,9 @@ void tasklet_unlock_spin_wait(struct tasklet_struct *t)
/*
* Prevent a live lock when current preempted soft
* interrupt processing or prevents ksoftirqd from
- * running. If the tasklet runs on a different CPU
- * then this has no effect other than doing the BH
- * disable/enable dance for nothing.
+ * running.
*/
- local_bh_disable();
- local_bh_enable();
+ tasklet_callback_cancel_wait_running();
} else {
cpu_relax();
}
--
2.50.1
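For illustration, the handshake added in the patch above reduces to the
following interleaving. This is a sketch only, not part of the submission;
all names are those introduced by the patch.

/*
 * Illustrative interleaving of the handshake on one CPU:
 *
 *   tasklet_action_common()              canceling thread (higher prio)
 *   -----------------------              ------------------------------
 *   tasklet_lock_callback()
 *     spin_lock(&cb_lock)
 *   t->func(t->data)                     tasklet_unlock_spin_wait()
 *                                          tasklet_callback_cancel_wait_running()
 *                                            atomic_inc(&cb_waiters)
 *                                            spin_lock(&cb_lock)   <- blocks, PI-boosts holder
 *   tasklet_unlock(t)
 *   tasklet_callback_sync_wait_running()
 *     cb_waiters != 0:
 *     spin_unlock(&cb_lock)  ---------->     acquires the lock,
 *     spin_lock(&cb_lock)                    atomic_dec(&cb_waiters), unlocks,
 *                                            outer loop sees TASKLET_STATE_RUN cleared
 */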
Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.17-rc1 next-20250813]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/softirq-Provide-a-handshake-for-canceling-tasklets-via-polling-on-PREEMPT_RT/20250812-224928
base:   linus/master
patch link:    https://lore.kernel.org/r/20250812143930.22RBn5BW%40linutronix.de
patch subject: [PATCH] softirq: Provide a handshake for canceling tasklets via polling on PREEMPT_RT
config: csky-randconfig-002-20250813 (https://download.01.org/0day-ci/archive/20250813/202508131652.C4118cYh-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 13.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250813/202508131652.C4118cYh-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508131652.C4118cYh-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/softirq.c:853:13: warning: 'tasklet_callback_cancel_wait_running' defined but not used [-Wunused-function]
     853 | static void tasklet_callback_cancel_wait_running(void) { }
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vim +/tasklet_callback_cancel_wait_running +853 kernel/softirq.c

   850	
   851	static void tasklet_lock_callback(void) { }
   852	static void tasklet_unlock_callback(void) { }
 > 853	static void tasklet_callback_cancel_wait_running(void) { }
   854	static void tasklet_callback_sync_wait_running(void) { }
   855	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
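The warning presumably comes from a configuration where
tasklet_unlock_spin_wait() itself is only a header stub (neither SMP nor
PREEMPT_RT), leaving the new !RT stub without a caller. A minimal sketch of
one possible way to silence it, assuming the plain stubs may simply become
static inline; this is not the fix actually posted in this thread:

/*
 * Assumption: turning the !RT stubs into static inline functions is
 * acceptable. Unused static inline functions do not trigger
 * -Wunused-function, unlike plain static functions.
 */
static inline void tasklet_lock_callback(void) { }
static inline void tasklet_unlock_callback(void) { }
static inline void tasklet_callback_cancel_wait_running(void) { }
static inline void tasklet_callback_sync_wait_running(void) { }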
On 2025-08-12 16:39:32 [+0200], To linux-kernel@vger.kernel.org wrote:
> The tasklet_unlock_spin_wait() via tasklet_disable_in_atomic() is
> provided for a few legacy tasklet users. The interface is used from
> atomic context (which is either softirq or disabled preemption) on
> non-PREEMPT_RT and relies on spinning until the tasklet callback
> completes.
> On PREEMPT_RT the context is never atomic but the busy polling logic
> remains. It is possible that the thread invoking
> tasklet_unlock_spin_wait() has higher priority than the tasklet. If both
> run on the same CPU, the tasklet makes no progress and the thread trying
> to cancel the tasklet will live-lock the system.
> To avoid the lockup, tasklet_unlock_spin_wait() uses local_bh_disable()/
> enable(), which utilizes the local_lock_t for synchronisation. This lock
> is a central per-CPU BKL and is about to be removed.
>
> Acquire a lock in tasklet_action_common() which is held while the
> tasklet's callback is invoked. This lock will be acquired from
> tasklet_unlock_spin_wait() via tasklet_callback_cancel_wait_running().
> After the tasklet has completed, tasklet_callback_sync_wait_running()
> drops the lock and acquires it again. To avoid dropping and re-acquiring
> the lock when there is no cancel request, a cb_waiters counter is
> incremented while a cancel request is pending.
> Blocking on the lock will PI-boost the tasklet if needed, ensuring
> progress is made.
>

Tejun, Lai,

I noticed that the BH part of workqueue also relies on this mechanism
(__flush_work(), the PREEMPT_RT ifdef). This is a fairly recent API so
there should be no "legacy" users as we have in the tasklet interface.
The majority of users use tasklet_kill() (or seldom
tasklet_unlock_wait()) and not tasklet_unlock_spin_wait(). The plan was
to get rid of the spinning API but I didn't manage to get rid of all
users, especially since some of the code could not be sanely/safely
converted (+tested).

Does the workqueue-BH code require the canceling from atomic context or
was this just added because the API for BH and non-BH work items is the
same and __cancel_work_sync() allows it?
Could we avoid the busy-waiting for BH work items and rely on the
wait_for_completion() below, or do we need something similar to what I
added here for the tasklet API?

Sebastian
Hello, On Tue, Aug 12, 2025 at 04:53:59PM +0200, Sebastian Andrzej Siewior wrote: > Does the workqueue-BH code require the canceling from atomic context or > was this just added because the API for BH and non-BH work items is the > same and __cancel_work_sync() allows it? > Could we avoid the busy-waiting for BH work items and rely on the > wait_for_completion() below or do we need something similar to what I > added here for the tasklet API? The intention is to convert all BH users to workqueue-BH and remove BH (that's what Linus wants and why workqueue-BH came to be), so the APIs should be able to match up, I'm afraid. There were some attempts at pushing the conversion but we've only made minimal progress. If you're looking at BH users anyway and feel like it, please feel free to convert them. Thanks. -- tejun
On 2025-08-12 09:38:50 [-1000], Tejun Heo wrote:
> Hello,
Hi Tejun,

> On Tue, Aug 12, 2025 at 04:53:59PM +0200, Sebastian Andrzej Siewior wrote:
> > Does the workqueue-BH code require the canceling from atomic context or
> > was this just added because the API for BH and non-BH work items is the
> > same and __cancel_work_sync() allows it?
> > Could we avoid the busy-waiting for BH work items and rely on the
> > wait_for_completion() below or do we need something similar to what I
> > added here for the tasklet API?
>
> The intention is to convert all BH users to workqueue-BH and remove BH
> (that's what Linus wants and why workqueue-BH came to be), so the APIs
> should be able to match up, I'm afraid. There were some attempts at pushing
> the conversion but we've only made minimal progress. If you're looking at BH
> users anyway and feel like it, please feel free to convert them.

I understand this but I am talking about legacy users:

| drivers/atm/eni.c:                        tasklet_disable_in_atomic(&ENI_DEV(vcc->dev)->task);
| drivers/net/wireless/ath/ath9k/beacon.c:  tasklet_disable_in_atomic(&sc->bcon_tasklet);
| drivers/pci/controller/pci-hyperv.c:      tasklet_disable_in_atomic(&channel->callback_event);

This is what is left. (There is also i915 but this is "special").
So we are talking about establishing an API and behaviour for those here
after we painfully managed converting everyone else away:

| git grep 'tasklet_unlock_wait([^s]' | wc -l
| 5
| git grep 'tasklet_disable([^s]' | wc -l
| 97
| git grep 'tasklet_kill([^s]' | wc -l
| 304

While I think it could be possible with upstream's help to avoid the
in-atomic bits for ath9k and hyperv, I lost all hope for the ATM driver.

> Thanks.

Sebastian
Hello, On Wed, Aug 13, 2025 at 08:33:11AM +0200, Sebastian Andrzej Siewior wrote: ... > > The intention is to convert all BH users to workqueue-BH and remove BH > > (that's what Linus wants and why workqueue-BH came to be), so the APIs > > should be able to match up, I'm afraid. There were some attempts at pushing > > the conversion but we've only made minimal progress. If you're looking at BH > > users anyway and feel like it, please feel free to convert them. > > I understand this but I am talking about legacy users: > > | drivers/atm/eni.c: tasklet_disable_in_atomic(&ENI_DEV(vcc->dev)->task); > | drivers/net/wireless/ath/ath9k/beacon.c: tasklet_disable_in_atomic(&sc->bcon_tasklet); > | drivers/pci/controller/pci-hyperv.c: tasklet_disable_in_atomic(&channel->callback_event); > > This is what is left. (There is also i915 but this is "special"). > So we are talking about establishing an API and behaviour for those here > after we painfully managed converting everyone else away: Right, given how early in conversion, we can definitely leave this as something to think about later. I have no objection to leave it be for now. Thanks. -- tejun
On 2025-08-13 08:05:34 [-1000], Tejun Heo wrote:
> Hello,
Hi Tejun,

> On Wed, Aug 13, 2025 at 08:33:11AM +0200, Sebastian Andrzej Siewior wrote:
> ...
> > > The intention is to convert all BH users to workqueue-BH and remove BH
> > > (that's what Linus wants and why workqueue-BH came to be), so the APIs
> > > should be able to match up, I'm afraid. There were some attempts at pushing
> > > the conversion but we've only made minimal progress. If you're looking at BH
> > > users anyway and feel like it, please feel free to convert them.
> >
> > I understand this but I am talking about legacy users:
> >
> > | drivers/atm/eni.c:                        tasklet_disable_in_atomic(&ENI_DEV(vcc->dev)->task);
> > | drivers/net/wireless/ath/ath9k/beacon.c:  tasklet_disable_in_atomic(&sc->bcon_tasklet);
> > | drivers/pci/controller/pci-hyperv.c:      tasklet_disable_in_atomic(&channel->callback_event);
> >
> > This is what is left. (There is also i915 but this is "special").
> > So we are talking about establishing an API and behaviour for those here
> > after we painfully managed converting everyone else away:
>
> Right, given how early in conversion, we can definitely leave this as
> something to think about later. I have no objection to leave it be for now.

Okay. Do I need to update __flush_work() in any way to make it obvious?
The local_bh_disable()/local_bh_enable() will become a nop in this
regard and should be removed.
It would be the revert of commit 134874e2eee93 ("workqueue: Allow
cancel_work_sync() and disable_work() from atomic contexts on BH work
items"). The commit added the possibility to flush BH work from atomic
context but it is unclear if there is already a requirement for this or
if it was to match the legacy part of the tasklet API.

> Thanks.

Sebastian
Hello, On Mon, Aug 18, 2025 at 02:52:42PM +0200, Sebastian Andrzej Siewior wrote: ... > > Right, given how early in conversion, we can definitely leave this as > > something to think about later. I have no objection to leave it be for now. > > Okay. Do I need to update __flush_work() in anyway to make it obvious? > The local_bh_disable()/ local_bh_enable() will become a nop in this > regard and should be removed. > It would be the revert of commit 134874e2eee93 ("workqueue: Allow > cancel_work_sync() and disable_work() from atomic contexts on BH work > items"). The commit added the possibility to flush BH work from atomic > context but it is unclear if there already a requirement for this or if > it was to match the legacy part of the tasklet API. I see. Can I backtrack? If it doesn't require too invasive changes, let's just keep the two in sync. I'll get back to conversions so that we can actually achieve the goal eventually and it'll probably be more confusing if we revert that and try to redo it later. Thanks. -- tejun
On 2025-08-18 07:41:06 [-1000], Tejun Heo wrote: > Hello, Hi Tejun, > I see. Can I backtrack? If it doesn't require too invasive changes, let's > just keep the two in sync. I'll get back to conversions so that we can > actually achieve the goal eventually and it'll probably be more confusing if > we revert that and try to redo it later. Okay. Then let me repost the tasklet patch and make one for workqueue to stay in sync. I do hope that we end up with a requirement that any kind of teardown does not happen from an atomic context ;) > Thanks. > Sebastian
On 2025-08-19 17:01:07 [+0200], To Tejun Heo wrote:
> Okay. Then let me repost the tasklet patch and make one for workqueue to
> stay in sync.
> I do hope that we end up with a requirement that any kind of teardown
> does not happen from an atomic context ;)

That would be

------------->8-------------

Subject: [PATCH] workqueue: Provide a handshake for canceling BH workers

While a BH work item is canceled, the core code spins until it
determines that the item completed. On PREEMPT_RT the spinning relies
on a lock in local_bh_disable() to avoid a live lock if the canceling
thread has higher priority than the BH-worker and preempts it. This
lock ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and is about
to be removed.

To provide the required synchronisation add a per-pool lock. The lock
is acquired by bh_worker() at the beginning and is held while the
individual callbacks are invoked. To enforce progress in case of
interruption, __flush_work() needs to acquire the lock. This will flush
all BH-work items assigned to that pool.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c31..94e226f637992 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -222,7 +222,9 @@ struct worker_pool {
 	struct workqueue_attrs	*attrs;		/* I: worker attributes */
 	struct hlist_node	hash_node;	/* PL: unbound_pool_hash node */
 	int			refcnt;		/* PL: refcnt for unbound pools */
-
+#ifdef CONFIG_PREEMPT_RT
+	spinlock_t		cb_lock;	/* BH worker cancel lock */
+#endif
 	/*
 	 * Destruction of pool is RCU protected to allow dereferences
 	 * from get_work_pool().
@@ -3078,6 +3080,31 @@ __acquires(&pool->lock)
 	goto restart;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static void worker_lock_callback(struct worker_pool *pool)
+{
+	spin_lock(&pool->cb_lock);
+}
+
+static void worker_unlock_callback(struct worker_pool *pool)
+{
+	spin_unlock(&pool->cb_lock);
+}
+
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
+{
+	spin_lock(&pool->cb_lock);
+	spin_unlock(&pool->cb_lock);
+}
+
+#else
+
+static void worker_lock_callback(struct worker_pool *pool) { }
+static void worker_unlock_callback(struct worker_pool *pool) { }
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool) { }
+
+#endif
+
 /**
  * manage_workers - manage worker pool
  * @worker: self
@@ -3557,6 +3584,7 @@ static void bh_worker(struct worker *worker)
 	int nr_restarts = BH_WORKER_RESTARTS;
 	unsigned long end = jiffies + BH_WORKER_JIFFIES;
 
+	worker_lock_callback(pool);
 	raw_spin_lock_irq(&pool->lock);
 	worker_leave_idle(worker);
 
@@ -3585,6 +3613,7 @@ static void bh_worker(struct worker *worker)
 	worker_enter_idle(worker);
 	kick_pool(pool);
 	raw_spin_unlock_irq(&pool->lock);
+	worker_unlock_callback(pool);
 }
 
 /*
@@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
 		    (data & WORK_OFFQ_BH)) {
 			/*
 			 * On RT, prevent a live lock when %current preempted
-			 * soft interrupt processing or prevents ksoftirqd from
-			 * running by keeping flipping BH. If the BH work item
-			 * runs on a different CPU then this has no effect other
-			 * than doing the BH disable/enable dance for nothing.
-			 * This is copied from
-			 * kernel/softirq.c::tasklet_unlock_spin_wait().
+			 * soft interrupt processing by blocking on lock which
+			 * is owned by the thread invoking the callback.
 			 */
 			while (!try_wait_for_completion(&barr.done)) {
 				if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-					local_bh_disable();
-					local_bh_enable();
+					struct worker_pool *pool;
+
+					mutex_lock(&wq_pool_mutex);
+					pool = get_work_pool(work);
+					if (pool)
+						workqueue_callback_cancel_wait_running(pool);
+					mutex_unlock(&wq_pool_mutex);
 				} else {
 					cpu_relax();
 				}
@@ -4782,6 +4812,9 @@ static int init_worker_pool(struct worker_pool *pool)
 	ida_init(&pool->worker_ida);
 	INIT_HLIST_NODE(&pool->hash_node);
 	pool->refcnt = 1;
+#ifdef CONFIG_PREEMPT_RT
+	spin_lock_init(&pool->cb_lock);
+#endif
 
 	/* shouldn't fail above this point */
 	pool->attrs = alloc_workqueue_attrs();
-- 
2.50.1
On 2025-08-20 12:36:59 [+0200], To Tejun Heo wrote:
> Subject: [PATCH] workqueue: Provide a handshake for canceling BH workers
…
> This will flush all BH-work items assigned to that pool.

We need to flush all items because the inserted wq_barrier is at the
end of the queue. So if the cb_lock is dropped after
worker->current_func(work) then we will live lock. Just tested, I
somehow assumed it polls on the worker.

This behaviour is undesired and goes back to the requirement to be able
to cancel something from an atomic context, which can't be atomic on
PREEMPT_RT to begin with. Since the caller can never be atomic on
PREEMPT_RT, what about the following:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c31..6ce9c980a7966 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4214,28 +4214,16 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
 	 * can't currently be queued. Its data must contain OFFQ bits. If @work
 	 * was queued on a BH workqueue, we also know that it was running in the
 	 * BH context and thus can be busy-waited.
+	 * On PREEMPT_RT the BH context can not be busy-waited because it can be
+	 * preempted by the caller if it has higher priority.
 	 */
-	if (from_cancel) {
+	if (from_cancel && !IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		unsigned long data = *work_data_bits(work);
 
 		if (!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) &&
 		    (data & WORK_OFFQ_BH)) {
-			/*
-			 * On RT, prevent a live lock when %current preempted
-			 * soft interrupt processing or prevents ksoftirqd from
-			 * running by keeping flipping BH. If the BH work item
-			 * runs on a different CPU then this has no effect other
-			 * than doing the BH disable/enable dance for nothing.
-			 * This is copied from
-			 * kernel/softirq.c::tasklet_unlock_spin_wait().
-			 */
 			while (!try_wait_for_completion(&barr.done)) {
-				if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-					local_bh_disable();
-					local_bh_enable();
-				} else {
-					cpu_relax();
-				}
+				cpu_relax();
 			}
 			goto out_destroy;
 		}
@@ -4351,7 +4339,7 @@ static bool __cancel_work_sync(struct work_struct *work, u32 cflags)
 
 	ret = __cancel_work(work, cflags | WORK_CANCEL_DISABLE);
 
-	if (*work_data_bits(work) & WORK_OFFQ_BH)
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && *work_data_bits(work) & WORK_OFFQ_BH)
 		WARN_ON_ONCE(in_hardirq());
 	else
 		might_sleep();

It is not the revert I suggested. This should work for a softirq caller
and from a forced-threaded interrupt (not that I encourage such
behaviour). It will not work from an atomic context such as with a
raw_spinlock_t acquired, but this will also not work with the current
(local_bh_disable() + enable()) solution.
I prefer this because it avoids the locking in bh_worker() and the
flushing of all pending work items in the flush/cancel case.

Sebastian
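For illustration of the barrier placement mentioned above (an explanatory
sketch, not part of the mails): with the first workqueue patch, cb_lock
covers the whole bh_worker() run, so a waiter necessarily waits for
everything queued ahead of its barrier.

/*
 * BH pool queue:   [ work A ] -> [ work B ] -> [ wq_barrier for A ]
 *
 * bh_worker():     cb_lock is held across A, B and the barrier.
 * __flush_work():  blocks on cb_lock and PI-boosts the worker; the
 *                  completion can only be observed once the whole run,
 *                  including the barrier, has finished, i.e. every
 *                  pending BH item of that pool gets flushed.
 */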
Hello, On Wed, Aug 20, 2025 at 12:55:18PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-08-20 12:36:59 [+0200], To Tejun Heo wrote: > > Subject: [PATCH] workqueue: Provide a handshake for canceling BH workers > … > > This will flush all BH-work items assigned to that pool. > > We need to flush all items because the inserted wq_barrier is at the > end of the queue. So if the cb_lock is dropped after > worker->current_func(work) then we will live lock. Just tested, I > somehow assumed it polls on worker. Is flushing all a problem tho? I think the main focus is keeping the semantics matching on RT, right? ... > - if (from_cancel) { > + if (from_cancel && !IS_ENABLED(CONFIG_PREEMPT_RT)) { > unsigned long data = *work_data_bits(work); > > if (!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && > (data & WORK_OFFQ_BH)) { > - /* > - * On RT, prevent a live lock when %current preempted > - * soft interrupt processing or prevents ksoftirqd from > - * running by keeping flipping BH. If the BH work item > - * runs on a different CPU then this has no effect other > - * than doing the BH disable/enable dance for nothing. > - * This is copied from > - * kernel/softirq.c::tasklet_unlock_spin_wait(). > - */ > while (!try_wait_for_completion(&barr.done)) { > - if (IS_ENABLED(CONFIG_PREEMPT_RT)) { > - local_bh_disable(); > - local_bh_enable(); > - } else { > - cpu_relax(); > - } > + cpu_relax(); I'm most likely missing something about RT but wouldn't the above still lead to deadlocks if the caller is non-hardirq but higher priority thread? Thanks. -- tejun
On 2025-08-20 09:44:53 [-1000], Tejun Heo wrote: > Hello, Hi Tejun, > On Wed, Aug 20, 2025 at 12:55:18PM +0200, Sebastian Andrzej Siewior wrote: > > On 2025-08-20 12:36:59 [+0200], To Tejun Heo wrote: > > > Subject: [PATCH] workqueue: Provide a handshake for canceling BH workers > > … > > > This will flush all BH-work items assigned to that pool. > > > > We need to flush all items because the inserted wq_barrier is at the > > end of the queue. So if the cb_lock is dropped after > > worker->current_func(work) then we will live lock. Just tested, I > > somehow assumed it polls on worker. > > Is flushing all a problem tho? It is not wrong as in something will explode. The priority-inheritance boost is meant to give the lower priority task runtime in order to leave its critical section. So the task with the higher priority can continue to make progress instead of sitting around. Spinning while waiting for completion will not succeed. In this case "leaving the critical section" would mean complete the one work item. But instead we flush all of them. It is more of semantics in this case. That is why the looping-cancel in tasklet cancels just that one workitem. > I think the main focus is keeping the > semantics matching on RT, right? Yes, but having the semantics with busy waiting on a BH work is kind of the problem. And there is no need for it which makes it a bit difficult. The previous patch would match the !RT bits but we flush all work, have and the lock for no reason. That is why I don't like it. The majority of tasklet users don't need it. It is in my opinion bad semantics. But if you insist on it, the previous patch will make it work and has been tested. There is not much I can do. > ... > > - if (from_cancel) { > > + if (from_cancel && !IS_ENABLED(CONFIG_PREEMPT_RT)) { > > unsigned long data = *work_data_bits(work); > > > > if (!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && > > (data & WORK_OFFQ_BH)) { > > - /* > > - * On RT, prevent a live lock when %current preempted > > - * soft interrupt processing or prevents ksoftirqd from > > - * running by keeping flipping BH. If the BH work item > > - * runs on a different CPU then this has no effect other > > - * than doing the BH disable/enable dance for nothing. > > - * This is copied from > > - * kernel/softirq.c::tasklet_unlock_spin_wait(). > > - */ > > while (!try_wait_for_completion(&barr.done)) { > > - if (IS_ENABLED(CONFIG_PREEMPT_RT)) { > > - local_bh_disable(); > > - local_bh_enable(); > > - } else { > > - cpu_relax(); > > - } > > + cpu_relax(); > > I'm most likely missing something about RT but wouldn't the above still lead > to deadlocks if the caller is non-hardirq but higher priority thread? Not sure what you refer to. Right now there is this lock in local_bh_disable() which forces PI. Removing the whole section for RT as in this snippet gets us to the wait_for_completion() below. It lets the task with higher priority schedule out allowing task with lower priority to run. Eventually the barrier item completes and with the wakeup the high priority task will continue. So the high-priority task will lose runtime (allowing task with lower priority to run). I don't think it will be a problem because it is mostly used in "quit" scenarios (not during normal workload) and matches tasklet_disable(). > Thanks. Sebastian
Hello, Sebastian. On Thu, Aug 21, 2025 at 11:28:27AM +0200, Sebastian Andrzej Siewior wrote: ... > It is not wrong as in something will explode. The priority-inheritance > boost is meant to give the lower priority task runtime in order to leave > its critical section. So the task with the higher priority can continue > to make progress instead of sitting around. Spinning while waiting for > completion will not succeed. > In this case "leaving the critical section" would mean complete the one > work item. But instead we flush all of them. It is more of semantics in > this case. That is why the looping-cancel in tasklet cancels just that > one workitem. Understood. However, given that these pools are per-cpu, BHs are usually not heavily contended and canceling itself is a low frequency operation, the practical difference likely won't be noticeable. > > I think the main focus is keeping the > > semantics matching on RT, right? > > Yes, but having the semantics with busy waiting on a BH work is kind of > the problem. And there is no need for it which makes it a bit difficult. > The previous patch would match the !RT bits but we flush all work, have > and the lock for no reason. That is why I don't like it. The majority of > tasklet users don't need it. It is in my opinion bad semantics. > > But if you insist on it, the previous patch will make it work and has > been tested. There is not much I can do. Oh, I'm not insisting, don't know enough to do so. Just trying to understand the situation. > > I'm most likely missing something about RT but wouldn't the above still lead > > to deadlocks if the caller is non-hardirq but higher priority thread? > > Not sure what you refer to. Right now there is this lock in > local_bh_disable() which forces PI. > Removing the whole section for RT as in this snippet gets us to the > wait_for_completion() below. It lets the task with higher priority > schedule out allowing task with lower priority to run. Eventually the > barrier item completes and with the wakeup the high priority task will > continue. > So the high-priority task will lose runtime (allowing task with lower > priority to run). I don't think it will be a problem because it is > mostly used in "quit" scenarios (not during normal workload) and matches > tasklet_disable(). Okay, so, on !RT, that busy loop section is there to allow cancel_work_sync() to be called from BH-disabled contexts and the caller is responsible for ensuring there's no recursion. It's not great but matches the existing behavior. Obviously, in !RT, we can't go to wait_for_completion() there because we can be in a non-sleepable context. Are you saying that, in RT, it'd be fine to call wait_for_completion() inside local_bh_disable() and won't trip any of the context warnings? If so, yeah, we don't need any of the looping. Thanks. -- tejun
On 2025-08-21 07:10:42 [-1000], Tejun Heo wrote:
> Hello, Sebastian.
Hello Tejun,

> Oh, I'm not insisting, don't know enough to do so. Just trying to understand
> the situation.

ah okay.

> > > I'm most likely missing something about RT but wouldn't the above still lead
> > > to deadlocks if the caller is non-hardirq but higher priority thread?
> >
> > Not sure what you refer to. Right now there is this lock in
> > local_bh_disable() which forces PI.
> > Removing the whole section for RT as in this snippet gets us to the
> > wait_for_completion() below. It lets the task with higher priority
> > schedule out allowing task with lower priority to run. Eventually the
> > barrier item completes and with the wakeup the high priority task will
> > continue.
> > So the high-priority task will lose runtime (allowing task with lower
> > priority to run). I don't think it will be a problem because it is
> > mostly used in "quit" scenarios (not during normal workload) and matches
> > tasklet_disable().
>
> Okay, so, on !RT, that busy loop section is there to allow
> cancel_work_sync() to be called from BH-disabled contexts and the caller is
> responsible for ensuring there's no recursion. It's not great but matches
> the existing behavior.

Hold on for a sec: existing behaviour for tasklet_unlock_spin_wait(),
which has three users (a fourth one if we count i915, which has its own
tasklet layer). Not something that I would call common or widespread
behaviour in the kernel (and task workqueue does not have it either).
tasklet_disable() and tasklet_kill() both sleep while waiting for
completion and don't spin.

> Obviously, in !RT, we can't go to
> wait_for_completion() there because we can be in a non-sleepable context.

Again, only a small number of users require to do so.

> Are you saying that, in RT, it'd be fine to call wait_for_completion()
> inside local_bh_disable() and won't trip any of the context warnings? If so,
> yeah, we don't need any of the looping.

No, that won't work. local_bh_disable() will start a RCU read section
and then RCU will complain during schedule().
So if the requirement is to cancel a BH workitem from within a BH
disabled section then we need the first patch in this thread.

But if we get rid of this requirement…

> Thanks.

Sebastian
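To spell out the constraint described above, a minimal illustrative snippet
(not code from the thread; it assumes RCU debugging such as PROVE_RCU is
enabled so the violation is actually reported):

/*
 * Why the sleeping path cannot simply be used under local_bh_disable()
 * on PREEMPT_RT: there, local_bh_disable() takes the per-CPU BH
 * local_lock and enters an RCU read-side critical section, so a blocking
 * wait inside it makes RCU complain once schedule() is called.
 */
local_bh_disable();
wait_for_completion(&barr.done);	/* schedules -> RCU read-side splat on RT */
local_bh_enable();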
Hello, Sebastian. On Fri, Aug 22, 2025 at 11:48:12AM +0200, Sebastian Andrzej Siewior wrote: > > Okay, so, on !RT, that busy loop section is there to allow > > cancel_work_sync() to be called from BH-disabled contexts and the caller is > > responsible for ensuring there's no recursion. It's not great but matches > > the existing behavior. > > hold on for for a sec: existing behaviour for tasklet_unlock_spin_wait() > which has three users (a fourth one if we count i915 which has its own > tasklet layer). Not something that I would call common or wide spread > behaviour in the kernel (and task workqueue does not have it either). IIRC, there was a conversion patchset which didn't get pushed to completion and it hit those, so the patch to add the busy wait. > tasklet_disable() and tasklet_kill() both sleep while waiting for > completion and don't spin. > > > Obviously, in !RT, we can't go to > > wait_for_completion() there because we can be in a non-sleepable context. > > Again, only a small amount of users require to do so. > > > Are you saying that, in RT, it'd be fine to call wait_for_completion() > > inside local_bh_disable() and won't trip any of the context warnings? If so, > > yeah, we don't need any of the looping. > > No, that won't work. local_bh_disable() will start a RCU read section > and then RCU will complain during schedule(). > So if the requirement is to cancel a BH workitem from within a BH > disabled section then we need the first patch in this thread. > > But if we get rid of this requirement… Agreed, once we get rid of them, we can drop the whole block for both RT and !RT - ie. revert the patch that added it. But, wouldn't it be more orderly to match the semantics in both cases even if the behavior isn't quite optimal? We can put some comment noting what to do once the culprits are updated to not need it. Thanks. -- tejun
On 2025-08-22 08:07:47 [-1000], Tejun Heo wrote: > Hello, Sebastian. Hi Tejun, > Agreed, once we get rid of them, we can drop the whole block for both RT and > !RT - ie. revert the patch that added it. But, wouldn't it be more orderly > to match the semantics in both cases even if the behavior isn't quite > optimal? We can put some comment noting what to do once the culprits are > updated to not need it. Sure. I am only worried that if something is possible, people will use it. I don't think things will change if we debate for another week ;) The first patch here in this thread should provide the symmetrical API. Oh. We could also hide this polling API behind a special API which is hidden just for three special cases so everyone else would do the right job. > Thanks. Sebastian
Hello, On Tue, Aug 26, 2025 at 05:49:42PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-08-22 08:07:47 [-1000], Tejun Heo wrote: > > Hello, Sebastian. > Hi Tejun, > > > Agreed, once we get rid of them, we can drop the whole block for both RT and > > !RT - ie. revert the patch that added it. But, wouldn't it be more orderly > > to match the semantics in both cases even if the behavior isn't quite > > optimal? We can put some comment noting what to do once the culprits are > > updated to not need it. > > Sure. I am only worried that if something is possible, people will use > it. I don't think things will change if we debate for another week ;) > The first patch here in this thread should provide the symmetrical API. > > Oh. We could also hide this polling API behind a special API which is > hidden just for three special cases so everyone else would do the right > job. Oh yeah, that makes a lot of sense to me - splitting it out into something which is named explicitly to discourage further usages. Thanks. -- tejun
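One way the split could look, purely as a hypothetical naming sketch:
cancel_work_sync() is the existing API, while the second name is invented
here for illustration and exists neither in the thread nor in the tree.

/*
 * Hypothetical sketch only: keep the sleeping API as-is and give the
 * busy-waiting variant a name that flags it as a legacy escape hatch.
 */
bool cancel_work_sync(struct work_struct *work);		/* may sleep */
bool cancel_work_sync_busy_poll(struct work_struct *work);	/* spins; only for the few
								   legacy atomic-context callers */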
On 2025-08-26 06:27:27 [-1000], Tejun Heo wrote:
> Hello,
Hello Tejun,

> Oh yeah, that makes a lot of sense to me - splitting it out into something
> which is named explicitly to discourage further usages.

I am a bit lost now. Do you intend to apply the patch and we come up
with the bh-canceling-from-bh API later on, or what is the plan? I can
repost it of course together with the tasklet patch.

> Thanks.

Sebastian
Hello, Sebastian. On Thu, Aug 28, 2025 at 06:04:55PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-08-26 06:27:27 [-1000], Tejun Heo wrote: > > Oh yeah, that makes a lot of sense to me - splitting it out into something > > which is named explicitly to discourage further usages. > > I am a bit lost now. Do you intend to apply the patch and we came up > with the bh-canceling-from-bh API later on or what is the plan? I'd much prefer if the end result is that the busy waiting is a separate API. Whether that's done in a single patch or incremental patches doesn't really matter. Thanks. -- tejun