Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) will be blocked by long-running lower-priority
tasks (or threaded interrupts), which includes softirq sections.
The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
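To make the intended conversion pattern concrete, here is a minimal
sketch; the struct and function names are made up, only the locking
calls match what the series introduces:

#include <linux/bottom_half.h>
#include <linux/local_lock.h>
#include <linux/percpu.h>

/* Illustrative per-CPU resource that today relies only on BH exclusion. */
struct net_pcpu_cache {
        void            *cache;
        local_lock_t    bh_lock;        /* a real lock only on PREEMPT_RT */
};
static DEFINE_PER_CPU(struct net_pcpu_cache, net_pcpu_cache) = {
        .bh_lock = INIT_LOCAL_LOCK(bh_lock),
};

static void net_pcpu_cache_use(void)
{
        local_bh_disable();
        /* Lockdep annotation only on !PREEMPT_RT, a per-CPU lock on RT. */
        local_lock_nested_bh(&net_pcpu_cache.bh_lock);
        /* ... touch this_cpu_ptr(&net_pcpu_cache)->cache ... */
        local_unlock_nested_bh(&net_pcpu_cache.bh_lock);
        local_bh_enable();
}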
v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
- Jakub complained about touching networking drivers to make the
additional locking work. Alexei complained about the additional
locking within the XDP/eBPF case.
This led to a change in how the per-CPU variables are accessed for the
XDP/eBPF case. On PREEMPT_RT the variables are now stored on stack and
a pointer to the structure is saved in the task_struct, while keeping
everything unchanged for !RT. This was proposed as an RFC in
v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/
and then updated in
v2: https://lore.kernel.org/all/20240229183109.646865-1-bigeasy@linutronix.de/
- Renamed the container struct from xdp_storage to bpf_net_context.
Suggested by Toke Høiland-Jørgensen.
- Use the container struct also on !PREEMPT_RT builds. Store the
pointer to the on-stack struct in a per-CPU variable (see the sketch
after this list). Suggested by Toke Høiland-Jørgensen.
This reduces the initial queue from 24 to 15 patches.
- There were complaints about the scoped_guard() which shifts the whole
block and makes it harder to review because the whole block gets removed
and added again. The usage has been replaced with local_lock_nested_bh()
and its unlock counterpart.
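A rough sketch of the bpf_net_context handling mentioned above; the
member and helper names are illustrative and do not necessarily match
the patches exactly. The struct lives on the caller's stack and only a
pointer to it is published - in the task_struct on PREEMPT_RT and in a
per-CPU variable on !PREEMPT_RT:

struct bpf_net_context {
        struct bpf_redirect_info ri;
        /* ... further per-invocation XDP/eBPF state ... */
};

#ifndef CONFIG_PREEMPT_RT
static DEFINE_PER_CPU(struct bpf_net_context *, bpf_net_ctx_ptr);
#endif

static inline void bpf_net_ctx_set(struct bpf_net_context *ctx)
{
#ifdef CONFIG_PREEMPT_RT
        current->bpf_net_context = ctx; /* assumed new task_struct member */
#else
        this_cpu_write(bpf_net_ctx_ptr, ctx);
#endif
}

/* Typical call site, e.g. before running an XDP program:
 *
 *      struct bpf_net_context ctx;
 *
 *      bpf_net_ctx_set(&ctx);
 *      ... run the program; the redirect state now lives in ctx ...
 *      ... and clear the published pointer again on the way out ...
 */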
Sebastian
On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> a local_bh_disable() section remains preemptible. As a result, high-priority
> tasks (or threaded interrupts) will be blocked by long-running lower-priority
> tasks (or threaded interrupts), which includes softirq sections.
>
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
Let me rephrase to check I understood the plan correctly.
The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
and later make local_bh_{disable,enable} no-ops (on RT).
With 'bare' I mean not followed by a spin_lock() - which is enough to
ensure mutual exclusion vs BH on RT build - am I correct?
> The series introduces the infrastructure and converts large parts of
> networking, which is the largest stakeholder here. Once this is done, the
> per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
the above description and not addressed here. Is this series supposed
to cover 'net' fully?
Could you please include the diffstat for the whole series? I
think/hope it will help in catching the full picture more easily.
Note that some callers use local_bh_disable() with no additional lock, and
there is no specific struct to protect, but they enforce explicit
serialization vs BH for a bunch of operations, e.g. the
local_bh_disable() in inet_twsk_purge().
I guess such call sites should be handled, too?
Thanks!
Paolo
On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> > a local_bh_disable() section remains preemptible. As a result, high-priority
> > tasks (or threaded interrupts) will be blocked by long-running lower-priority
> > tasks (or threaded interrupts), which includes softirq sections.
> >
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
>
> Let me rephrase to check I understood the plan correctly.
>
> The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> and later make local_bh_{disable,enable} no-ops (on RT).
>
> With 'bare' I mean not followed by a spin_lock() - which is enough to
> ensure mutual exclusion vs BH on RT build - am I correct?
I might have misunderstood your rephrasing. But to make it clear:
| $ git grep -p local_lock\( kernel/softirq.c
| kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
| kernel/softirq.c: local_lock(&softirq_ctrl.lock);
this is what I want to remove. This is upstream RT only (not RT queue
only). !RT builds are not affected by this change.
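Roughly, the PREEMPT_RT variant looks like this (heavily trimmed sketch,
accounting and lockdep details left out):

void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
{
        /* First entry of this task into a BH-disabled section? */
        if (!current->softirq_disable_cnt) {
                if (preemptible()) {
                        /* The per-CPU lock this series wants to drop. */
                        local_lock(&softirq_ctrl.lock);
                        rcu_read_lock();
                }
        }
        /* ... per-task and per-CPU softirq counting elided ... */
}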
> > The series introduces the infrastructure and converts large parts of
> > networking, which is the largest stakeholder here. Once this is done, the
> > per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
>
> AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
> the above description and not addressed here. Is this series supposed
> to cover 'net' fully?
The net subsystem has not been fully audited but the major parts have
been. I checked global per-CPU variables but there might be dynamic
ones. Also new ones might have appeared in the meantime. There are
two things which are not fixed yet that I am aware of:
- tw_timer timer
https://lore.kernel.org/all/20240415113436.3261042-1-vschneid@redhat.com/T/#u
- can gw
https://lore.kernel.org/linux-can/20231031112349.y0aLoBrz@linutronix.de/
https://lore.kernel.org/all/20231221123703.8170-1-socketcan@hartkopp.net/T/#u
That means those two need to be fixed first before the local_lock() can
disappear from local_bh_disable()/enable(). Also the whole tree should
be checked.
> Could you please include the diffstat for the whole series? I
> think/hope it will help in catching the full picture more easily.
total over the series:
| include/linux/filter.h | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
| include/linux/local_lock.h | 21 +++++++++++++++++++++
| include/linux/local_lock_internal.h | 31 +++++++++++++++++++++++++++++++
| include/linux/lockdep.h | 3 +++
| include/linux/netdevice.h | 12 ++++++++++++
| include/linux/sched.h | 9 ++++++++-
| include/net/seg6_local.h | 1 +
| include/net/sock.h | 5 +++++
| kernel/bpf/cpumap.c | 27 +++++++++++----------------
| kernel/bpf/devmap.c | 16 ++++++++--------
| kernel/fork.c | 3 +++
| kernel/locking/spinlock.c | 8 ++++++++
| net/bpf/test_run.c | 11 ++++++++++-
| net/bridge/br_netfilter_hooks.c | 7 ++++++-
| net/core/dev.c | 39 +++++++++++++++++++++++++++++++++------
| net/core/dev.h | 20 ++++++++++++++++++++
| net/core/filter.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
| net/core/lwt_bpf.c | 9 +++++----
| net/core/skbuff.c | 25 ++++++++++++++++---------
| net/ipv4/tcp_ipv4.c | 15 +++++++++++----
| net/ipv4/tcp_sigpool.c | 17 +++++++++++++----
| net/ipv6/seg6_local.c | 22 ++++++++++++++--------
| net/xdp/xsk.c | 19 +++++++++++--------
| 23 files changed, 445 insertions(+), 116 deletions(-)
> Note that some callers use local_bh_disable() with no additional lock, and
> there is no specific struct to protect, but they enforce explicit
> serialization vs BH for a bunch of operations, e.g. the
> local_bh_disable() in inet_twsk_purge().
>
> I guess such call sites should be handled, too?
Yes, but I didn't find much. inet_twsk_purge() is the first item from my
list. On RT spin_lock() vs spin_lock_bh() usage does not deadlock and
could be mixed.
The only resources that can be protected by disabling BH are per-CPU
resources: either explicitly defined (such as napi_alloc_cache) or
implicit via other per-CPU usage such as a CPU-bound timer, worker, ….
Protecting global variables by disabling BH is broken on SMP (see the
CAN gw example), so I am not too worried about those.
Unless you are aware of a category I did not think of.
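As an illustration of the broken-global case above (the names here are
made up):

/* Broken on SMP: local_bh_disable() only stops BH on *this* CPU, so two
 * CPUs can still interleave the read-modify-write on the global. */
static unsigned long gw_hits;           /* hypothetical global counter */

static void gw_account_hit(void)
{
        local_bh_disable();
        gw_hits++;                      /* no cross-CPU exclusion */
        local_bh_enable();
}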
> Thanks!
>
> Paolo
Sebastian
On Mon, 2024-05-06 at 11:38 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> > On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> > > a local_bh_disable() section remains preemptible. As a result, high-priority
> > > tasks (or threaded interrupts) will be blocked by long-running lower-priority
> > > tasks (or threaded interrupts), which includes softirq sections.
> > >
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> >
> > Let me rephrase to check I understood the plan correctly.
> >
> > The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> > and later make local_bh_{disable,enable} no-ops (on RT).
> >
> > With 'bare' I mean not followed by a spin_lock() - which is enough to
> > ensure mutual exclusion vs BH on RT build - am I correct?
>
> I might have misunderstood your rephrasing. But to make it clear:
> > $ git grep -p local_lock\( kernel/softirq.c
> > kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> > kernel/softirq.c: local_lock(&softirq_ctrl.lock);
>
> this is what I want to remove. This is upstream RT only (not RT queue
> only). !RT builds are not affected by this change.
I was trying to describe the places that need the additional
local_lock(), but I think we are on the same page WRT the overall
semantics.
> >
> > Note that some callers use local_bh_disable() with no additional lock, and
> > there is no specific struct to protect, but they enforce explicit
> > serialization vs BH for a bunch of operations, e.g. the
> > local_bh_disable() in inet_twsk_purge().
> >
> > I guess such call sites should be handled, too?
>
> Yes, but I didn't find much. inet_twsk_purge() is the first item from my
> list. On RT spin_lock() vs spin_lock_bh() usage does not deadlock and
> could be mixed.
>
> The only resources that can be protected by disabling BH are per-CPU
> resources: either explicitly defined (such as napi_alloc_cache) or
> implicit via other per-CPU usage such as a CPU-bound timer, worker, ….
> Protecting global variables by disabling BH is broken on SMP (see the
> CAN gw example), so I am not too worried about those.
> Unless you are aware of a category I did not think of.
I think sometimes the stack could call local_bh_enable() a while after
the paired spin lock release, to enforce some serialization - like
what inet_twsk_purge() is doing - but I can't point to any specific
line off the top of my head.
A possible side-effect you should/could observe in the final tree is
more pressure on the process scheduler, as something like:
local_bh_disable()
<spinlock lock unlock>
<again spinlock lock unlock>
local_bh_enable()
could result in more invocations of the scheduler, right?
Cheers,
Paolo
On 2024-05-06 16:12:00 [+0200], Paolo Abeni wrote:
>
> I think sometimes the stack could call local_bh_enable() a while after
> the paired spin lock release, to enforce some serialization - like
> what inet_twsk_purge() is doing - but I can't point to any specific
> line off the top of my head.

I *think* the inet_twsk_purge() case is special because the timer is
pinned and that bh_disable call ensures that the timer does not fire.

> A possible side-effect you should/could observe in the final tree is
> more pressure on the process scheduler, as something like:
>
> local_bh_disable()
>
> <spinlock lock unlock>
>
> <again spinlock lock unlock>
>
> local_bh_enable()
>
> could result in more invocations of the scheduler, right?

Yes, to some degree. On PREEMPT_RT "spinlock lock" does not disable
preemption, so the section remains preemptible. A task with elevated
priority (SCHED_RR/FIFO/DL) remains on the CPU unless preempted by a
task with higher priority, regardless of the locks. A SCHED_OTHER task
can be preempted by another SCHED_OTHER task even with an acquired
spinlock_t. This can be bad performance-wise if this other SCHED_OTHER
task preempts the lock owner and blocks on the same lock.

To cope with this we had something called PREEMPT_LAZY (now
PREEMPT_AUTO) in the RT queue to avoid preemption among SCHED_OTHER
tasks as long as a spinlock_t (or another lock that spins on !RT) is
acquired.

By removing the lock from local_bh_disable() we lose that "please don't
preempt me" feature from your scenario above across the BH-disabled
section for SCHED_OTHER tasks. Nothing changes for tasks with elevated
priority.

> Cheers,
>
> Paolo

Sebastian