lockdep_unregister_key() is called from critical code paths, including
sections where rtnl_lock() is held. For example, when replacing a qdisc
in a network device, network egress traffic is disabled while
__qdisc_destroy() is called for every network queue.
If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
which gets blocked waiting for synchronize_rcu() to complete.
For example, a simple tc command to replace a qdisc could take 13
seconds:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m13.195s
user 0m0.001s
sys 0m2.746s
During this time, network egress is completely frozen while waiting for
RCU synchronization.
Use synchronize_rcu_expedited() instead to minimize the impact on
critical operations like network connectivity changes.
This makes the tc invocation roughly 10x faster when replacing the
qdisc of a network card.
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m1.789s
user 0m0.000s
sys 0m1.613s
Reported-by: Erik Lundgren <elundgren@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
---
kernel/locking/lockdep.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4470680f02269..a79030ac36dd4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
if (need_callback)
call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
- /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
- synchronize_rcu();
+ /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
+ * This needs to be quick, since it is called in critical sections
+ */
+ synchronize_rcu_expedited();
}
EXPORT_SYMBOL_GPL(lockdep_unregister_key);
---
base-commit: 81e4f8d68c66da301bb881862735bd74c6241a19
change-id: 20250319-lockdep-b1eca0479665
Best regards,
--
Breno Leitao <leitao@debian.org>
Hello Waiman, Boqun,

On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
[...]
> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> real 0m1.789s
> user 0m0.000s
> sys 0m1.613s

Can I have this landed as a workaround for the problem above, until the
hazard pointers work gets merged?

This is affecting some systems that run Linus' upstream kernel with some
debug flags enabled, and I would like to have them unblocked.

Once the hazard pointer work lands, this will be reverted. Is this a
fair approach?

Thanks for your help,
--breno
On 7/9/25 6:00 AM, Breno Leitao wrote:
> Hello Waiman, Boqun,
[...]
> Can I have this landed as a workaround for the problem above, while
> hazard pointers doesn't get merged?
>
> This is affecting some systems that runs the Linus' upstream kernel with
> some debug flags enabled, and I would like to have they unblocked.
>
> Once hazard pointer lands, this will be reverted. Is this a fair
> approach?

I am fine with this patch going in as a temporary workaround. Boqun,
what do you think?

Cheers,
Longman
On Wed, Jul 09, 2025 at 09:57:44AM -0400, Waiman Long wrote:
> On 7/9/25 6:00 AM, Breno Leitao wrote:
[...]
> > Once hazard pointer lands, this will be reverted. Is this a fair
> > approach?
>
> I am fine with this patch going in as a temporary workaround. Boqun,
> what do you think?

Seems good to me. We can add a "// TODO" comment to call out it's a
temporary workaround and should be replaced.

Although, I believe Peter had some concern about IPI, so I would like to
get his opinion on using synchronize_rcu_expedited() as a temporary
solution. Peter?

Regards,
Boqun

> Cheers,
> Longman
On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
[...]
> - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> - synchronize_rcu();
> + /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> + * This needs to be quick, since it is called in critical sections
> + */
> + synchronize_rcu_expedited();
> }
> EXPORT_SYMBOL_GPL(lockdep_unregister_key);

So I fundamentally despise synchronize_rcu_expedited(), also your
comment style is broken.

Why can't qdisc call this outside of the lock?
On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> > lockdep_unregister_key() is called from critical code paths, including
> > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > in a network device, network egress traffic is disabled while
> > __qdisc_destroy() is called for every network queue.
> >
> > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > which gets blocked waiting for synchronize_rcu() to complete.
> >
> > For example, a simple tc command to replace a qdisc could take 13
> > seconds:
> >
> > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > real 0m13.195s
> > user 0m0.001s
> > sys 0m2.746s
> >
> > During this time, network egress is completely frozen while waiting for
> > RCU synchronization.
> >
> > Use synchronize_rcu_expedited() instead to minimize the impact on
> > critical operations like network connectivity changes.
> >
> > This improves 10x the function call to tc, when replacing the qdisc for
> > a network card.
> >
> > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > real 0m1.789s
> > user 0m0.000s
> > sys 0m1.613s
> >
> > Reported-by: Erik Lundgren <elundgren@meta.com>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> > ---
> > kernel/locking/lockdep.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 4470680f02269..a79030ac36dd4 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > if (need_callback)
> > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> >
> > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > - synchronize_rcu();
> > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > + * This needs to be quick, since it is called in critical sections
> > + */
> > + synchronize_rcu_expedited();
> > }
> > EXPORT_SYMBOL_GPL(lockdep_unregister_key);
>
> So I fundamentally despise synchronize_rcu_expedited(), also your
> comment style is broken.
>
> Why can't qdisc call this outside of the lock?
Good luck with that, and anyway the time to call it 256 times would
still hurt Breno's use case.
My suggestion was to change the lockdep_unregister_key() contract, and
use kfree_rcu() there.
> I think we should redesign lockdep_unregister_key() to work on a separately
> allocated piece of memory,
> then use kfree_rcu() in it.
>
> Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
>
> struct ... {
> struct lock_class_key key;
> struct rcu_head rcu;
> }
More work because it requires changing all lockdep_unregister_key() users.
On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
[...]
> > > ---
> > > kernel/locking/lockdep.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 4470680f02269..a79030ac36dd4 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > if (need_callback)
> > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > >
> > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > - synchronize_rcu();
I find even the old comment a bit confusing. Normally I would expect
the caller of lockdep_unregister_key() to guarantee that the key has
been unpublished; in other words, there is no way a
lockdep_unregister_key() could race with a
register_lock_class()/lockdep_init_map_type(), and the
synchronize_rcu() would not be needed then.
Let's say someone breaks my assumption above. Then, when doing a
register_lock_class() with a key about to be unregistered, I cannot see
anything that stops the following:
CPU 0 CPU 1
===== =====
register_lock_class():
...
} else if (... && !is_dynamic_key(lock->key)) {
// ->key is not unregistered yet, so this branch is not
// taken.
return NULL;
}
lockdep_unregister_key(..);
// key unregister, can be free
// any time.
key = lock->key->subkeys + subclass; // BOOM! UAF.
So either we don't need the synchronize_rcu() here or the
synchronize_rcu() doesn't help at all. Am I missing something subtle
here?
Regards,
Boqun
> > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > + * This needs to be quick, since it is called in critical sections
> > > + */
> > > + synchronize_rcu_expedited();
> > > }
> > > EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> >
> > So I fundamentally despise synchronize_rcu_expedited(), also your
> > comment style is broken.
> >
> > Why can't qdisc call this outside of the lock?
>
> Good luck with that, and anyway the time to call it 256 times would
> still hurt Breno use case.
>
> My suggestion was to change lockdep_unregister_key() contract, and use
> kfree_rcu() there
>
> > I think we should redesign lockdep_unregister_key() to work on a separately
> > allocated piece of memory,
> > then use kfree_rcu() in it.
> >
> > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> >
> > struct ... {
> > struct lock_class_key key;
> > struct rcu_head rcu;
> > }
>
> More work because it requires changing all lockdep_unregister_key() users.
On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> [...]
> > > > ---
> > > > kernel/locking/lockdep.c | 6 ++++--
> > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > index 4470680f02269..a79030ac36dd4 100644
> > > > --- a/kernel/locking/lockdep.c
> > > > +++ b/kernel/locking/lockdep.c
> > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > if (need_callback)
> > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > >
> > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > - synchronize_rcu();
>
> I feel a bit confusing even for the old comment, normally I would expect
> the caller of lockdep_unregister_key() should guarantee the key has been
> unpublished, in other words, there is no way a lockdep_unregister_key()
> could race with a register_lock_class()/lockdep_init_map_type(). The
> synchronize_rcu() is not needed then.
>
> Let's say someone breaks my assumption above, then when doing a
> register_lock_class() with a key about to be unregister, I cannot see
> anything stops the following:
>
> CPU 0 CPU 1
> ===== =====
> register_lock_class():
> ...
> } else if (... && !is_dynamic_key(lock->key)) {
> // ->key is not unregistered yet, so this branch is not
> // taken.
> return NULL;
> }
> lockdep_unregister_key(..);
> // key unregister, can be free
> // any time.
> key = lock->key->subkeys + subclass; // BOOM! UAF.
>
> So either we don't need the synchronize_rcu() here or the
> synchronize_rcu() doesn't help at all. Am I missing something subtle
> here?
>
Oh! Maybe I was missing that register_lock_class() must be called with
irqs disabled, which also makes it an RCU read-side critical section.
Regards,
Boqun
> Regards,
> Boqun
>
> > > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > > + * This needs to be quick, since it is called in critical sections
> > > > + */
> > > > + synchronize_rcu_expedited();
> > > > }
> > > > EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> > >
> > > So I fundamentally despise synchronize_rcu_expedited(), also your
> > > comment style is broken.
> > >
> > > Why can't qdisc call this outside of the lock?
> >
> > Good luck with that, and anyway the time to call it 256 times would
> > still hurt Breno use case.
> >
> > My suggestion was to change lockdep_unregister_key() contract, and use
> > kfree_rcu() there
> >
> > > I think we should redesign lockdep_unregister_key() to work on a separately
> > > allocated piece of memory,
> > > then use kfree_rcu() in it.
> > >
> > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> > >
> > > struct ... {
> > > struct lock_class_key key;
> > > struct rcu_head rcu;
> > > }
> >
> > More work because it requires changing all lockdep_unregister_key() users.
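The observation in the follow-up (register_lock_class() runs with irqs
disabled) matters because, for non-preemptible RCU, a CPU with
interrupts off cannot pass through a quiescent state, so the region acts
as an implicit RCU read-side critical section. A non-runnable
kernel-style sketch of why the synchronize_rcu() then covers the key
access (condensed from the functions discussed above, not a literal
excerpt):

```c
/* Reader (register_lock_class() caller), irqs off throughout: */
raw_local_irq_save(flags);
if (!static_obj(lock->key) && !is_dynamic_key(lock->key))
	return NULL;			/* key is dereferenced here */
raw_local_irq_restore(flags);		/* quiescent state possible again */

/* Updater (lockdep_unregister_key()): */
hlist_del_rcu(&key->hash_entry);	/* unpublish the key */
synchronize_rcu();			/* waits out every irqs-off region
					 * that might still see the key */
/* only now may the caller free the memory holding the key */
```

On this reading, the grace-period wait is exactly what makes the
subsequent free safe against a concurrent irqs-off lookup.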
On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> > [...]
> > > > > ---
> > > > > kernel/locking/lockdep.c | 6 ++++--
> > > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > > index 4470680f02269..a79030ac36dd4 100644
> > > > > --- a/kernel/locking/lockdep.c
> > > > > +++ b/kernel/locking/lockdep.c
> > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > > if (need_callback)
> > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > > >
> > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > > - synchronize_rcu();
> >
> > I feel a bit confusing even for the old comment, normally I would expect
> > the caller of lockdep_unregister_key() should guarantee the key has been
> > unpublished, in other words, there is no way a lockdep_unregister_key()
> > could race with a register_lock_class()/lockdep_init_map_type(). The
> > synchronize_rcu() is not needed then.
> >
> > Let's say someone breaks my assumption above, then when doing a
> > register_lock_class() with a key about to be unregister, I cannot see
> > anything stops the following:
> >
> > CPU 0 CPU 1
> > ===== =====
> > register_lock_class():
> > ...
> > } else if (... && !is_dynamic_key(lock->key)) {
> > // ->key is not unregistered yet, so this branch is not
> > // taken.
> > return NULL;
> > }
> > lockdep_unregister_key(..);
> > // key unregister, can be free
> > // any time.
> > key = lock->key->subkeys + subclass; // BOOM! UAF.
> >
> > So either we don't need the synchronize_rcu() here or the
> > synchronize_rcu() doesn't help at all. Am I missing something subtle
> > here?
> >
>
> Oh! Maybe I was missing register_lock_class() must be called with irq
> disabled, which is also an RCU read-side critical section.
>
Since register_lock_class() will be called with irqs disabled, maybe
hazard pointers [1] are a better fit: in most cases there are only
nr_cpus readers, so the set of potential hazard pointer slots is fixed.
So the patch below reduces the time of the tc command from real ~1.7
seconds (v6.14) to real ~0.05 seconds (v6.14 + patch) in my test env,
which is not surprising given it's a dedicated hazard pointer
implementation for lock_class_key.
Thoughts?
[1]: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/
Regards,
Boqun
---------------------------------->8
From: Boqun Feng <boqun.feng@gmail.com>
Date: Mon, 24 Mar 2025 13:38:15 -0700
Subject: [PATCH] WIP
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/locking/lockdep.c | 65 +++++++++++++++++++++++++++++++++++++---
1 file changed, 61 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4470680f0226..0b6e78d56cfe 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -111,6 +111,29 @@ late_initcall(kernel_lockdep_sysctls_init);
DEFINE_PER_CPU(unsigned int, lockdep_recursion);
EXPORT_PER_CPU_SYMBOL_GPL(lockdep_recursion);
+/* hazptr free always clears the slot */
+DEFINE_FREE(lockdep_key_hazptr, struct lock_class_key **, if (_T) smp_store_release((_T), NULL));
+DEFINE_PER_CPU(struct lock_class_key *, lockdep_key_hazptr);
+
+static void synchronize_lockdep_key_hazptr(struct lock_class_key *key)
+{
+ int cpu;
+
+ if (!key)
+ return;
+ /*
+ * Synchronizes with the smp_mb() after protecting the pointer with
+ * WRITE_ONCE().
+ */
+ smp_mb();
+
+ for_each_possible_cpu(cpu) {
+ struct lock_class_key **hazptr = per_cpu_ptr(&lockdep_key_hazptr, cpu);
+
+ smp_cond_load_acquire(hazptr, VAL != key);
+ }
+}
+
static __always_inline bool lockdep_enabled(void)
{
if (!debug_locks)
@@ -1283,6 +1306,7 @@ static struct lock_class *
register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
{
struct lockdep_subclass_key *key;
+ struct lock_class_key **key_hazptr __free(lockdep_key_hazptr) = NULL;
struct hlist_head *hash_head;
struct lock_class *class;
int idx;
@@ -1293,11 +1317,25 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
if (likely(class))
goto out_set_class_cache;
+ /* Interrupts are disabled hence it's safe to acquire the hazptr slot */
+ key_hazptr = this_cpu_ptr(&lockdep_key_hazptr);
+
+ /* hazptr slot must be unused */
+ BUG_ON(*key_hazptr);
+
if (!lock->key) {
if (!assign_lock_key(lock))
return NULL;
- } else if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
- return NULL;
+ } else {
+ /* hazptr: protect the key */
+ WRITE_ONCE(*key_hazptr, lock->key);
+
+ /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
+ smp_mb();
+
+ if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
+ return NULL;
+ }
}
key = lock->key->subkeys + subclass;
@@ -4964,16 +5002,35 @@ void lockdep_init_map_type(struct lockdep_map *lock, const char *name,
*/
if (DEBUG_LOCKS_WARN_ON(!key))
return;
+
+ preempt_disable();
+ struct lock_class_key **hazptr __free(lockdep_key_hazptr) = this_cpu_ptr(&lockdep_key_hazptr);
+
+ /* hazptr: Protect the key */
+ WRITE_ONCE(*hazptr, key);
+
+ /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
+ smp_mb();
+
/*
* Sanity check, the lock-class key must either have been allocated
* statically or must have been registered as a dynamic key.
*/
if (!static_obj(key) && !is_dynamic_key(key)) {
+ preempt_enable();
if (debug_locks)
printk(KERN_ERR "BUG: key %px has not been registered!\n", key);
DEBUG_LOCKS_WARN_ON(1);
return;
}
+
+ /* hazptr: Release the key */
+ smp_store_release(hazptr, NULL);
+
+ /* Do not auto clean up */
+ hazptr = NULL;
+ preempt_enable();
+
lock->key = key;
if (unlikely(!debug_locks))
@@ -6595,8 +6652,8 @@ void lockdep_unregister_key(struct lock_class_key *key)
if (need_callback)
call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
- /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
- synchronize_rcu();
+ /* Wait until register_lock_class() has finished accessing key. */
+ synchronize_lockdep_key_hazptr(key);
}
EXPORT_SYMBOL_GPL(lockdep_unregister_key);
--
2.48.1
On 3/24/25 8:47 PM, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
>> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
>>> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
>>> [...]
>>>>>> ---
>>>>>> kernel/locking/lockdep.c | 6 ++++--
>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>>> index 4470680f02269..a79030ac36dd4 100644
>>>>>> --- a/kernel/locking/lockdep.c
>>>>>> +++ b/kernel/locking/lockdep.c
>>>>>> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
>>>>>> if (need_callback)
>>>>>> call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
>>>>>>
>>>>>> - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
>>>>>> - synchronize_rcu();
>>> I feel a bit confusing even for the old comment, normally I would expect
>>> the caller of lockdep_unregister_key() should guarantee the key has been
>>> unpublished, in other words, there is no way a lockdep_unregister_key()
>>> could race with a register_lock_class()/lockdep_init_map_type(). The
>>> synchronize_rcu() is not needed then.
>>>
>>> Let's say someone breaks my assumption above, then when doing a
>>> register_lock_class() with a key about to be unregister, I cannot see
>>> anything stops the following:
>>>
>>> CPU 0 CPU 1
>>> ===== =====
>>> register_lock_class():
>>> ...
>>> } else if (... && !is_dynamic_key(lock->key)) {
>>> // ->key is not unregistered yet, so this branch is not
>>> // taken.
>>> return NULL;
>>> }
>>> lockdep_unregister_key(..);
>>> // key unregister, can be free
>>> // any time.
>>> key = lock->key->subkeys + subclass; // BOOM! UAF.
>>>
>>> So either we don't need the synchronize_rcu() here or the
>>> synchronize_rcu() doesn't help at all. Am I missing something subtle
>>> here?
>>>
>> Oh! Maybe I was missing register_lock_class() must be called with irq
>> disabled, which is also an RCU read-side critical section.
>>
> Since register_lock_class() will be call with irq disabled, maybe hazard
> pointers [1] is better because most of the case we only have nr_cpus
> readers, so the potential hazard pointer slots are fixed.
>
> So the below patch can reduce the time of the tc command from real ~1.7
> second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
> which is not surprising given it's a dedicated hazard pointers for
> lock_class_key.
>
> Thoughts?
My understanding is that it is not a race between register_lock_class()
and lockdep_unregister_key(). It is the fact that the structure that
holds the lock_class_key may be freed immediately after return from
lockdep_unregister_key(). So any processes that are in the process of
iterating the hash_list containing the hash_entry to be unregistered may
hit a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid
potential access of invalid memory in lock_class") for a discussion of
this kind of UAF problem.
As suggested by Eric, one possible solution is to add a
lockdep_unregister_key() variant function that presumes the structure
holding the key won't be freed until after an RCU delay. In this case,
we can skip the last synchronize_rcu() call. Any callers that need an
immediate return should use kfree_rcu() to free the structure after
calling the lockdep_unregister_key() variant.
Cheers,
Longman
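A rough kernel-side sketch of that contract change (non-runnable; both
the wrapper struct and `lockdep_unregister_key_norcu()` are hypothetical
names following the outline in this thread, not existing APIs):

```c
/* Hypothetical wrapper: the key is allocated separately from the
 * object that uses it, together with an rcu_head for deferred free. */
struct dyn_lock_key {
	struct lock_class_key key;
	struct rcu_head rcu;
};

static void qdisc_free_lock_key(struct dyn_lock_key *dk)
{
	/*
	 * Hypothetical variant: unhashes the key but skips the final
	 * synchronize_rcu(). The caller guarantees the memory stays
	 * valid for a grace period, which kfree_rcu() provides.
	 */
	lockdep_unregister_key_norcu(&dk->key);
	kfree_rcu(dk, rcu);	/* no blocking wait on the caller's path */
}
```

The caller's critical section no longer blocks at all; the cost is that
every user embedding a lock_class_key would have to move to the
pointer-plus-wrapper form, which is the churn noted earlier in the
thread.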
On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote:
> On 3/24/25 8:47 PM, Boqun Feng wrote:
> > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
> > > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> > > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> > > > [...]
> > > > > > > ---
> > > > > > > kernel/locking/lockdep.c | 6 ++++--
> > > > > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > > > > index 4470680f02269..a79030ac36dd4 100644
> > > > > > > --- a/kernel/locking/lockdep.c
> > > > > > > +++ b/kernel/locking/lockdep.c
> > > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > > > > if (need_callback)
> > > > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > > > > >
> > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > > > > - synchronize_rcu();
> > > > I feel a bit confusing even for the old comment, normally I would expect
> > > > the caller of lockdep_unregister_key() should guarantee the key has been
> > > > unpublished, in other words, there is no way a lockdep_unregister_key()
> > > > could race with a register_lock_class()/lockdep_init_map_type(). The
> > > > synchronize_rcu() is not needed then.
> > > >
> > > > Let's say someone breaks my assumption above, then when doing a
> > > > register_lock_class() with a key about to be unregister, I cannot see
> > > > anything stops the following:
> > > >
> > > > CPU 0 CPU 1
> > > > ===== =====
> > > > register_lock_class():
> > > > ...
> > > > } else if (... && !is_dynamic_key(lock->key)) {
> > > > // ->key is not unregistered yet, so this branch is not
> > > > // taken.
> > > > return NULL;
> > > > }
> > > > lockdep_unregister_key(..);
> > > > // key unregister, can be free
> > > > // any time.
> > > > key = lock->key->subkeys + subclass; // BOOM! UAF.
This is not a UAF :(
> > > >
> > > > So either we don't need the synchronize_rcu() here or the
> > > > synchronize_rcu() doesn't help at all. Am I missing something subtle
> > > > here?
> > > >
> > > Oh! Maybe I was missing register_lock_class() must be called with irq
> > > disabled, which is also an RCU read-side critical section.
> > >
> > Since register_lock_class() will be call with irq disabled, maybe hazard
> > pointers [1] is better because most of the case we only have nr_cpus
> > readers, so the potential hazard pointer slots are fixed.
> >
> > So the below patch can reduce the time of the tc command from real ~1.7
> > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
> > which is not surprising given it's a dedicated hazard pointers for
> > lock_class_key.
> >
> > Thoughts?
>
> My understanding is that it is not a race between register_lock_class() and
> lockdep_unregister_key(). It is the fact that the structure that holds the
> lock_class_key may be freed immediately after return from
> lockdep_unregister_key(). So any processes that are in the process of
> iterating the hash_list containing the hash_entry to be unregistered may hit
You mean the lock_keys_hash table, right? I used register_lock_class()
as an example, because it's one of the places that iterates
lock_keys_hash. IIUC lock_keys_hash is only used in
lockdep_{un,}register_key() and is_dynamic_key() (which are only called
by lockdep_init_map_type() and register_lock_class()).
> a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential
> access of invalid memory in lock_class") for a discussion of this kind of
> UAF problem.
>
That commit seemed to fix a race between disabling lockdep and
unregistering a key, and most importantly, it calls zap_class() for the
unregistered key even if lockdep is disabled (debug_locks = 0). It
might be related, but I'm not sure that's the reason for putting
synchronize_rcu() there. Say you want to synchronize between
/proc/lockdep and lockdep_unregister_key(), and you have
synchronize_rcu() in lockdep_unregister_key(): what's the RCU read-side
critical section at /proc/lockdep?
Regards,
Boqun
> As suggested by Eric, one possible solution is to add a
> lockdep_unregister_key() variant function that presumes the structure
> holding the key won't be freed until after a RCU delay. In this case, we can
> skip the last synchronize_rcu() call. Any callers that need immediate return
> should use kfree_rcu() to free the structure after calling the
> lockdep_unregister_key() variant.
>
> Cheers,
> Longman
>
On Mon, Mar 24, 2025 at 1:23 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> > > lockdep_unregister_key() is called from critical code paths, including
> > > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > > in a network device, network egress traffic is disabled while
> > > __qdisc_destroy() is called for every network queue.
> > >
> > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > > which gets blocked waiting for synchronize_rcu() to complete.
> > >
> > > For example, a simple tc command to replace a qdisc could take 13
> > > seconds:
> > >
> > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > > real 0m13.195s
> > > user 0m0.001s
> > > sys 0m2.746s
> > >
> > > During this time, network egress is completely frozen while waiting for
> > > RCU synchronization.
> > >
> > > Use synchronize_rcu_expedited() instead to minimize the impact on
> > > critical operations like network connectivity changes.
> > >
> > > This improves the tc call by 10x when replacing the qdisc of a
> > > network card.
> > >
> > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > > real 0m1.789s
> > > user 0m0.000s
> > > sys 0m1.613s
> > >
> > > Reported-by: Erik Lundgren <elundgren@meta.com>
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> > > ---
> > > kernel/locking/lockdep.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 4470680f02269..a79030ac36dd4 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > if (need_callback)
> > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > >
> > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > - synchronize_rcu();
> > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > + * This needs to be quick, since it is called in critical sections
> > > + */
> > > + synchronize_rcu_expedited();
> > > }
> > > EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> >
> > So I fundamentally despise synchronize_rcu_expedited(), also your
> > comment style is broken.
> >
> > Why can't qdisc call this outside of the lock?
>
> Good luck with that, and anyway the time to call it 256 times would
> still hurt Breno's use case.
>
> My suggestion was to change lockdep_unregister_key() contract, and use
> kfree_rcu() there
>
> > I think we should redesign lockdep_unregister_key() to work on a separately
> > allocated piece of memory,
> > then use kfree_rcu() in it.
> >
> > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> >
> > struct ... {
> > struct lock_class_key key;
> > struct rcu_head rcu;
> > }
>
> More work because it requires changing all lockdep_unregister_key() users.
Or add a new function, basically allow users to be converted when they need to.
On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote:
>
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
>
> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> which gets blocked waiting for synchronize_rcu() to complete.
>
> For example, a simple tc command to replace a qdisc could take 13
> seconds:
>
> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> real 0m13.195s
> user 0m0.001s
> sys 0m2.746s
>
> During this time, network egress is completely frozen while waiting for
> RCU synchronization.
>
> Use synchronize_rcu_expedited() instead to minimize the impact on
> critical operations like network connectivity changes.

Running 'critical operations' with LOCKDEP enabled kernels seems a bit
strange :)

Reviewed-by: Eric Dumazet <edumazet@google.com>
On Fri, Mar 21, 2025 at 11:37:31AM +0100, Eric Dumazet wrote:
> On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > lockdep_unregister_key() is called from critical code paths, including
> > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > in a network device, network egress traffic is disabled while
> > __qdisc_destroy() is called for every network queue.
> >
> > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > which gets blocked waiting for synchronize_rcu() to complete.
> >
> > For example, a simple tc command to replace a qdisc could take 13
> > seconds:
> >
> > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > real 0m13.195s
> > user 0m0.001s
> > sys 0m2.746s
> >
> > During this time, network egress is completely frozen while waiting for
> > RCU synchronization.
> >
> > Use synchronize_rcu_expedited() instead to minimize the impact on
> > critical operations like network connectivity changes.
>
> Running 'critical operations' with LOCKDEP enabled kernels seems a bit
> strange :)

Apologies, I meant to say that at Meta, we have certain service tiers
that can accommodate the performance trade-offs of running a "debug"
kernel. This kernel provides valuable feedback about its operations.

The aim is to identify any significant issues early on, rather than
having to debug "hard issues" (such as deadlocks, out-of-memory access,
etc) once the kernel is in production.

> Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks for your support during this investigation,
--breno
The following commit has been merged into the locking/core branch of tip:
Commit-ID: 7a3cedafccf8e7d038ad4cfec5b38052647ceac5
Gitweb: https://git.kernel.org/tip/7a3cedafccf8e7d038ad4cfec5b38052647ceac5
Author: Breno Leitao <leitao@debian.org>
AuthorDate: Fri, 21 Mar 2025 02:30:49 -07:00
Committer: Boqun Feng <boqun.feng@gmail.com>
CommitterDate: Mon, 14 Jul 2025 21:57:29 -07:00
lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
lockdep_unregister_key() is called from critical code paths, including
sections where rtnl_lock() is held. For example, when replacing a qdisc
in a network device, network egress traffic is disabled while
__qdisc_destroy() is called for every network queue.
If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
which gets blocked waiting for synchronize_rcu() to complete.
For example, a simple tc command to replace a qdisc could take 13
seconds:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m13.195s
user 0m0.001s
sys 0m2.746s
During this time, network egress is completely frozen while waiting for
RCU synchronization.
Use synchronize_rcu_expedited() instead to minimize the impact on
critical operations like network connectivity changes.
This improves the tc call by 10x when replacing the qdisc of a
network card.
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m1.789s
user 0m0.000s
sys 0m1.613s
[boqun: Fixed the comment, added more information for the temporary
workaround, and added TODO information for hazptr]
Reported-by: Erik Lundgren <elundgren@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20250321-lockdep-v1-1-78b732d195fb@debian.org
---
kernel/locking/lockdep.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 0c94141..2d4c5ba 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6616,8 +6616,16 @@ void lockdep_unregister_key(struct lock_class_key *key)
if (need_callback)
call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
- /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
- synchronize_rcu();
+ /*
+ * Wait until is_dynamic_key() has finished accessing k->hash_entry.
+ *
+ * Some operations like __qdisc_destroy() will call this in a debug
+ * kernel, and the network traffic is disabled while waiting, hence
+ * the delay of the wait matters in debugging cases. Currently use a
+ * synchronize_rcu_expedited() to speed up the wait at the cost of
+ * system IPIs. TODO: Replace RCU with hazptr for this.
+ */
+ synchronize_rcu_expedited();
}
EXPORT_SYMBOL_GPL(lockdep_unregister_key);