[PATCH v3 3/3] locking/lockdep: Disable KASAN instrumentation of lockdep.c

Posted by Waiman Long 12 months ago
Both KASAN and LOCKDEP are commonly enabled when building a debug kernel.
Each of them can significantly slow down a debug kernel. Enabling KASAN
instrumentation of the LOCKDEP code slows things down further.

Since LOCKDEP is a high-overhead debugging tool, it will never be
enabled in a production kernel. The LOCKDEP code is also pretty mature
and is unlikely to see major changes. There is also a possibility of
recursion, similar to the KCSAN case.

To evaluate the performance impact of disabling KASAN instrumentation
of lockdep.c, the time to do a parallel build of the Linux defconfig
kernel was used as the benchmark. Two x86-64 systems (Skylake & Zen 2)
and an arm64 system were used as test beds. Two sets of non-RT and RT
kernels with similar configurations, differing mainly in
CONFIG_PREEMPT_RT, were used for the evaluation.

For the Skylake system:

  Kernel			Run time	    Sys time
  ------			--------	    --------
  Non-debug kernel (baseline)	0m47.642s	      4m19.811s
  Debug kernel			2m11.108s (x2.8)     38m20.467s (x8.9)
  Debug kernel (patched)	1m49.602s (x2.3)     31m28.501s (x7.3)
  Debug kernel
  (patched + mitigations=off) 	1m30.988s (x1.9)     26m41.993s (x6.2)

  RT kernel (baseline)		0m54.871s	      7m15.340s
  RT debug kernel		6m07.151s (x6.7)    135m47.428s (x18.7)
  RT debug kernel (patched)	3m42.434s (x4.1)     74m51.636s (x10.3)
  RT debug kernel
  (patched + mitigations=off) 	2m40.383s (x2.9)     57m54.369s (x8.0)
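
For reference, the bracketed multipliers are simply ratios against the
non-debug baseline; the arithmetic can be reproduced from the Skylake
numbers above, e.g.:

```shell
# Convert "XmY.YYYs" into seconds, then print the slowdown multiplier.
to_s() { echo "$1" | awk -Fm '{ sub(/s$/, "", $2); print $1 * 60 + $2 }'; }
ratio() { awk -v a="$(to_s "$1")" -v b="$(to_s "$2")" 'BEGIN { printf "x%.1f\n", a / b }'; }

ratio 2m11.108s 0m47.642s     # debug vs. baseline run time -> x2.8
ratio 38m20.467s 4m19.811s    # debug vs. baseline sys time -> x8.9
```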

For the Zen 2 system:

  Kernel			Run time	    Sys time
  ------			--------	    --------
  Non-debug kernel (baseline)	1m42.806s	     39m48.714s
  Debug kernel			4m04.524s (x2.4)    125m35.904s (x3.2)
  Debug kernel (patched)	3m56.241s (x2.3)    127m22.378s (x3.2)
  Debug kernel
  (patched + mitigations=off) 	2m38.157s (x1.5)     92m35.680s (x2.3)

  RT kernel (baseline)		 1m51.500s	     14m56.322s
  RT debug kernel		16m04.962s (x8.7)   244m36.463s (x16.4)
  RT debug kernel (patched)	 9m09.073s (x4.9)   129m28.439s (x8.7)
  RT debug kernel
  (patched + mitigations=off) 	 3m31.662s (x1.9)    51m01.391s (x3.4)

For the arm64 system:

  Kernel			Run time	    Sys time
  ------			--------	    --------
  Non-debug kernel (baseline)	1m56.844s	      8m47.150s
  Debug kernel			3m54.774s (x2.0)     92m30.098s (x10.5)
  Debug kernel (patched)	3m32.429s (x1.8)     77m40.779s (x8.8)

  RT kernel (baseline)		 4m01.641s	     18m16.777s
  RT debug kernel		19m32.977s (x4.9)   304m23.965s (x16.7)
  RT debug kernel (patched)	16m28.354s (x4.1)   234m18.149s (x12.8)

Turning the mitigations off doesn't seem to have any noticeable impact
on the performance of the arm64 system, so the mitigations=off entries
aren't included.

For the x86 CPUs, CPU mitigations have a much bigger impact on
performance, especially for the RT debug kernel. The SRSO mitigation on
Zen 2 has an especially big impact on the debug kernel and accounts for
the majority of the slowdown with mitigations on, because the patched
ret instruction slows down function returns. A lot of helper functions
that are normally compiled out or inlined may become real function
calls in the debug kernel. The KASAN instrumentation inserts a lot
of __asan_loadX*() and __kasan_check_read() function calls into the
memory access portions of the code. Lockdep's __lock_acquire() function,
for instance, has 66 __asan_loadX*() and 6 __kasan_check_read() calls
added by KASAN instrumentation. Of course, the actual numbers may vary
depending on the compiler used and the exact version of the lockdep code.

With the newly added rtmutex and lockdep lock events, the relevant
event counts for the test runs with the Skylake system were:

  Event type		Debug kernel	RT debug kernel
  ----------		------------	---------------
  lockdep_acquire	1,968,663,277	5,425,313,953
  rtlock_slowlock	     -		  401,701,156
  rtmutex_slowlock	     -		      139,672

The number of __lock_acquire() calls in the RT debug kernel is 2.8
times that of the non-RT debug kernel with the same workload. Since the
__lock_acquire() function is a big hitter in terms of performance
slowdown, this makes the RT debug kernel much slower than the non-RT
one. The average lock nesting depth is likely higher in the RT debug
kernel too, leading to longer execution time in the __lock_acquire()
function.
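
The x2.8 figure is just the ratio of the two lockdep_acquire counts in
the table above:

```shell
# RT debug vs. non-RT debug lockdep_acquire event counts.
awk 'BEGIN { printf "x%.1f\n", 5425313953 / 1968663277 }'   # -> x2.8
```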

As the small advantage of enabling KASAN instrumentation to catch
potential memory access errors in the lockdep debugging tool is probably
not worth the drawback of further slowing down a debug kernel, disable
KASAN instrumentation in the lockdep code to allow the debug kernels
to regain some performance, especially the RT debug kernels.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17b8..a114949eeed5 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -5,7 +5,8 @@ KCOV_INSTRUMENT		:= n
 
 obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
-# Avoid recursion lockdep -> sanitizer -> ... -> lockdep.
+# Avoid recursion lockdep -> sanitizer -> ... -> lockdep & improve performance.
+KASAN_SANITIZE_lockdep.o := n
 KCSAN_SANITIZE_lockdep.o := n
 
 ifdef CONFIG_FUNCTION_TRACER
-- 
2.48.1
Re: [PATCH v3 3/3] locking/lockdep: Disable KASAN instrumentation of lockdep.c
Posted by Boqun Feng 12 months ago
[Cc KASAN]

A Reviewed-by or Acked-by from KASAN would be nice, thanks!

Regards,
Boqun

On Sun, Feb 09, 2025 at 11:26:12PM -0500, Waiman Long wrote:
> [...]
Re: [PATCH v3 3/3] locking/lockdep: Disable KASAN instrumentation of lockdep.c
Posted by Marco Elver 12 months ago
On Wed, 12 Feb 2025 at 06:57, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> [Cc KASAN]
>
> A Reviewed-by or Acked-by from KASAN would be nice, thanks!
>
> Regards,
> Boqun
>
> On Sun, Feb 09, 2025 at 11:26:12PM -0500, Waiman Long wrote:
> > [...]
> > For the x86 CPUs, cpu mitigations has a much bigger impact on
> > performance, especially the RT debug kernel. The SRSO mitigation in
> > Zen 2 has an especially big impact on the debug kernel. It is also the
> > majority of the slowdown with mitigations on. It is because the patched
> > ret instruction slows down function returns. A lot of helper functions
> > that are normally compiled out or inlined may become real function
> > calls in the debug kernel. The KASAN instrumentation inserts a lot
> > of __asan_loadX*() and __kasan_check_read() function calls to memory
> > access portion of the code. The lockdep's __lock_acquire() function,
> > for instance, has 66 __asan_loadX*() and 6 __kasan_check_read() calls
> > added with KASAN instrumentation. Of course, the actual numbers may vary
> > depending on the compiler used and the exact version of the lockdep code.

For completeness-sake, we'd also have to compare with
CONFIG_KASAN_INLINE=y, which gets rid of the __asan_ calls (not the
explicit __kasan_ checks). But I leave it up to you - I'm aware it
results in slow-downs, too. ;-)

> > [...]
> >
> > As the small advantage of enabling KASAN instrumentation to catch
> > potential memory access error in the lockdep debugging tool is probably
> > not worth the drawback of further slowing down a debug kernel, disable
> > KASAN instrumentation in the lockdep code to allow the debug kernels
> > to regain some performance back, especially for the RT debug kernels.

It's not about catching a bug in the lockdep code, but rather guarding
against bugs in the code that allocated the storage for some
synchronization object. Since lockdep state is embedded in each
synchronization object, the lockdep checking code may be passed a
reference to garbage data, e.g. on a use-after-free (or even an
out-of-bounds access if there's an array of sync objects). In that
case, all bets are off and lockdep may produce random false reports.
Sure, the system is already in a bad state at that point, but it's
going to make debugging much harder.

Our approach has always been to ensure that as soon as an error state
is detected, it's reported as soon as we can, before it results in
random failures as execution continues (e.g. bad lock reports).

To guard against that, I would propose adding carefully placed
kasan_check_byte() calls in the lockdep code.
Re: [PATCH v3 3/3] locking/lockdep: Disable KASAN instrumentation of lockdep.c
Posted by Waiman Long 12 months ago
On 2/12/25 6:30 AM, Marco Elver wrote:
> On Wed, 12 Feb 2025 at 06:57, Boqun Feng <boqun.feng@gmail.com> wrote:
>> [Cc KASAN]
>>
>> A Reviewed-by or Acked-by from KASAN would be nice, thanks!
>>
>> Regards,
>> Boqun
>>
>> On Sun, Feb 09, 2025 at 11:26:12PM -0500, Waiman Long wrote:
>>> [...]
> For completeness-sake, we'd also have to compare with
> CONFIG_KASAN_INLINE=y, which gets rid of the __asan_ calls (not the
> explicit __kasan_ checks). But I leave it up to you - I'm aware it
> results in slow-downs, too. ;-)
I just realized that my config file for the non-RT debug kernel does
have CONFIG_KASAN_INLINE=y set, though the RT debug kernel does not.
For the non-RT debug kernel, the __asan_report_load* functions are
still being called because lockdep.c is very big (> 6k lines of code).
So "call_threshold := 10000" in scripts/Makefile.kasan is probably not
enough for lockdep.c.
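
For context, the threshold in question is a per-function memory-access
cutoff that generic KASAN passes to the compiler, roughly like this
(paraphrased from scripts/Makefile.kasan; exact spelling varies by
kernel version):

```make
# CONFIG_KASAN_INLINE uses a huge threshold so most functions get inline
# shadow checks; CONFIG_KASAN_OUTLINE uses 0 so every check becomes an
# outline __asan_*() call. Functions with very many memory accesses can
# exceed the INLINE threshold and fall back to outline calls anyway.
call_threshold := $(if $(CONFIG_KASAN_INLINE),10000,0)
CFLAGS_KASAN += $(call cc-param,asan-instrumentation-with-call-threshold=$(call_threshold))
```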
>
>>> [...]
> It's not about catching a bug in the lockdep code, but rather guard
> against bugs in code that allocated the storage for some
> synchronization object. Since lockdep state is embedded in each
> synchronization object, lockdep checking code may be passed a
> reference to garbage data, e.g. on use-after-free (or even
> out-of-bounds if there's an array of sync objects). In that case, all
> bets are off and lockdep may produce random false reports. Sure the
> system is already in a bad state at that point, but it's going to make
> debugging much harder.
>
> Our approach has always been to ensure that as soon as there's an
> error state detected it's reported as soon as we can, before it
> results in random failure as execution continues (e.g. bad lock
> reports).
>
> To guard against that, I would propose adding carefully placed
> kasan_check_byte() in lockdep code.

Will take a look at that.

Cheers,
Longman
Re: [PATCH v3 3/3] locking/lockdep: Disable KASAN instrumentation of lockdep.c
Posted by Waiman Long 12 months ago
On 2/12/25 11:57 AM, Waiman Long wrote:
> On 2/12/25 6:30 AM, Marco Elver wrote:
>> On Wed, 12 Feb 2025 at 06:57, Boqun Feng <boqun.feng@gmail.com> wrote:
>>> [Cc KASAN]
>>>
>>> A Reviewed-by or Acked-by from KASAN would be nice, thanks!
>>>
>>> Regards,
>>> Boqun
>>>
>>> On Sun, Feb 09, 2025 at 11:26:12PM -0500, Waiman Long wrote:
>>>> [...]
>> For completeness-sake, we'd also have to compare with
>> CONFIG_KASAN_INLINE=y, which gets rid of the __asan_ calls (not the
>> explicit __kasan_ checks). But I leave it up to you - I'm aware it
>> results in slow-downs, too. ;-)

That is not correct. Setting CONFIG_KASAN_INLINE=y does have an effect
in lockdep.c, reducing the number of __asan_* calls. I have posted the
v4 series with the updated test results. I have also added a new patch
to do KASAN checking in lock_acquire().

Cheers,
Longman