[PATCH] drm: omapdrm: reduce clang stack usage

Arnd Bergmann posted 1 patch 4 months ago
drivers/gpu/drm/omapdrm/dss/dispc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] drm: omapdrm: reduce clang stack usage
Posted by Arnd Bergmann 4 months ago
From: Arnd Bergmann <arnd@arndb.de>

The thread sanitizer makes the stack usage explode from extra variable
spills in dispc_runtime_resume:

drivers/gpu/drm/omapdrm/dss/dispc.c:4735:27: error: stack frame size (1824) exceeds limit (1280) in 'dispc_runtime_resume' [-Werror,-Wframe-larger-than]

I could not figure out what exactly is going on here, but I see that
whenever dispc_restore_context() is not inlined, that function
and its caller shrink below 900 bytes combined of stack usage.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/gpu/drm/omapdrm/dss/dispc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/omapdrm/dss/dispc.c b/drivers/gpu/drm/omapdrm/dss/dispc.c
index 533f70e8a4a6..cf055815077c 100644
--- a/drivers/gpu/drm/omapdrm/dss/dispc.c
+++ b/drivers/gpu/drm/omapdrm/dss/dispc.c
@@ -524,7 +524,7 @@ static void dispc_save_context(struct dispc_device *dispc)
 	DSSDBG("context saved\n");
 }
 
-static void dispc_restore_context(struct dispc_device *dispc)
+static noinline_for_stack void dispc_restore_context(struct dispc_device *dispc)
 {
 	int i, j;
 
-- 
2.39.5
Re: [PATCH] drm: omapdrm: reduce clang stack usage
Posted by Tomi Valkeinen 4 months ago
Hi,

On 10/06/2025 12:27, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> The thread sanitizer makes the stack usage explode from extra variable
> spills in dispc_runtime_resume:
> 
> drivers/gpu/drm/omapdrm/dss/dispc.c:4735:27: error: stack frame size (1824) exceeds limit (1280) in 'dispc_runtime_resume' [-Werror,-Wframe-larger-than]
> 
> I could not figure out what exactly is going on here, but I see that
> whenever dispc_restore_context() is not inlined, that function
> and its caller shrink below 900 bytes combined of stack usage.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/gpu/drm/omapdrm/dss/dispc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/omapdrm/dss/dispc.c b/drivers/gpu/drm/omapdrm/dss/dispc.c
> index 533f70e8a4a6..cf055815077c 100644
> --- a/drivers/gpu/drm/omapdrm/dss/dispc.c
> +++ b/drivers/gpu/drm/omapdrm/dss/dispc.c
> @@ -524,7 +524,7 @@ static void dispc_save_context(struct dispc_device *dispc)
>  	DSSDBG("context saved\n");
>  }
>  
> -static void dispc_restore_context(struct dispc_device *dispc)
> +static noinline_for_stack void dispc_restore_context(struct dispc_device *dispc)
>  {
>  	int i, j;
>  

While I don't think this causes any harm, but... What's going on here?
If I compile with gcc (x86 or arm), I see stack usage in few hundreds of
bytes. If I compile with LLVM=1, the stack usage jumps to over a thousand.

Is clang just broken? I don't see anything special with
dispc_restore_context() or dispc_runtime_resume(), so is this same thing
happening all around the kernel, and we need to sprinkle noinlines
everywhere?

Or do we get some extra debugging feature enabled only on clang with
allmodconfig, and that is eating the stack?

 Tomi
Re: [PATCH] drm: omapdrm: reduce clang stack usage
Posted by Arnd Bergmann 4 months ago
On Thu, Jun 12, 2025, at 09:58, Tomi Valkeinen wrote:
> On 10/06/2025 12:27, Arnd Bergmann wrote:
>>  
>> -static void dispc_restore_context(struct dispc_device *dispc)
>> +static noinline_for_stack void dispc_restore_context(struct dispc_device *dispc)
>>  {
>>  	int i, j;
>>  
>
> While I don't think this causes any harm, but... What's going on here?
> If I compile with gcc (x86 or arm), I see stack usage in few hundreds of
> bytes. If I compile with LLVM=1, the stack usage jumps to over a thousand.
>
> Is clang just broken? I don't see anything special with
> dispc_restore_context() or dispc_runtime_resume(), so is this same thing
> happening all around the kernel, and we need to sprinkle noinlines
> everywhere?
>
> Or do we get some extra debugging feature enabled only on clang with
> allmodconfig, and that is eating the stack?

There is no general answer here, but a combination of multiple
effects going on at the same time throughout the kernel, which lead
to clang observing excessive stack usage in some files when gcc
does not:

- both compilers have a number of corner cases where they run off
  and do something crazy for unusual input (usually crypto code),
  but since gcc has more users, most files that trigger only gcc
  already have workarounds in place, while the ones that trigger
  with clang are still missing them

- The inlining algorithm works the opposite way on clang vs gcc,
  while gcc always starts inlining leaf functions into their callers
  and does this recursively, clang starts with global functions
  and inlines its direct callees first. If you have deeply nested
  static functions that could all be inlined, both stop at some
  point, but the resulting object code looks completely different,
  and the stack usage is a symptom of this. I've added 'noinline'
  for some of the cases like this where I know both result in
  the same (harmless) stack usage through the call chain, but
  only clang warns about it.

- clang has previously had bugs where it tracks the lifetime of
  stack variables incorrectly, so multiple variables that
  should share the same stack slot won't. Some of these are
  fixed now, others are a result of the different inlining, and
  some others are likely still bugs we should fix in clang

- CONFIG_KMSAN disables some optimizations that are required
  for reducing stack usage, and at the moment this is only
  implemented in clang but not gcc.

- CONFIG_KASAN has some similar issues as KMSAN but is not
  quite as bad here.

- CONFIG_KASAN_STACK tends to use more stack with clang than gcc
  because of implementation choices around how hard it should
  try to detect array overflows. This could be changed by having
  clang make similar decisions to gcc here, but for now we just
  require using CONFIG_EXPERT=y to enable KASAN_STACK on clang.

I have managed to produce a testcase for this file that shows
how clang produces huge stack usage when gcc does not,
in this case it seems to be triggered by -fsanitize=kernel-address

https://godbolt.org/z/TT88zPYf6


      Arnd
Re: [PATCH] drm: omapdrm: reduce clang stack usage
Posted by Tomi Valkeinen 4 months ago
Hi,

On 12/06/2025 15:40, Arnd Bergmann wrote:
> On Thu, Jun 12, 2025, at 09:58, Tomi Valkeinen wrote:
>> On 10/06/2025 12:27, Arnd Bergmann wrote:
>>>  
>>> -static void dispc_restore_context(struct dispc_device *dispc)
>>> +static noinline_for_stack void dispc_restore_context(struct dispc_device *dispc)
>>>  {
>>>  	int i, j;
>>>  
>>
>> While I don't think this causes any harm, but... What's going on here?
>> If I compile with gcc (x86 or arm), I see stack usage in few hundreds of
>> bytes. If I compile with LLVM=1, the stack usage jumps to over a thousand.
>>
>> Is clang just broken? I don't see anything special with
>> dispc_restore_context() or dispc_runtime_resume(), so is this same thing
>> happening all around the kernel, and we need to sprinkle noinlines
>> everywhere?
>>
>> Or do we get some extra debugging feature enabled only on clang with
>> allmodconfig, and that is eating the stack?
> 
> There is no general answer here, but a combination of multiple
> effects going on at the same time throughout the kernel, which lead
> to clang observing excessive stack usage in some files when gcc
> does not:
> 
> - both compilers have a number of corner cases where they run off
>   and do something crazy for unusual input (usually crypto code),
>   but since gcc has more users, most files that trigger only gcc
>   already have workarounds in place, while the ones that trigger
>   with clang are still missing them
> 
> - The inlining algorithm works the opposite way on clang vs gcc,
>   while gcc always starts inlining leaf functions into their callers
>   and does this recursively, clang starts with global functions
>   and inlines its direct callees first. If you have deeply nested
>   static functions that could all be inlined, both stop at some
>   point, but the resulting object code looks completely different,
>   and the stack usage is a symptom of this. I've added 'noinline'
>   for some of the cases like this where I know both result in
>   the same (harmless) stack usage through the call chain, but
>   only clang warns about it.
> 
> - clang has previously had bugs where it tracks the lifetime of
>   stack variables incorrectly, so multiple variables that
>   should share the same stack slot won't. Some of these are
>   fixed now, others are a result of the different inlining, and
>   some others are likely still bugs we should fix in clang
> 
> - CONFIG_KMSAN disables some optimizations that are required
>   for reducing stack usage, and at the moment this is only
>   implemented in clang but not gcc.
> 
> - CONFIG_KASAN has some similar issues as KMSAN but is not
>   quite as bad here.
> 
> - CONFIG_KASAN_STACK tends to use more stack with clang than gcc
>   because of implementation choices around how hard it should
>   try to detect array overflows. This could be changed by having
>   clang make similar decisions to gcc here, but for now we just
>   require using CONFIG_EXPERT=y to enable KASAN_STACK on clang.
> 
> I have managed to produce a testcase for this file that shows
> how clang produces huge stack usage when gcc does not,
> in this case it seems to be triggered by -fsanitize=kernel-address
> 
> https://godbolt.org/z/TT88zPYf6

Interesting! And clang does fine if I change the DISPC_OVL_BASE() to

static u16 DISPC_OVL_BASE(enum omap_plane_id plane)
{
    static const u16 bases[] = {0x0080, 0x00BC, 0x014C, 0x0300, 0x0500};
    return bases[plane];
}

In any case, I'll apply this with a small comment.

 Tomi
Re: [PATCH] drm: omapdrm: reduce clang stack usage
Posted by Arnd Bergmann 4 months ago
On Thu, Jun 12, 2025, at 16:37, Tomi Valkeinen wrote:
> On 12/06/2025 15:40, Arnd Bergmann wrote:

> static u16 DISPC_OVL_BASE(enum omap_plane_id plane)
> {
>     static const u16 bases[] = {0x0080, 0x00BC, 0x014C, 0x0300, 0x0500};
>     return bases[plane];
> }
>
> In any case, I'll apply this with a small comment.

Thanks!

I've also opened an issue against llvm at
https://github.com/llvm/llvm-project/issues/143908

Hopefully we can pinpoint something in llvm that can be fixed
to improve similar issues in other code.

It does seem fairly obscure, as even changing the code
slightly seems to completely avoid the problem, e.g. if the
one of the loops grows too big to be unrolled.

    Arnd