Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
how VMA flags are declared, utilising an enum of VMA bit values and
ifdef-fery VM_xxx flag declarations via macro.
As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
the newly introduced VMA bit numbers.
However, use of this macro resulted in apparently unfortunate macro
expansion and a performance degradation. This appears to be due to the
(__force int) cast, which is required for the sparse typechecking to work.
Avoid macro expansion issues by simply using 1UL << bitnum.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
Andrew - note I've referenced the linux-next commit number above, could you
replace with the upstream commit hash once your PR is taken? Thanks!
include/linux/mm.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2f38fb68840..c4438b30c140 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -395,7 +395,8 @@ enum {
#undef DECLARE_VMA_BIT
#undef DECLARE_VMA_BIT_ALIAS
-#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
+#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
+
#define VM_READ INIT_VM_FLAG(READ)
#define VM_WRITE INIT_VM_FLAG(WRITE)
#define VM_EXEC INIT_VM_FLAG(EXEC)
--
2.52.0
On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
What the hell is __bitwise doing on these enum values?
Could we please get rid of that ridiculous cargo-culting?
Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
declared __bitwise?
On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
>
> > -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> > +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>
> What the hell is __bitwise doing on these enum values?
> Could we please get rid of that ridiculous cargo-culting?
>
> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
> declared __bitwise?
FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
then you get warned about arithmetics and conversions to integer
for those, with bitwise operations explicitly allowed.
VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
VM_READ + 14 is nonsense and should be warned about. That's where
__bitwise would make sense. On bit numbers it's not - what makes
VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
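To illustrate the distinction with made-up names (not actual kernel
definitions; __bitwise/__force as per linux/compiler_types.h) - a __bitwise
typedef on the *mask* type lets sparse allow | and & while flagging
arithmetic and implicit integer conversions:

typedef unsigned long __bitwise frob_flags_t;

#define FROB_READ ((__force frob_flags_t)(1UL << 0))
#define FROB_EXEC ((__force frob_flags_t)(1UL << 2))

static void frob_demo(void)
{
	frob_flags_t f = FROB_READ | FROB_EXEC;	/* bitwise op: allowed */
	frob_flags_t g = FROB_READ + 14;	/* arithmetic: sparse warns */
	unsigned long n = f;			/* implicit conversion: sparse warns */

	(void)g;
	(void)n;
}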
On 12/6/25 2:26 AM, Al Viro wrote:
> On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
>> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
>>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>>> how VMA flags are declared, utilising an enum of VMA bit values and
>>> ifdef-fery VM_xxx flag declarations via macro.
>>>
>>> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
>>> the newly introduced VMA bit numbers.
>>>
>>> However, use of this macro resulted in apparently unfortunate macro
>>> expansion and a performance degradation. This appears to be due to the
>>> (__force int) cast, which is required for the sparse typechecking to work.
>>
>>> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
>>> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>>
>> What the hell is __bitwise doing on these enum values?
>> Could we please get rid of that ridiculous cargo-culting?
>>
>> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
>> declared __bitwise?
I was confused by this too at first when reviewing, but instead of the angry
display above, I simply asked the author and got answers.
Comment says:
/**
* typedef vma_flag_t - specifies an individual VMA flag by bit number.
*
* This value is made type safe by sparse to avoid passing invalid flag values
* around.
*/
typedef int __bitwise vma_flag_t;
It's done as documented in Documentation/dev-tools/sparse.rst section
"Using sparse for typechecking".
So yeah, the keyword is __bitwise, and indeed we don't perform bitwise
operations on the VM_ values; in fact we don't perform any operations
without __force casting them back first, to catch when they are used by
mistake.
It's not cargo-culting - IIRC it caught a bug in an early version of the
patch itself.
I wouldn't mind if sparse provided a different keyword than __bitwise
for this use case to make it less misleading. Or even better if we could
make the compiler itself treat vma_flag_t as a "special int" that can't
be implicitly cast to a normal int, so we don't have to rely on sparse
checks to catch those.
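A minimal sketch of the pattern as applied here, with hypothetical names
rather than the real mm.h definitions:

typedef int __bitwise frob_bit_t;

#define FROB_READ_BIT	((__force frob_bit_t)0)
#define FROB_WRITE_BIT	((__force frob_bit_t)1)

static inline void frob_set_bit(frob_bit_t bit, unsigned long *flags)
{
	/* the one sanctioned spot where we __force back to a plain int */
	*flags |= 1UL << (__force int)bit;
}

/* frob_set_bit(3, &f) - or passing a VM_xxx-style mask - warns under
 * sparse; frob_set_bit(FROB_READ_BIT, &f) is accepted. */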
> FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
> then you get warned about arithmetics and conversions to integer
> for those, with bitwise operations explicitly allowed.
>
> VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
> VM_READ + 14 is nonsense and should be warned about. That's where
> __bitwise would make sense. On bit numbers it's not - what makes
> VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
On Sat, Dec 06, 2025 at 01:35:51PM +0100, Vlastimil Babka wrote:
> On 12/6/25 2:26 AM, Al Viro wrote:
> > On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
> >> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> >>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> >>> how VMA flags are declared, utilising an enum of VMA bit values and
> >>> ifdef-fery VM_xxx flag declarations via macro.
> >>>
> >>> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> >>> the newly introduced VMA bit numbers.
> >>>
> >>> However, use of this macro resulted in apparently unfortunate macro
> >>> expansion and a performance degradation. This appears to be due to the
> >>> (__force int) cast, which is required for the sparse typechecking to work.
> >>
> >>> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> >>> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> >>
> >> What the hell is __bitwise doing on these enum values?
> >> Could we please get rid of that ridiculous cargo-culting?
> >>
> >> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
> >> declared __bitwise?
>
> I was confused by this too at first when reviewing, but instead of the angry
> display above, I simply asked the author and got answers.
>
> Comment says:
>
> /**
> * typedef vma_flag_t - specifies an individual VMA flag by bit number.
> *
> * This value is made type safe by sparse to avoid passing invalid flag values
> * around.
> */
> typedef int __bitwise vma_flag_t;
>
> It's done as documented in Documentation/dev-tools/sparse.rst section
> "Using sparse for typechecking".
>
> So yeah, the keyword is __bitwise, and indeed we don't perform bitwise
> operations on the VM_ values; in fact we don't perform any operations
> without __force casting them back first, to catch when they are used by
> mistake.
> It's not cargo-culting - IIRC it caught a bug in an early version of the
> patch itself.
>
> I wouldn't mind if sparse provided a different keyword than __bitwise
> for this use case to make it less misleading. Or even better if we could
> make the compiler itself treat vma_flag_t as a "special int" that can't
> be implicitly cast to a normal int, so we don't have to rely on sparse
> checks to catch those.
>
Yup precisely - this was entirely to avoid issues with passing a VM_xxx flag
around when a VMA bit number is required, which is a kind of bug that would
be _really easy_ to introduce otherwise.
vma_flags_set(..., VM_READ);
Reads perfectly but sets the write bit instead of the read bit, for instance.
Yes this is a hack, but does the job, and the sparse documentation doesn't
dissuade.
I agree with Vlasta that really we should provide an __explicit_type or
whatever annotation for this usage.
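Spelled out with the real bit values (VM_READ is bit 0, VM_WRITE is bit 1):
VM_READ == 1UL << 0 == 0x1, so passing VM_READ where a bit *number* is
expected silently means "bit 1", i.e. the write bit.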
>
> > FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
> > then you get warned about arithmetics and conversions to integer
> > for those, with bitwise operations explicitly allowed.
> >
> > VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
> > VM_READ + 14 is nonsense and should be warned about. That's where
> > __bitwise would make sense. On bit numbers it's not - what makes
> > VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
>
The issue isn't so much the operations - and yes, obviously VMA_MAYREAD_BIT ^
VMA_SHARED_BIT makes no sense, but nobody in their right mind would be
doing that anyway, right?
I'm not using sparse attributes here to enforce basic baseline sanity, but
rather to avoid the aforementioned class of bug, and it works very
effectively.
I did speak to Vlasta about a struct foo { int val; }; type thing, but sadly
then we can't have the values in an enum. We put them in an enum because
otherwise tooling like drgn, rust, etc. find it harder to get access to the
type, and it is in fact a useful way of defining these values, as they
naturally do belong to an enum (unique, individual values).
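For reference, a sketch of that struct-based alternative with hypothetical
names - the compiler itself then rejects mixups, at the cost of the values
no longer being enum constants:

typedef struct { int bit; } frob_bitnum_t;

#define FROB_READ_BIT ((frob_bitnum_t){ .bit = 0 })

static inline unsigned long frob_mask(frob_bitnum_t b)
{
	return 1UL << b.bit;
}

/* frob_mask(FROB_READ_BIT) compiles; frob_mask(0) or frob_mask(VM_READ)
 * are hard compile errors, no sparse required. */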
Cheers, Lorenzo
On Fri, 5 Dec 2025 17:50:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>
> ...
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
That's in mm-stable so the hash shouldn't be changing.
I'm not really sure what's the best way to determine this. I use
hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
mm-everything-2025-11-29-19-43
mm-everything-2025-12-01-19-02
mm-everything-2025-12-03-23-49
mm-everything-2025-12-05-00-55
mm-stable-2025-12-03-21-26
Hi Andrew,

On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'm not really sure what's the best way to determine this. I use
>
> hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> mm-everything-2025-11-29-19-43
> mm-everything-2025-12-01-19-02
> mm-everything-2025-12-03-23-49
> mm-everything-2025-12-05-00-55
> mm-stable-2025-12-03-21-26

What does "git branch --contains 2b6a3f061f11" say in your tree?

In my linux-next tree it says (I need the -r to check remote branches):

$ git branch -r --contains 2b6a3f061f11
  mm-stable/mm-stable
  mm-unstable/mm-unstable

but I don't export my remotes to my published tree.

--
Cheers,
Stephen Rothwell
On Sat, 6 Dec 2025 11:40:34 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:

> Hi Andrew,
>
> On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > I'm not really sure what's the best way to determine this. I use
> >
> > hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> > mm-everything-2025-11-29-19-43
> > mm-everything-2025-12-01-19-02
> > mm-everything-2025-12-03-23-49
> > mm-everything-2025-12-05-00-55
> > mm-stable-2025-12-03-21-26
>
> What does "git branch --contains 2b6a3f061f11" say in your tree?

hp2:/usr/src/mm> git branch --contains 2b6a3f061f11
* linus
  mm-everything
  mm-new
  mm-stable
  mm-unstable
hp2:/usr/src/mm> git branch -r --contains 2b6a3f061f11
  linus/master
  origin/mm-everything
  origin/mm-new
  origin/mm-stable
  origin/mm-unstable

kinda random, but it tells me "that's in mm-stable", which is what counts.

> In my linux-next tree it says (I need the -r to check remote branches):
>
> $ git branch -r --contains 2b6a3f061f11
>   mm-stable/mm-stable
>   mm-unstable/mm-unstable
>
> but I don't export my remotes to my published tree.
On Fri, Dec 05, 2025 at 07:12:55PM -0800, Andrew Morton wrote:
> On Sat, 6 Dec 2025 11:40:34 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > Hi Andrew,
> >
> > On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > I'm not really sure what's the best way to determine this. I use
> > >
> > > hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> > > mm-everything-2025-11-29-19-43
> > > mm-everything-2025-12-01-19-02
> > > mm-everything-2025-12-03-23-49
> > > mm-everything-2025-12-05-00-55
> > > mm-stable-2025-12-03-21-26
> >
> > What does "git branch --contains 2b6a3f061f11" say in your tree?
>
> hp2:/usr/src/mm> git branch --contains 2b6a3f061f11
> * linus
>   mm-everything
>   mm-new
>   mm-stable
>   mm-unstable
> hp2:/usr/src/mm> git branch -r --contains 2b6a3f061f11
>   linus/master
>   origin/mm-everything
>   origin/mm-new
>   origin/mm-stable
>   origin/mm-unstable
>
> kinda random, but it tells me "that's in mm-stable", which is what counts.
>
> > In my linux-next tree it says (I need the -r to check remote branches):
> >
> > $ git branch -r --contains 2b6a3f061f11
> >   mm-stable/mm-stable
> >   mm-unstable/mm-unstable
> >
> > but I don't export my remotes to my published tree.

Thanks guys for confirming, this is therefore less work all round :)

Though Andrew do please add the Fixes tag that I foolishly forgot to
include, thanks!

Cheers, Lorenzo
On 12/5/25 21:15, Andrew Morton wrote:
> On Fri, 5 Dec 2025 17:50:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>>
>> ...
>>
>> Andrew - note I've referenced the linux-next commit number above, could you
>> replace with the upstream commit hash once your PR is taken? Thanks!
>
> That's in mm-stable so the hash shouldn't be changing.
>
>
> I'm not really sure what's the best way to determine this. I use
>
> hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> mm-everything-2025-11-29-19-43
> mm-everything-2025-12-01-19-02
> mm-everything-2025-12-03-23-49
> mm-everything-2025-12-05-00-55
> mm-stable-2025-12-03-21-26
I asked myself the same question a couple of times.
Maybe this?
$ git branch -r --contains 2b6a3f061f11
mm/mm-everything
mm/mm-new
mm/mm-stable
--
Cheers
David
On 12/5/25 9:50 AM, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
OK, so now maybe we don't need all of the rust/bindings/bindings_helper.h
changes? Those were because Rust's bindgen doesn't properly handle
nested macros, as I recall.
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
>
thanks,
--
John Hubbard
On Fri, Dec 05, 2025 at 11:56:32AM -0800, John Hubbard wrote:
> On 12/5/25 9:50 AM, Lorenzo Stoakes wrote:
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Avoid macro expansion issues by simply using 1UL << bitnum.
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >
> > Andrew - note I've referenced the linux-next commit number above, could you
> > replace with the upstream commit hash once your PR is taken? Thanks!
> >
> > include/linux/mm.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a2f38fb68840..c4438b30c140 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -395,7 +395,8 @@ enum {
> > #undef DECLARE_VMA_BIT
> > #undef DECLARE_VMA_BIT_ALIAS
> >
> > -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> > +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>
> OK, so now maybe we don't need all of the rust/bindings/bindings_helper.h
> changes? Those were because Rust's bindgen doesn't properly handle
> nested macros, as I recall.
Ah, seems you're right - I just tried a clang/rust build with those dropped
and it worked fine.
I've added a TODO to send a patch to remove those after this one lands,
thanks!
Cheers, Lorenzo
>
> > +
> > #define VM_READ INIT_VM_FLAG(READ)
> > #define VM_WRITE INIT_VM_FLAG(WRITE)
> > #define VM_EXEC INIT_VM_FLAG(EXEC)
> > --
> > 2.52.0
> >
>
> thanks,
> --
> John Hubbard
>
On Fri, 5 Dec 2025 17:50:37 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
Does sparse complain if you just add 0? As in:
#define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
That should change the type without affecting what BIT() expands to.
David
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
>
On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 17:50:37 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
>
> Does sparse complain if you just add 0? As in:
> #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
>
> That should change the type without affecting what BIT() expands to.
Thanks, checked that and unfortunately that doesn't satisfy sparse :)
I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
that this is an issue.
<insert rant about C macros here>
Cheers, Lorenzo
On Fri, 5 Dec 2025 19:18:56 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 17:50:37 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > ifdef-fery VM_xxx flag declarations via macro.
> > >
> > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > the newly introduced VMA bit numbers.
> > >
> > > However, use of this macro resulted in apparently unfortunate macro
> > > expansion and a performance degradation. This appears to be due to the
> > > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Does sparse complain if you just add 0? As in:
> > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> >
> > That should change the type without affecting what BIT() expands to.
>
> Thanks, checked that and unfortunately that doesn't satisfy sparse :)
Oh - it is that __bitwise that causes grief.
> I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> that this is an issue.
Did you try getting DECLARE_VMA_BIT to define both the bit number and the
bit flag and put them both into the anonymous enum?
Or are there other reasons for not doing that?
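Something along these lines, say (a hypothetical sketch; fine for low bit
numbers at least, since enumerator values must fit in an int):

#define DECLARE_VMA_BIT(name, nr)	\
	VMA_ ## name ## _BIT = (nr),	\
	VM_ ## name = 1UL << (nr)

enum {
	DECLARE_VMA_BIT(READ, 0),
	DECLARE_VMA_BIT(WRITE, 1),
};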
>
> <insert rant about C macros here>
Add rant about the compiler thinking anon enums are doing anything other
than defining constants.
David
>
> Cheers, Lorenzo
On Fri, Dec 05, 2025 at 09:49:40PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 19:18:56 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > ifdef-fery VM_xxx flag declarations via macro.
> > > >
> > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > the newly introduced VMA bit numbers.
> > > >
> > > > However, use of this macro resulted in apparently unfortunate macro
> > > > expansion and a performance degradation. This appears to be due to the
> > > > (__force int) cast, which is required for the sparse typechecking to work.
> > >
> > > Does sparse complain if you just add 0? As in:
> > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > >
> > > That should change the type without affecting what BIT() expands to.
> >
> > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>
> Oh - it is that __bitwise that causes grief.
Well, if a sparse build is not enabled this tag is just removed (as is __force).
>
> > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > that this is an issue.
>
> Did you try getting DECLARE_VMA_BIT to define both the bit number and the
> bit flag and put them both into the anonymous enum?
> Or are there other reasons for not doing that?
I did, and we can't do that because it results in errors like 'enum constant
in boolean context [-Werror=int-in-bool-context]', as clearly VM_xxx flags
are used in many different contexts in the kernel, many of which seem
incompatible with enum constants (even though... they should be equivalent,
at least in theory?)
>
> >
> > <insert rant about C macros here>
>
> Add rant about the compiler thinking anon enums are doing anything other
> than defining constants.
Right yes :)
I put these in an enum in part to make life easier for tools like drgn to be
able to find these values (the guys asked about this explicitly). But also it
makes sense for them to be in an enum!
Really the VM_xxx flags are at least in theory a temporary hack until everything
can use bit numbers... assuming I can find a way to do that without causing
performance regressions :)
The perf regression here was rather unexpected, however, I must say!
>
> David
>
> >
> > Cheers, Lorenzo
>
Cheers, Lorenzo
On Fri, 5 Dec 2025 19:18:56 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 17:50:37 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > ifdef-fery VM_xxx flag declarations via macro.
> > >
> > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > the newly introduced VMA bit numbers.
> > >
> > > However, use of this macro resulted in apparently unfortunate macro
> > > expansion and a performance degradation. This appears to be due to the
> > > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Does sparse complain if you just add 0? As in:
> > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> >
> > That should change the type without affecting what BIT() expands to.
>
> Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>
> I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> that this is an issue.
I might use some of my copious spare time (ha) to see why BIT() fails.
I bet it is just too complex for its own good.
Personally I'm fine with both explicit (1ul << n) and hex constants.
The latter are definitely most useful if you ever look at hexdumps.
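For reference, BIT() itself is nominally trivial - per include/vdso/bits.h
it is (going through the UL()/_AC() token-pasting layers in the const.h
headers):

#define BIT(nr) (UL(1) << (nr))
/* UL(1) expands via _UL()/_AC()/__AC() token pasting to (1UL), so the
 * result ought to be equivalent to a plain 1UL << (nr) */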
At the moment I'm trying to fix bitfield.h so you don't get compile errors
on lines that are 18KB long.
Found a new version in linux-next - has its own set of new bugs as well
as more of the old ones.
David
>
> <insert rant about C macros here>
>
> Cheers, Lorenzo
On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 19:18:56 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > ifdef-fery VM_xxx flag declarations via macro.
> > > >
> > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > the newly introduced VMA bit numbers.
> > > >
> > > > However, use of this macro resulted in apparently unfortunate macro
> > > > expansion and a performance degradation. This appears to be due to the
> > > > (__force int) cast, which is required for the sparse typechecking to work.
> > >
> > > Does sparse complain if you just add 0? As in:
> > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > >
> > > That should change the type without affecting what BIT() expands to.
> >
> > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> >
> > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > that this is an issue.
>
> I might use some of my copious spare time (ha) to see why BIT() fails.
> I bet it is just too complex for its own good.
> Personally I'm fine with both explicit (1ul << n) and hex constants.
> The latter are definitely most useful if you ever look at hexdumps.
Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
to have the answer and wanted to get it fixed, but obviously am quite curious as
to what on earth is causing that.
>
> At the moment I'm trying to fix bitfield.h so you don't get compile errors
> on lines that are 18KB long.
:)
>
> Found a new version in linux-next - has its own set of new bugs as well
> as more of the old ones.
>
> David
>
> >
> > <insert rant about C macros here>
> >
> > Cheers, Lorenzo
>
Cheers, Lorenzo
On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
> On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 19:18:56 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > > ifdef-fery VM_xxx flag declarations via macro.
> > > > >
> > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > > the newly introduced VMA bit numbers.
> > > > >
> > > > > However, use of this macro resulted in apparently unfortunate macro
> > > > > expansion and a performance degradation. This appears to be due to the
> > > > > (__force int) cast, which is required for the sparse typechecking to work.
> > > >
> > > > Does sparse complain if you just add 0? As in:
> > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > > >
> > > > That should change the type without affecting what BIT() expands to.
> > >
> > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> > >
> > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > > that this is an issue.
> >
> > I might use some of my copious spare time (ha) to see why BIT() fails.
> > I bet it is just too complex for its own good.
> > Personally I'm fine with both explicit (1ul << n) and hex constants.
> > The latter are definitely most useful if you ever look at hexdumps.
>
> Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
> to have the answer and wanted to get it fixed, but obviously am quite curious as
> to what on earth is causing that.
I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
Maybe I should fiddle with that and see...
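For reference, that macro translates a flag from one bit position to another
by multiplying or dividing by the ratio of the two bits - roughly, per
include/linux/mman.h:

#define _calc_vm_trans(x, bit1, bit2) \
  ((!(bit1) || !(bit2)) ? 0 : \
  ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
   : ((x) & (bit1)) / ((bit1) / (bit2))))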
Cheers, Lorenzo
On 12/8/25 17:42, Lorenzo Stoakes wrote:
> On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
>> On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
>> > On Fri, 5 Dec 2025 19:18:56 +0000
>> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>> >
>> > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
>> > > > On Fri, 5 Dec 2025 17:50:37 +0000
>> > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>> > > >
>> > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>> > > > > how VMA flags are declared, utilising an enum of VMA bit values and
>> > > > > ifdef-fery VM_xxx flag declarations via macro.
>> > > > >
>> > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
>> > > > > the newly introduced VMA bit numbers.
>> > > > >
>> > > > > However, use of this macro resulted in apparently unfortunate macro
>> > > > > expansion and a performance degradation. This appears to be due to
>> > > > > the (__force int) cast, which is required for the sparse
>> > > > > typechecking to work.
>> > > >
>> > > > Does sparse complain if you just add 0? As in:
>> > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
>> > > >
>> > > > That should change the type without affecting what BIT() expands to.
>> > >
>> > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>> > >
>> > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
>> > > that this is an issue.
>> >
>> > I might use some of my copious spare time (ha) to see why BIT() fails.
>> > I bet it is just too complex for its own good.
>> > Personally I'm fine with both explicit (1ul << n) and hex constants.
>> > The latter are definitely most useful if you ever look at hexdumps.
>>
>> Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
>> to have the answer and wanted to get it fixed, but obviously am quite curious as
>> to what on earth is causing that.
>
> I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
It's unlikely that this affects anything in what the benchmark stresses.
As Mateusz pointed out off-list, the profiles look like mutexes are doing
less optimistic spinning and more sleeping. Which IMHO isn't something that
this change can directly affect.
My own bloat-o-meter test before/after the fix suggests no change in code
generation (as I would indeed expect), except for some weird symbols that
don't look related at all:
> ./scripts/bloat-o-meter vmlinux.o vmlinux.o.after
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-277 (-277)
Function old new delta
print_fmt_dax_pte_fault_class 1167 1160 -7
print_fmt_dax_pmd_load_hole_class 308 301 -7
print_fmt_dax_pmd_fault_class 1252 1245 -7
saved_rsp 3117296 3117232 -64
saved_rdi 3117312 3117248 -64
saved_rbx 3117304 3117240 -64
saved_rbp 3117328 3117264 -64
Total: Before=3350750464, After=3350750187, chg -0.00%
I don't know what happened to those functions above. Could be just
insufficient build reproducibility?
Maybe the effect is just that something slightly shifts in the cpu caches?
On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> As Mateusz pointed out off-list, the profiles look like mutexes are doing
> less optimistic spinning and more sleeping. Which IMHO isn't something that
> this change can directly affect.
>
Not mutexes but rwsems.
The bench at hand has some of the code spinlocked, other parts take
rwsems for reading *or* writing.
I had a peek at rwsem implementation and to my understanding it can
degrade to no spinning in a microbenchmark setting like this one,
provided you are unlucky enough.
In particular you can get unlucky if existing timings get perturbed,
which I presume is happening after Lorenzo's patch.
To demonstrate I wrote a toy patch which conditionally converts affected
down_read calls into down_write (inlined at the end).
While the original report is based on a 192-thread box, I was only able
to test with 80 threads. Even so, the crux of the issue was nicely
reproduced.
./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
Top says (times vary, idle is growing over time):
%Cpu(s): 3.3 us, 24.4 sy, 0.0 ni, 72.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
... but if I flip the switch to down_write:
%Cpu(s): 6.3 us, 80.9 sy, 0.0 ni, 12.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).
In terms of performance I see the following:
stress-ng: metrc: [2546] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
# sysctl fs.magic_tunable=0 ## down_read
stress-ng: metrc: [2546] msg 63353488 10.01 28.21 213.26 6331298.95 262362.91 30.16 2016
# sysctl fs.magic_tunable=1 ## down_write
stress-ng: metrc: [2036] msg 455014809 10.00 48.79 676.42 45496870.65 627425.68 90.64 2056
That is to say rwsem code is the real culprit and Lorenzo is a random
(albeit deserving) victim.
I see two action items:
- massage the patch back to a state where things compile to the same asm
as before as it clearly avoidably regressed regardless of the
aforementioned issue
- figure out what to do with rwsem code for read vs write spinning
I'm not picking this up for the time being, but I might look at this at
some point.
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+unsigned long magic_tunable;
+
static const struct ctl_table fs_stat_sysctls[] = {
{
.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
.extra1 = SYSCTL_LONG_ZERO,
.extra2 = SYSCTL_LONG_MAX,
},
+ {
+ .procname = "magic_tunable",
+ .data = &magic_tunable,
+ .maxlen = sizeof(magic_tunable),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra1 = SYSCTL_LONG_ZERO,
+ .extra2 = SYSCTL_LONG_MAX,
+ },
+
{
.procname = "nr_open",
.data = &sysctl_nr_open,
diff --git a/ipc/msg.c b/ipc/msg.c
index ee6af4fe52bf..fa835ea53e09 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
return err;
}
+extern unsigned long magic_tunable;
+
static int msgctl_info(struct ipc_namespace *ns, int msqid,
int cmd, struct msginfo *msginfo)
{
@@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
msginfo->msgmnb = ns->msg_ctlmnb;
msginfo->msgssz = MSGSSZ;
msginfo->msgseg = MSGSEG;
- down_read(&msg_ids(ns).rwsem);
- if (cmd == MSG_INFO)
- msginfo->msgpool = msg_ids(ns).in_use;
- max_idx = ipc_get_maxidx(&msg_ids(ns));
- up_read(&msg_ids(ns).rwsem);
+ if (!READ_ONCE(magic_tunable)) {
+ down_read(&msg_ids(ns).rwsem);
+ if (cmd == MSG_INFO)
+ msginfo->msgpool = msg_ids(ns).in_use;
+ max_idx = ipc_get_maxidx(&msg_ids(ns));
+ up_read(&msg_ids(ns).rwsem);
+ } else {
+ down_write(&msg_ids(ns).rwsem);
+ if (cmd == MSG_INFO)
+ msginfo->msgpool = msg_ids(ns).in_use;
+ max_idx = ipc_get_maxidx(&msg_ids(ns));
+ up_write(&msg_ids(ns).rwsem);
+ }
if (cmd == MSG_INFO) {
msginfo->msgmap = min_t(int,
percpu_counter_sum(&ns->percpu_msg_hdrs),
diff --git a/ipc/util.c b/ipc/util.c
index cae60f11d9c2..c65c8289a54b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -771,6 +771,7 @@ struct ipc_proc_iter {
struct ipc_namespace *ns;
struct pid_namespace *pid_ns;
struct ipc_proc_iface *iface;
+ bool writelocked;
};
struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
@@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
}
+extern unsigned long magic_tunable;
+
/*
* File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
* SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
@@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
* Take the lock - this will be released by the corresponding
* call to stop().
*/
- down_read(&ids->rwsem);
+ if (!READ_ONCE(magic_tunable)) {
+ down_read(&ids->rwsem);
+ iter->writelocked = false;
+ } else {
+ down_write(&ids->rwsem);
+ iter->writelocked = true;
+ }
/* pos < 0 is invalid */
if (*pos < 0)
@@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
ids = &iter->ns->ids[iface->ids];
/* Release the lock we took in start() */
- up_read(&ids->rwsem);
+ if (!iter->writelocked)
+ up_read(&ids->rwsem);
+ else
+ up_write(&ids->rwsem);
}
static int sysvipc_proc_show(struct seq_file *s, void *it)
So I had a look where the timing difference is coming from and I think I
have the answer: init_ipc_ns does not have a guaranteed cacheline placement
and things get moved around with the patch.

On my kernels (nm vmlinux-newbits | sort -nk 1 | less)

before:
ffffffff839ffb60 T init_ipc_ns
ffffffff83a00020 t event_exit__msgrcv

after:
ffffffff839ffbc0 T init_ipc_ns
ffffffff83a00080 t event_exit__msgrcv

This is the pervasive problem of vars from all .o files placed adjacent to
each other, meaning changes in one .o file result in offsets changing in
other files and then you get performance fluctuations as
not-explicitly-padded variables share (or no longer share) cachelines.

I brought this up a year ago elsewhere:
https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html

maybe i should pick it up again and see it through

as for the thing at hand, someone(tm) will want to make sure the namespace
is cacheline aligned and possibly pad its own internals afterwards.
Personally I can't be bothered.
On Fri, Dec 12, 2025 at 01:24:57PM +0100, Mateusz Guzik wrote:
> So I had a look where the timing difference is coming from and I think
> I have the answer: init_ipc_ns does not have a guaranteed cacheline
> placement and things get moved around with the patch.
>
> On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
>
> before:
> ffffffff839ffb60 T init_ipc_ns
> ffffffff83a00020 t event_exit__msgrcv
>
> after:
> ffffffff839ffbc0 T init_ipc_ns
> ffffffff83a00080 t event_exit__msgrcv
>
> This is the pervasive problem of vars from all .o files placed
> adjacent to each other, meaning changes in one .o file result in
> offsets changing in other files and then you get performance
> fluctuations as not-explicitly-padded variables share (or no longer
> share) cachelines.
>
> I brought this up a year ago elsewhere:
> https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html
>
> maybe i should pick it up again and see it through
>
> as for the thing at hand, someone(tm) will want to make sure the
> namespace is cacheline aligned and possibly pad its own internals
> afterwards. Personally I can't be bothered.
Thanks! Looking at it, it seems we accumulate a bunch of offsets in:
print_fmt_dax_pte_fault_class
print_fmt_dax_pmd_load_hole_class
print_fmt_dax_pmd_fault_class
(entries that the bloat-o-meter confirms changed in size)
That continue on to eventually offset init_ipc_ns.
It actually looked in my testing like performance _improved_ with the change,
and I notice locally I end up aligned on the struct:
ffffffff82be16a0 T init_ipc_ns
->
ffffffff82be1700 T init_ipc_ns
Examining stress-ng before 2b6a3f061f11:
Command is:
$ stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --msg $(nproc)
Results:
stress-ng: metrc: [1662] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [776] msg 964459758 60.00 154.63 632.77 16073484.12 1224879.38 21.17 2132
After 2b6a3f061f11:
stress-ng: metrc: [782] msg 1326214608 60.00 194.19 713.34 22102974.72 1461348.11 24.40 2140
And if I simply do:
ipc/msgutil.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 7a03f6d03de3..df0d7a067bcf 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
* compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
* and not CONFIG_IPC_NS.
*/
-struct ipc_namespace init_ipc_ns = {
+struct ipc_namespace init_ipc_ns ____cacheline_aligned_in_smp = {
.ns.__ns_ref = REFCOUNT_INIT(1),
.user_ns = &init_user_ns,
.ns.inum = ns_init_inum(&init_ipc_ns),
--
2.52.0
We observe:
stress-ng: metrc: [764] msg 1321700290 60.00 196.73 723.82 22028700.93 1435779.72 24.75 2116 aligned
So this really _does_ look like an alignment issue.
So I think I should just submit the above patch, right? Can you see how it behaves for you?
Cheers, Lorenzo
On Fri, Dec 12, 2025 at 4:03 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Dec 12, 2025 at 01:24:57PM +0100, Mateusz Guzik wrote:
> > So I had a look where the timing difference is coming from and I think
> > I have the answer: init_ipc_ns does not have a guaranteed cacheline
> > placement and things get moved around with the patch.
> >
> > On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
> >
> > before:
> > ffffffff839ffb60 T init_ipc_ns
> > ffffffff83a00020 t event_exit__msgrcv
> >
> > after:
> > ffffffff839ffbc0 T init_ipc_ns
> > ffffffff83a00080 t event_exit__msgrcv
> >
> > This is the pervasive problem of vars from all .o files placed
> > adjacent to each other, meaning changes in one .o file result in
> > offsets changing in other files and then you get performance
> > fluctuations as not-explicitly-padded variables share (or no longer
> > share) cachelines.
> >
> > I brought this up a year ago elsewhere:
> > https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html
> >
> > maybe i should pick it up again and see it through
> >
> > as for the thing at hand, someone(tm) will want to make sure the
> > namespace is cacheline aligned and possibly pad its own internals
> > afterwards. Personally I can't be bothered.
>
> Thanks! Looking it seems we accumulate a bunch of offsets in:
>
> print_fmt_dax_pte_fault_class
> print_fmt_dax_pmd_load_hole_class
> print_fmt_dax_pmd_fault_class
>
> (entries that the bloat-o-meter confirms changed in size)
>
> That continue on to eventually offset init_ipc_ns.
>
> It actually looked in my testing like performance _improved_ with the change,
> and I notice locally I end up aligned on the struct:
>
> ffffffff82be16a0 T init_ipc_ns
>
> ->
>
> ffffffff82be1700 T init_ipc_ns
>
> Examining stress-ng before 2b6a3f061f11:
>
> Command is:
> $ stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --msg $(nproc)
>
> Results:
>
> stress-ng: metrc: [1662] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
>
> stress-ng: metrc: [776] msg 964459758 60.00 154.63 632.77 16073484.12 1224879.38 21.17 2132
>
> After 2b6a3f061f11:
>
> stress-ng: metrc: [782] msg 1326214608 60.00 194.19 713.34 22102974.72 1461348.11 24.40 2140
>
> And if I simply do:
>
> ipc/msgutil.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ipc/msgutil.c b/ipc/msgutil.c
> index 7a03f6d03de3..df0d7a067bcf 100644
> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
> @@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
> * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
> * and not CONFIG_IPC_NS.
> */
> -struct ipc_namespace init_ipc_ns = {
> +struct ipc_namespace init_ipc_ns ____cacheline_aligned_in_smp = {
> .ns.__ns_ref = REFCOUNT_INIT(1),
> .user_ns = &init_user_ns,
> .ns.inum = ns_init_inum(&init_ipc_ns),
> --
> 2.52.0
>
> We observe:
>
> stress-ng: metrc: [764] msg 1321700290 60.00 196.73 723.82 22028700.93 1435779.72 24.75 2116 aligned
>
> So this really _does_ look like an alignment issue.
>
> So I think I should just submit the above patch right? Can you see how it behaves for you?
It is my recommendation to drop the matter now that we know the vm
bits patch is a victim of circumstance.
While slapping an alignment requirement on the var is the usual fix
and is a part of a full solution, the situation here is of the fucky
sort and stopping at precisely that might happen to be the bad call.
I state again there is contention on a spin lock and contention on a
rwsem separately, and most importantly that there is a level of load
on the rwsem which triggers pessimal behavior during which it abandons
spinning.
If the cacheline bouncing affects areas used by either of these locks,
you don't know if you got a win because you in fact *slowed it down*
and avoided the pessimal rwsem condition. Then if someone(tm) fixed up
rwsems, the ipc code would be leaving performance on the table with an
annotation specifically putting it in that spot.
So I would refrain from adding it as is.
If I was to work on this (which I wont), the first step would be to
stabilize the rwsem problem. Afterwards annotate the init_ipc_ns var
to stabilize the placement, afterwards rearrange the fields + add
padding as needed to reduce the bouncing. But this can't be done with
rwsem behavior at play.
On Fri, 12 Dec 2025 13:24:57 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> So I had a look where the timing difference is coming from and I think I
> have the answer: init_ipc_ns does not have a guaranteed cacheline placement
> and things get moved around with the patch.
>
> On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
>
> before:
> ffffffff839ffb60 T init_ipc_ns
> ffffffff83a00020 t event_exit__msgrcv
>
> after:
> ffffffff839ffbc0 T init_ipc_ns
> ffffffff83a00080 t event_exit__msgrcv
>
> This is the pervasive problem of vars from all .o files placed adjacent to
> each other, meaning changes in one .o file result in offsets changing in
> other files and then you get performance fluctuations as
> not-explicitly-padded variables share (or no longer share) cachelines.

Those look like text symbols, not data ones.
But moving code about can make the same sort of changes.

> I brought this up a year ago elsewhere:
> https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html

My guess is that all the extra padding increases the cache footprint of
the code and that causes other cache lines to be displaced and then
needing to be re-read from memory.
So while a specific benchmark may improve, overall system performance
goes down.
Excessive loop unrolling has the same effect.

	David

> maybe i should pick it up again and see it through
>
> as for the thing at hand, someone(tm) will want to make sure the namespace
> is cacheline aligned and possibly pad its own internals afterwards.
> Personally I can't be bothered.
On Fri, Dec 12, 2025 at 2:02 PM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Fri, 12 Dec 2025 13:24:57 +0100
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> > So I had a look where the timing difference is coming from and I think I
> > have the answer: init_ipc_ns does not have a guaranteed cacheline placement
> > and things get moved around with the patch.
> >
> > On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
> >
> > before:
> > ffffffff839ffb60 T init_ipc_ns
> > ffffffff83a00020 t event_exit__msgrcv
> >
> > after:
> > ffffffff839ffbc0 T init_ipc_ns
> > ffffffff83a00080 t event_exit__msgrcv
> >
> > This is the pervasive problem of vars from all .o files placed adjacent to
> > each other, meaning changes in one .o file result in offsets changing in
> > other files and then you get performance fluctuations as
> > not-explicitly-padded variables share (or no longer share) cachelines.
>
> Those look like text symbols, not data ones.

I don't know why it is annotated with T. There is only one symbol in the
kernel with that name and it is the var.

You can see the same thing for other vars, for example:
ffffffff83808040 T mmlist_lock
ffffffff83808080 T tasklist_lock
On Tue, Dec 09, 2025 at 10:26:22AM +0100, Mateusz Guzik wrote:
> On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> > As Mateusz pointed out off-list, the profiles look like mutexes are doing
> > less optimistic spinning and more sleeping. Which IMHO isn't something that
> > this change can directly affect.
> >
>
> Not mutexes but rwsems.
>
> The bench at hand has some of the code spinlocked, other parts take
> rwsems for reading *or* writing.
>
> I had a peek at rwsem implementation and to my understanding it can
> degrade to no spinning in a microbenchmark setting like this one,
> provided you are unlucky enough.
So we're just sleep, sleep sleeping? That would explain it...
I mean, is this an issue with the IPC design or with rwsems in general?
>
> In particular you can get unlucky if existing timings get perturbed,
> which I presume is happening after Lorenzo's patch.
>
> To demonstrate I wrote a toy patch which conditionally converts affected
> down_read calls into down_write (inlined at the end).
>
> While the original report is based on a 192-thread box, I was only able
> to test with 80 threads. Even so, the crux of the issue was nicely
> reproduced.
>
> ./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
I wonder if large core count matters here in particular, I was doing this
(albeit in a VM...) with 62 cores
>
> Top says (times vary, idle is growing over time):
> %Cpu(s): 3.3 us, 24.4 sy, 0.0 ni, 72.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> ... but if I flip the switch to down_write:
>
> %Cpu(s): 6.3 us, 80.9 sy, 0.0 ni, 12.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).
>
> In terms of performance I see the following:
>
> stress-ng: metrc: [2546] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
> # sysctl fs.magic_tunable=0 ## down_read
> stress-ng: metrc: [2546] msg 63353488 10.01 28.21 213.26 6331298.95 262362.91 30.16 2016
> # sysctl fs.magic_tunable=1 ## down_write
> stress-ng: metrc: [2036] msg 455014809 10.00 48.79 676.42 45496870.65 627425.68 90.64 2056
>
> That is to say rwsem code is the real culprit and Lorenzo is a random
> (albeit deserving) victim.
;)
Ack interesting.
>
> I see two action items:
> - massage the patch back to a state where things compile to the same asm
> as before as it clearly avoidably regressed regardless of the
> aforementioned issue
I think given the fix is so trivially obviously correct and not egregious to
take, we should do that for the time being.
> - figure out what to do with rwsem code for read vs write spinning
I have no idea how the code generation could have varied here. I did wonder
if the changes somehow pessimised the optimiser for heuristic compiler
reasons, or if somewhere there is something that has an issue with a
certain incantation of cast (but only after a certain flavour of nested
macro expansion?...)
>
> I'm not picking this up for the time being, but I might look at this at
> some point.
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index cd4a3db4659a..de1ef700d144 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
> return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> }
>
> +unsigned long magic_tunable;
> +
> static const struct ctl_table fs_stat_sysctls[] = {
> {
> .procname = "file-nr",
> @@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
> .extra1 = SYSCTL_LONG_ZERO,
> .extra2 = SYSCTL_LONG_MAX,
> },
> + {
> + .procname = "magic_tunable",
> + .data = &magic_tunable,
> + .maxlen = sizeof(magic_tunable),
> + .mode = 0644,
> + .proc_handler = proc_doulongvec_minmax,
> + .extra1 = SYSCTL_LONG_ZERO,
> + .extra2 = SYSCTL_LONG_MAX,
> + },
> +
> {
> .procname = "nr_open",
> .data = &sysctl_nr_open,
> diff --git a/ipc/msg.c b/ipc/msg.c
> index ee6af4fe52bf..fa835ea53e09 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
> return err;
> }
>
> +extern unsigned long magic_tunable;
> +
> static int msgctl_info(struct ipc_namespace *ns, int msqid,
> int cmd, struct msginfo *msginfo)
> {
> @@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
> msginfo->msgmnb = ns->msg_ctlmnb;
> msginfo->msgssz = MSGSSZ;
> msginfo->msgseg = MSGSEG;
> - down_read(&msg_ids(ns).rwsem);
> - if (cmd == MSG_INFO)
> - msginfo->msgpool = msg_ids(ns).in_use;
> - max_idx = ipc_get_maxidx(&msg_ids(ns));
> - up_read(&msg_ids(ns).rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_read(&msg_ids(ns).rwsem);
> + } else {
> + down_write(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_write(&msg_ids(ns).rwsem);
> + }
> if (cmd == MSG_INFO) {
> msginfo->msgmap = min_t(int,
> percpu_counter_sum(&ns->percpu_msg_hdrs),
> diff --git a/ipc/util.c b/ipc/util.c
> index cae60f11d9c2..c65c8289a54b 100644
> --- a/ipc/util.c
> +++ b/ipc/util.c
> @@ -771,6 +771,7 @@ struct ipc_proc_iter {
> struct ipc_namespace *ns;
> struct pid_namespace *pid_ns;
> struct ipc_proc_iface *iface;
> + bool writelocked;
> };
>
> struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
> @@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
> return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
> }
>
> +extern unsigned long magic_tunable;
> +
> /*
> * File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
> * SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
> @@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
> * Take the lock - this will be released by the corresponding
> * call to stop().
> */
> - down_read(&ids->rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&ids->rwsem);
> + iter->writelocked = false;
> + } else {
> + down_write(&ids->rwsem);
> + iter->writelocked = true;
> + }
>
> /* pos < 0 is invalid */
> if (*pos < 0)
> @@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
>
> ids = &iter->ns->ids[iface->ids];
> /* Release the lock we took in start() */
> - up_read(&ids->rwsem);
> + if (!iter->writelocked)
> + up_read(&ids->rwsem);
> + else
> + up_write(&ids->rwsem);
> }
>
> static int sysvipc_proc_show(struct seq_file *s, void *it)
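To condense the locking pattern the toy patch open-codes at each site: take
the rwsem for reading or writing depending on the tunable, and remember which
was taken so the unlock matches even if the tunable changes in between. A
minimal sketch with hypothetical helper names (ipc_ids_lock()/ipc_ids_unlock()
do not exist in the patch):

/*
 * Sketch only: hypothetical helpers condensing the toy patch's pattern.
 * READ_ONCE() samples the tunable exactly once; the result must be
 * carried to the matching unlock, which is why sysvipc_proc_start()
 * records 'writelocked' in the iterator.
 */
static bool ipc_ids_lock(struct ipc_ids *ids)
{
	bool write = READ_ONCE(magic_tunable);

	if (write)
		down_write(&ids->rwsem);
	else
		down_read(&ids->rwsem);
	return write;
}

static void ipc_ids_unlock(struct ipc_ids *ids, bool write)
{
	if (write)
		up_write(&ids->rwsem);
	else
		up_read(&ids->rwsem);
}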
On Wed, Dec 10, 2025 at 5:18 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Dec 09, 2025 at 10:26:22AM +0100, Mateusz Guzik wrote:
> > On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> > > As Mateusz pointed out off-list, the profiles look like mutexes are doing
> > > less optimistic spinning and more sleeping. Which IMHO isn't something that
> > > this change can directly affect.
> >
> > Not mutexes but rwsems.
> >
> > The bench at hand has some of the code spinlocked, other parts take
> > rwsems for reading *or* writing.
> >
> > I had a peek at rwsem implementation and to my understanding it can
> > degrade to no spinning in a microbenchmark setting like this one,
> > provided you are unlucky enough.
>
> So we're just sleep, sleep sleeping? That would explain it...
>
> I mean is this an issue with the IPC design or the rwsems in general?

The ipc code is not doing itself any favors, but is probably not worth
working on either.

The locking code suffers internal problems. While there is no such thing
as the fastest possible lock, let alone a sleepable rw lock, it is pretty
clear the current code leaves perf on the table even with that caveat.

> > In particular you can get unlucky if existing timings get perturbed,
> > which I presume is happening after Lorenzo's patch.
> >
> > To demonstrate I wrote a toy patch which conditionally converts affected
> > down_read calls into down_write (inlined at the end).
> >
> > While the original report is based on a 192-thread box, I was only able
> > to test with 80 threads. Even so, the crux of the issue was nicely
> > reproduced.
> >
> > ./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
>
> I wonder if a large core count matters here in particular, I was doing this
> (albeit in a VM...) with 62 cores

You need a large enough count to generate enough contention in the first
place. How much it takes for this bench I have not tried figuring out,
even for my hw.
On Mon, 8 Dec 2025 16:42:43 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 19:18:56 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > > > ifdef-fery VM_xxx flag declarations via macro.
> > > > > >
> > > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > > > the newly introduced VMA bit numbers.
> > > > > >
> > > > > > However, use of this macro results in apparently unfortunate macro
> > > > > > expansion and a performance degradation. This appears to be due
> > > > > > to the (__force int), which is required for the sparse typechecking to
> > > > > > work.
> > > > >
> > > > > Does sparse complain if you just add 0? As in:
> > > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > > > >
> > > > > That should change the type without affecting what BIT() expands to.
> > > >
> > > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> > > >
> > > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > > > that this is an issue.
> > >
> > > I might use some of my copious spare time (ha) to see why BIT() fails.
> > > I bet it is just too complex for its own good.
> > > Personally I'm fine with both explicit (1ul << n) and hex constants.
> > > The latter are definitely most useful if you ever look at hexdumps.
> >
> > Thanks :) yeah, I just didn't want to go down that rabbit hole myself, as I
> > seemed to have the answer and wanted to get it fixed, but I'm obviously
> > quite curious as to what on earth is causing that.
>
> I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
>
> Maybe I should fiddle with that and see...
Hmmm...
/*
* Optimisation macro. It is equivalent to:
* (x & bit1) ? bit2 : 0
* but this version is faster.
* ("bit1" and "bit2" must be single bits)
*/
#define _calc_vm_trans(x, bit1, bit2) \
((!(bit1) || !(bit2)) ? 0 : \
((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
: ((x) & (bit1)) / ((bit1) / (bit2))))
The comment fails to mention it is only sane for constants.
If nothing else, 9 expansions of BIT() are going to generate a very
long line.
For starters, make it a statement expression and use __auto_type _bit1 = bit1.
Then add a check for both _bit1 and _bit2 being constants.
It is also worth checking the compiler doesn't do it for you.
Looks like gcc 7.1 onwards generate the 'optimised' code.
https://godbolt.org/z/EGGE56E3r
David
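A rough sketch of the rework David describes, with the _bit1/_bit2 locals and
the BUILD_BUG_ON()-based constant check as illustrative, untested choices:

/*
 * Sketch only, untested: evaluate each bit argument exactly once via a
 * statement expression, and require both to be compile-time constants
 * so the multiply/divide folds down to a shift.
 */
#define _calc_vm_trans(x, bit1, bit2)					\
({									\
	__auto_type _bit1 = (bit1);					\
	__auto_type _bit2 = (bit2);					\
									\
	BUILD_BUG_ON(!__builtin_constant_p(_bit1));			\
	BUILD_BUG_ON(!__builtin_constant_p(_bit2));			\
	(!_bit1 || !_bit2) ? 0 :					\
		(_bit1 <= _bit2 ? ((x) & _bit1) * (_bit2 / _bit1)	\
				: ((x) & _bit1) / (_bit1 / _bit2));	\
})

One caveat with this shape: a statement expression cannot appear in file-scope
initialisers, so it only works if every _calc_vm_trans() user sits in function
context.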
On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro results in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due
> to the (__force int), which is required for the sparse typechecking to
> work.
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Sorry, forgot to add a Fixes tag... :)
Please also update with that when the PR is upstream, apologies :P
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
Cheers, Lorenzo