Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
how VMA flags are declared, utilising an enum of VMA bit values and
ifdef-fery VM_xxx flag declarations via macro.
As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
the newly introduced VMA bit numbers.
However, use of this macro resulted in apparently unfortunate macro
expansion and a performance degradation. This appears to be due to the
(__force int) cast, which is required for the sparse typechecking to work.
Avoid macro expansion issues by simply using 1UL << bitnum.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
Andrew - note I've referenced the linux-next commit number above, could you
replace with the upstream commit hash once your PR is taken? Thanks!
include/linux/mm.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2f38fb68840..c4438b30c140 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -395,7 +395,8 @@ enum {
#undef DECLARE_VMA_BIT
#undef DECLARE_VMA_BIT_ALIAS
-#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
+#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
+
#define VM_READ INIT_VM_FLAG(READ)
#define VM_WRITE INIT_VM_FLAG(WRITE)
#define VM_EXEC INIT_VM_FLAG(EXEC)
--
2.52.0
On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
What the hell is __bitwise doing on these enum values?
Could we please get rid of that ridiculous cargo-culting?
Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
declared __bitwise?
On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
>
> > -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> > +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>
> What the hell is __bitwise doing on these enum values?
> Could we please get rid of that ridiculous cargo-culting?
>
> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
> declared __bitwise?
FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
then you get warned about arithmetics and conversions to integer
for those, with bitwise operations explicitly allowed.
VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
VM_READ + 14 is nonsense and should be warned about. That's where
__bitwise would make sense. On bit numbers it's not - what makes
VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
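To illustrate the distinction with made-up names (not actual kernel
definitions; __bitwise/__force as per linux/compiler_types.h) - a __bitwise
typedef on the *mask* type lets sparse allow | and & while flagging
arithmetic and implicit integer conversions:

typedef unsigned long __bitwise frob_flags_t;

#define FROB_READ ((__force frob_flags_t)(1UL << 0))
#define FROB_EXEC ((__force frob_flags_t)(1UL << 2))

static void frob_demo(void)
{
	frob_flags_t f = FROB_READ | FROB_EXEC;	/* bitwise op: allowed */
	frob_flags_t g = FROB_READ + 14;	/* arithmetic: sparse warns */
	unsigned long n = f;			/* implicit conversion: sparse warns */

	(void)g;
	(void)n;
}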
On 12/6/25 2:26 AM, Al Viro wrote:
> On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
>> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
>>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>>> how VMA flags are declared, utilising an enum of VMA bit values and
>>> ifdef-fery VM_xxx flag declarations via macro.
>>>
>>> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
>>> the newly introduced VMA bit numbers.
>>>
>>> However, use of this macro resulted in apparently unfortunate macro
>>> expansion and a performance degradation. This appears to be due to the
>>> (__force int) cast, which is required for the sparse typechecking to work.
>>
>>> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
>>> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>>
>> What the hell is __bitwise doing on these enum values?
>> Could we please get rid of that ridiculous cargo-culting?
>>
>> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
>> declared __bitwise?
I was confused by this too at first when reviewing, but instead of the angry
display above, I simply asked the author and got answers.
Comment says:
/**
* typedef vma_flag_t - specifies an individual VMA flag by bit number.
*
* This value is made type safe by sparse to avoid passing invalid flag values
* around.
*/
typedef int __bitwise vma_flag_t;
It's done as documented in Documentation/dev-tools/sparse.rst section
"Using sparse for typechecking".
So yeah, the keyword is __bitwise, and indeed we don't perform bitwise
operations on the VM_ values; in fact we don't perform any operations
without __force casting them back first, to catch when they are used by
mistake.
It's not cargo-culting - IIRC it caught a bug in an early version of the
patch itself.
I wouldn't mind if sparse provided a different keyword than __bitwise
for this use case to make it less misleading. Or even better if we could
make the compiler itself treat vma_flag_t as a "special int" that can't
be implicitly cast to a normal int, so we don't have to rely on sparse
checks to catch those.
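A minimal sketch of the pattern as applied here, with hypothetical names
rather than the real mm.h definitions:

typedef int __bitwise frob_bit_t;

#define FROB_READ_BIT	((__force frob_bit_t)0)
#define FROB_WRITE_BIT	((__force frob_bit_t)1)

static inline void frob_set_bit(frob_bit_t bit, unsigned long *flags)
{
	/* the one sanctioned spot where we __force back to a plain int */
	*flags |= 1UL << (__force int)bit;
}

/* frob_set_bit(3, &f) - or passing a VM_xxx-style mask - warns under
 * sparse; frob_set_bit(FROB_READ_BIT, &f) is accepted. */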
> FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
> then you get warned about arithmetics and conversions to integer
> for those, with bitwise operations explicitly allowed.
>
> VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
> VM_READ + 14 is nonsense and should be warned about. That's where
> __bitwise would make sense. On bit numbers it's not - what makes
> VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
On Sat, Dec 06, 2025 at 01:35:51PM +0100, Vlastimil Babka wrote:
> On 12/6/25 2:26 AM, Al Viro wrote:
> > On Sat, Dec 06, 2025 at 01:14:35AM +0000, Al Viro wrote:
> >> On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> >>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> >>> how VMA flags are declared, utilising an enum of VMA bit values and
> >>> ifdef-fery VM_xxx flag declarations via macro.
> >>>
> >>> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> >>> the newly introduced VMA bit numbers.
> >>>
> >>> However, use of this macro resulted in apparently unfortunate macro
> >>> expansion and a performance degradation. This appears to be due to the
> >>> (__force int) cast, which is required for the sparse typechecking to work.
> >>
> >>> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> >>> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> >>
> >> What the hell is __bitwise doing on these enum values?
> >> Could we please get rid of that ridiculous cargo-culting?
> >>
> >> Bitwise operations on BIT NUMBERS make no sense whatsoever; why are those
> >> declared __bitwise?
>
> I was confused by this too at first when reviewing, but instead of the angry
> display above, I simply asked the author and got answers.
>
> Comment says:
>
> /**
> * typedef vma_flag_t - specifies an individual VMA flag by bit number.
> *
> * This value is made type safe by sparse to avoid passing invalid flag values
> * around.
> */
> typedef int __bitwise vma_flag_t;
>
> It's done as documented in Documentation/dev-tools/sparse.rst section
> "Using sparse for typechecking".
>
> So yeah, the keyword is __bitwise, and indeed we don't perform bitwise
> operations on the VM_ values; in fact we don't perform any operations
> without __force casting them back first, to catch when they are used by
> mistake.
> It's not cargo-culting - IIRC it caught a bug in an early version of the
> patch itself.
>
> I wouldn't mind if sparse provided a different keyword than __bitwise
> for this use case to make it less misleading. Or even better if we could
> make the compiler itself treat vma_flag_t as a "special int" that can't
> be implicitly cast to a normal int, so we don't have to rely on sparse
> checks to catch those.
>
Yup precisely - this was entirely to avoid issues with passing a VM_xxx flag
around when a VMA bit number is required, which is a kind of bug that would
be _really easy_ to introduce otherwise.
vma_flags_set(..., VM_READ);
Reads perfectly but sets the write bit instead of the read bit, for instance.
Yes this is a hack, but does the job, and the sparse documentation doesn't
dissuade.
I agree with Vlasta that really we should provide an __explicit_type or
whatever annotation for this usage.
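Spelled out with the real bit values (VM_READ is bit 0, VM_WRITE is bit 1):
VM_READ == 1UL << 0 == 0x1, so passing VM_READ where a bit *number* is
expected silently means "bit 1", i.e. the write bit.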
>
> > FWIW, bitwise does make sense for things like (1 << SOME_CONSTANT);
> > then you get warned about arithmetics and conversions to integer
> > for those, with bitwise operations explicitly allowed.
> >
> > VM_... are such; VMA_..._BIT are not. VM_READ | VM_EXEC is fine;
> > VM_READ + 14 is nonsense and should be warned about. That's where
> > __bitwise would make sense. On bit numbers it's not - what makes
> > VMA_BIT_MAYREAD ^ VMA_BIT_SHARED any better than 3 * VMA_BIT_MAYREAD?
>
The issue isn't so much the operations - and yes, obviously VMA_MAYREAD_BIT ^
VMA_SHARED_BIT makes no sense, but nobody in their right mind would be
doing that anyway, right?
I'm not using sparse attributes here to enforce basic baseline sanity, but
rather to avoid the aforementioned class of bug, and it works very
effectively.
I did speak to Vlasta about a struct foo { int val; }; type thing, but sadly
then we can't have the values in an enum. We put them in an enum because
otherwise tooling like drgn, rust, etc. find it harder to get access to the
type, and it is in fact a useful way of defining these values, as they
naturally do belong to an enum (unique, individual values).
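For reference, a sketch of that struct-based alternative with hypothetical
names - the compiler itself then rejects mixups, at the cost of the values
no longer being enum constants:

typedef struct { int bit; } frob_bitnum_t;

#define FROB_READ_BIT ((frob_bitnum_t){ .bit = 0 })

static inline unsigned long frob_mask(frob_bitnum_t b)
{
	return 1UL << b.bit;
}

/* frob_mask(FROB_READ_BIT) compiles; frob_mask(0) or frob_mask(VM_READ)
 * are hard compile errors, no sparse required. */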
Cheers, Lorenzo
On Fri, 5 Dec 2025 17:50:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>
> ...
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
That's in mm-stable so the hash shouldn't be changing.
I'm not really sure what's the best way to determine this. I use
hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
mm-everything-2025-11-29-19-43
mm-everything-2025-12-01-19-02
mm-everything-2025-12-03-23-49
mm-everything-2025-12-05-00-55
mm-stable-2025-12-03-21-26
Hi Andrew,

On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'm not really sure what's the best way to determine this. I use
>
> hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> mm-everything-2025-11-29-19-43
> mm-everything-2025-12-01-19-02
> mm-everything-2025-12-03-23-49
> mm-everything-2025-12-05-00-55
> mm-stable-2025-12-03-21-26

What does "git branch --contains 2b6a3f061f11" say in your tree?

In my linux-next tree it says (I need the -r to check remote branches):

$ git branch -r --contains 2b6a3f061f11
  mm-stable/mm-stable
  mm-unstable/mm-unstable

but I don't export my remotes to my published tree.

--
Cheers,
Stephen Rothwell
On Sat, 6 Dec 2025 11:40:34 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:

> Hi Andrew,
>
> On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > I'm not really sure what's the best way to determine this. I use
> >
> > hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> > mm-everything-2025-11-29-19-43
> > mm-everything-2025-12-01-19-02
> > mm-everything-2025-12-03-23-49
> > mm-everything-2025-12-05-00-55
> > mm-stable-2025-12-03-21-26
>
> What does "git branch --contains 2b6a3f061f11" say in your tree?

hp2:/usr/src/mm> git branch --contains 2b6a3f061f11
* linus
  mm-everything
  mm-new
  mm-stable
  mm-unstable
hp2:/usr/src/mm> git branch -r --contains 2b6a3f061f11
  linus/master
  origin/mm-everything
  origin/mm-new
  origin/mm-stable
  origin/mm-unstable

kinda random, but it tells me "that's in mm-stable", which is what counts.

> In my linux-next tree it says (I need the -r to check remote branches):
>
> $ git branch -r --contains 2b6a3f061f11
>   mm-stable/mm-stable
>   mm-unstable/mm-unstable
>
> but I don't export my remotes to my published tree.
On Fri, Dec 05, 2025 at 07:12:55PM -0800, Andrew Morton wrote:
> On Sat, 6 Dec 2025 11:40:34 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > Hi Andrew,
> >
> > On Fri, 5 Dec 2025 12:15:01 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > I'm not really sure what's the best way to determine this. I use
> > >
> > > hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> > > mm-everything-2025-11-29-19-43
> > > mm-everything-2025-12-01-19-02
> > > mm-everything-2025-12-03-23-49
> > > mm-everything-2025-12-05-00-55
> > > mm-stable-2025-12-03-21-26
> >
> > What does "git branch --contains 2b6a3f061f11" say in your tree?
>
> hp2:/usr/src/mm> git branch --contains 2b6a3f061f11
> * linus
>   mm-everything
>   mm-new
>   mm-stable
>   mm-unstable
> hp2:/usr/src/mm> git branch -r --contains 2b6a3f061f11
>   linus/master
>   origin/mm-everything
>   origin/mm-new
>   origin/mm-stable
>   origin/mm-unstable
>
> kinda random, but it tells me "that's in mm-stable", which is what counts.
>
> > In my linux-next tree it says (I need the -r to check remote branches):
> >
> > $ git branch -r --contains 2b6a3f061f11
> >   mm-stable/mm-stable
> >   mm-unstable/mm-unstable
> >
> > but I don't export my remotes to my published tree.

Thanks guys for confirming, this is therefore less work all round :)

Though Andrew do please add the Fixes tag that I foolishly forgot to
include, thanks!

Cheers, Lorenzo
On 12/5/25 21:15, Andrew Morton wrote:
> On Fri, 5 Dec 2025 17:50:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
>> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>>
>> ...
>>
>> Andrew - note I've referenced the linux-next commit number above, could you
>> replace with the upstream commit hash once your PR is taken? Thanks!
>
> That's in mm-stable so the hash shouldn't be changing.
>
>
> I'm not really sure what's the best way to determine this. I use
>
> hp2:/usr/src/mm> git tag --contains 2b6a3f061f11
> mm-everything-2025-11-29-19-43
> mm-everything-2025-12-01-19-02
> mm-everything-2025-12-03-23-49
> mm-everything-2025-12-05-00-55
> mm-stable-2025-12-03-21-26
I asked myself the same question a couple of times.
Maybe this?
$ git branch -r --contains 2b6a3f061f11
mm/mm-everything
mm/mm-new
mm/mm-stable
--
Cheers
David
On 12/5/25 9:50 AM, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
OK, so now maybe we don't need all of the rust/bindings/bindings_helper.h
changes? Those were because Rust's bindgen doesn't properly handle
nested macros, as I recall.
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
>
thanks,
--
John Hubbard
On Fri, Dec 05, 2025 at 11:56:32AM -0800, John Hubbard wrote:
> On 12/5/25 9:50 AM, Lorenzo Stoakes wrote:
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Avoid macro expansion issues by simply using 1UL << bitnum.
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >
> > Andrew - note I've referenced the linux-next commit number above, could you
> > replace with the upstream commit hash once your PR is taken? Thanks!
> >
> > include/linux/mm.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a2f38fb68840..c4438b30c140 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -395,7 +395,8 @@ enum {
> > #undef DECLARE_VMA_BIT
> > #undef DECLARE_VMA_BIT_ALIAS
> >
> > -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> > +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
>
> OK, so now maybe we don't need all of the rust/bindings/bindings_helper.h
> changes? Those were because Rust's bindgen doesn't properly handle
> nested macros, as I recall.
Ah, seems you're right - I just tried a clang/rust build with those dropped
and it worked fine.
I've added a TODO to send a patch to remove those after this one lands,
thanks!
Cheers, Lorenzo
>
> > +
> > #define VM_READ INIT_VM_FLAG(READ)
> > #define VM_WRITE INIT_VM_FLAG(WRITE)
> > #define VM_EXEC INIT_VM_FLAG(EXEC)
> > --
> > 2.52.0
> >
>
> thanks,
> --
> John Hubbard
>
On Fri, 5 Dec 2025 17:50:37 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro resulted in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due to the
> (__force int) cast, which is required for the sparse typechecking to work.
Does sparse complain if you just add 0? As in:
#define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
That should change the type without affecting what BIT() expands to.
David
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
>
On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 17:50:37 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > how VMA flags are declared, utilising an enum of VMA bit values and
> > ifdef-fery VM_xxx flag declarations via macro.
> >
> > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > the newly introduced VMA bit numbers.
> >
> > However, use of this macro resulted in apparently unfortunate macro
> > expansion and a performance degradation. This appears to be due to the
> > (__force int) cast, which is required for the sparse typechecking to work.
>
> Does sparse complain if you just add 0? As in:
> #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
>
> That should change the type without affecting what BIT() expands to.
Thanks, checked that and unfortunately that doesn't satisfy sparse :)
I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
that this is an issue.
<insert rant about C macros here>
Cheers, Lorenzo
On Fri, 5 Dec 2025 19:18:56 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 17:50:37 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > ifdef-fery VM_xxx flag declarations via macro.
> > >
> > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > the newly introduced VMA bit numbers.
> > >
> > > However, use of this macro resulted in apparently unfortunate macro
> > > expansion and a performance degradation. This appears to be due to the
> > > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Does sparse complain if you just add 0? As in:
> > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> >
> > That should change the type without affecting what BIT() expands to.
>
> Thanks, checked that and unfortunately that doesn't satisfy sparse :)
Oh - it is that __bitwise that causes grief.
> I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> that this is an issue.
Did you try getting DECLARE_VMA_BIT to define both the bit number and the
bit flag and put them both into the anonymous enum?
Or are there other reasons for not doing that?
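Something along these lines, say (a hypothetical sketch; fine for low bit
numbers at least, since enumerator values must fit in an int):

#define DECLARE_VMA_BIT(name, nr)	\
	VMA_ ## name ## _BIT = (nr),	\
	VM_ ## name = 1UL << (nr)

enum {
	DECLARE_VMA_BIT(READ, 0),
	DECLARE_VMA_BIT(WRITE, 1),
};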
>
> <insert rant about C macros here>
Add rant about the compiler thinking anon enums are doing anything other
than defining constants.
David
>
> Cheers, Lorenzo
On Fri, Dec 05, 2025 at 09:49:40PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 19:18:56 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > ifdef-fery VM_xxx flag declarations via macro.
> > > >
> > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > the newly introduced VMA bit numbers.
> > > >
> > > > However, use of this macro resulted in apparently unfortunate macro
> > > > expansion and a performance degradation. This appears to be due to the
> > > > (__force int) cast, which is required for the sparse typechecking to work.
> > >
> > > Does sparse complain if you just add 0? As in:
> > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > >
> > > That should change the type without affecting what BIT() expands to.
> >
> > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>
> Oh - it is that __bitwise that causes grief.
Well, if a sparse build is not enabled this tag is just removed (as is __force).
>
> > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > that this is an issue.
>
> Did you try getting DECLARE_VMA_BIT to define both the bit number and the
> bit flag and put them both into the anonymous enum?
> Or are there other reasons for not doing that?
I did, and we can't do that because it results in errors like 'enum constant
in boolean context [-Werror=int-in-bool-context]', as clearly VM_xxx flags
are used in many different contexts in the kernel, many of which seem
incompatible with enum constants (even though... they should be equivalent,
at least in theory?)
>
> >
> > <insert rant about C macros here>
>
> Add rant about the compiler thinking anon enums are doing anything other
> than defining constants.
Right yes :)
I put these in an enum in part to make life easier for tools like drgn to be
able to find these values (the guys asked about this explicitly). But also it
makes sense for them to be in an enum!
Really the VM_xxx flags are at least in theory a temporary hack until everything
can use bit numbers... assuming I can find a way to do that without causing
performance regressions :)
The perf regression here was rather unexpected, however, I must say!
>
> David
>
> >
> > Cheers, Lorenzo
>
Cheers, Lorenzo
On Fri, 5 Dec 2025 19:18:56 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 17:50:37 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > ifdef-fery VM_xxx flag declarations via macro.
> > >
> > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > the newly introduced VMA bit numbers.
> > >
> > > However, use of this macro resulted in apparently unfortunate macro
> > > expansion and a performance degradation. This appears to be due to the
> > > (__force int) cast, which is required for the sparse typechecking to work.
> >
> > Does sparse complain if you just add 0? As in:
> > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> >
> > That should change the type without affecting what BIT() expands to.
>
> Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>
> I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> that this is an issue.
I might use some of my copious spare time (ha) to see why BIT() fails.
I bet it is just too complex for its own good.
Personally I'm fine with both explicit (1ul << n) and hex constants.
The latter are definitely most useful if you ever look at hexdumps.
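For reference, BIT() itself is nominally trivial - per include/vdso/bits.h
it is (going through the UL()/_AC() token-pasting layers in the const.h
headers):

#define BIT(nr) (UL(1) << (nr))
/* UL(1) expands via _UL()/_AC()/__AC() token pasting to (1UL), so the
 * result ought to be equivalent to a plain 1UL << (nr) */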
At the moment I'm trying to fix bitfield.h so you don't get compile errors
on lines that are 18KB long.
Found a new version in linux-next - has its own set of new bugs as well
as more of the old ones.
David
>
> <insert rant about C macros here>
>
> Cheers, Lorenzo
On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> On Fri, 5 Dec 2025 19:18:56 +0000
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > ifdef-fery VM_xxx flag declarations via macro.
> > > >
> > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > the newly introduced VMA bit numbers.
> > > >
> > > > However, use of this macro resulted in apparently unfortunate macro
> > > > expansion and a performance degradation. This appears to be due to the
> > > > (__force int) cast, which is required for the sparse typechecking to work.
> > >
> > > Does sparse complain if you just add 0? As in:
> > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > >
> > > That should change the type without affecting what BIT() expands to.
> >
> > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> >
> > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > that this is an issue.
>
> I might use some of my copious spare time (ha) to see why BIT() fails.
> I bet it is just too complex for its own good.
> Personally I'm fine with both explicit (1ul << n) and hex constants.
> The latter are definitely most useful if you ever look at hexdumps.
Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
to have the answer and wanted to get it fixed, but obviously am quite curious as
to what on earth is causing that.
>
> At the moment I'm trying to fix bitfield.h so you don't get compile errors
> on lines that are 18KB long.
:)
>
> Found a new version in linux-next - has its own set of new bugs as well
> as more of the old ones.
>
> David
>
> >
> > <insert rant about C macros here>
> >
> > Cheers, Lorenzo
>
Cheers, Lorenzo
On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
> On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> > On Fri, 5 Dec 2025 19:18:56 +0000
> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > > ifdef-fery VM_xxx flag declarations via macro.
> > > > >
> > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > > the newly introduced VMA bit numbers.
> > > > >
> > > > > However, use of this macro resulted in apparently unfortunate macro
> > > > > expansion and a performance degradation. This appears to be due to the
> > > > > (__force int) cast, which is required for the sparse typechecking to work.
> > > >
> > > > Does sparse complain if you just add 0? As in:
> > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > > >
> > > > That should change the type without affecting what BIT() expands to.
> > >
> > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> > >
> > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > > that this is an issue.
> >
> > I might use some of my copious spare time (ha) to see why BIT() fails.
> > I bet it is just too complex for its own good.
> > Personally I'm fine with both explicit (1ul << n) and hex constants.
> > The latter are definitely most useful if you ever look at hexdumps.
>
> Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
> to have the answer and wanted to get it fixed, but obviously am quite curious as
> to what on earth is causing that.
I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
Maybe I should fiddle with that and see...
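For reference, that macro translates a flag from one bit position to another
by multiplying or dividing by the ratio of the two bits - roughly, per
include/linux/mman.h:

#define _calc_vm_trans(x, bit1, bit2) \
  ((!(bit1) || !(bit2)) ? 0 : \
  ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
   : ((x) & (bit1)) / ((bit1) / (bit2))))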
Cheers, Lorenzo
On 12/8/25 17:42, Lorenzo Stoakes wrote:
> On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
>> On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
>> > On Fri, 5 Dec 2025 19:18:56 +0000
>> > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>> >
>> > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
>> > > > On Fri, 5 Dec 2025 17:50:37 +0000
>> > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>> > > >
>> > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
>> > > > > how VMA flags are declared, utilising an enum of VMA bit values and
>> > > > > ifdef-fery VM_xxx flag declarations via macro.
>> > > > >
>> > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
>> > > > > the newly introduced VMA bit numbers.
>> > > > >
>> > > > > However, use of this macro resulted in apparently unfortunate macro
>> > > > > expansion and a performance degradation. This appears to be due to
>> > > > > the (__force int) cast, which is required for the sparse
>> > > > > typechecking to work.
>> > > >
>> > > > Does sparse complain if you just add 0? As in:
>> > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
>> > > >
>> > > > That should change the type without affecting what BIT() expands to.
>> > >
>> > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
>> > >
>> > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
>> > > that this is an issue.
>> >
>> > I might use some of my copious spare time (ha) to see why BIT() fails.
>> > I bet it is just too complex for its own good.
>> > Personally I'm fine with both explicit (1ul << n) and hex constants.
>> > The latter are definitely most useful if you ever look at hexdumps.
>>
>> Thanks :) yeah I just didn't want to go down that rabbit hole myself as I seemed
>> to have the answer and wanted to get it fixed, but obviously am quite curious as
>> to what on earth is causing that.
>
> I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
It's unlikely that this affects anything in what the benchmark stresses.
As Mateusz pointed out off-list, the profiles look like mutexes are doing
less optimistic spinning and more sleeping. Which IMHO isn't something that
this change can directly affect.
My own bloat-o-meter test before/after the fix suggests no change in code
generation (as I would indeed expect), except for some weird symbols that
don't look related at all:
> ./scripts/bloat-o-meter vmlinux.o vmlinux.o.after
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-277 (-277)
Function old new delta
print_fmt_dax_pte_fault_class 1167 1160 -7
print_fmt_dax_pmd_load_hole_class 308 301 -7
print_fmt_dax_pmd_fault_class 1252 1245 -7
saved_rsp 3117296 3117232 -64
saved_rdi 3117312 3117248 -64
saved_rbx 3117304 3117240 -64
saved_rbp 3117328 3117264 -64
Total: Before=3350750464, After=3350750187, chg -0.00%
I don't know what happened to those functions above. Could be just
insufficient build reproducibility?
Maybe the effect is just that something slightly shifts in the cpu caches?
On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> As Mateusz pointed out off-list, the profiles look like mutexes are doing
> less optimistic spinning and more sleeping. Which IMHO isn't something that
> this change can directly affect.
>
Not mutexes but rwsems.
The bench at hand has some of the code spinlocked, other parts take
rwsems for reading *or* writing.
I had a peek at rwsem implementation and to my understanding it can
degrade to no spinning in a microbenchmark setting like this one,
provided you are unlucky enough.
In particular you can get unlucky if existing timings get perturbed,
which I presume is happening after Lorenzo's patch.
To demonstrate I wrote a toy patch which conditionally converts affected
down_read calls into down_write (inlined at the end).
While the original report is based on a 192-thread box, I was only able
to test with 80 threads. Even so, the crux of the issue was nicely
reproduced.
./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
Top says (times vary, idle is growing over time):
%Cpu(s): 3.3 us, 24.4 sy, 0.0 ni, 72.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
... but if I flip the switch to down_write:
%Cpu(s): 6.3 us, 80.9 sy, 0.0 ni, 12.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).
In terms of performance I see the following:
stress-ng: metrc: [2546] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
# sysctl fs.magic_tunable=0 ## down_read
stress-ng: metrc: [2546] msg 63353488 10.01 28.21 213.26 6331298.95 262362.91 30.16 2016
# sysctl fs.magic_tunable=1 ## down_write
stress-ng: metrc: [2036] msg 455014809 10.00 48.79 676.42 45496870.65 627425.68 90.64 2056
That is to say rwsem code is the real culprit and Lorenzo is a random
(albeit deserving) victim.
I see two action items:
- massage the patch back to a state where things compile to the same asm
as before as it clearly avoidably regressed regardless of the
aforementioned issue
- figure out what to do with rwsem code for read vs write spinning
I'm not picking this up for the time being, but I might look at this at
some point.
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+unsigned long magic_tunable;
+
static const struct ctl_table fs_stat_sysctls[] = {
{
.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
.extra1 = SYSCTL_LONG_ZERO,
.extra2 = SYSCTL_LONG_MAX,
},
+ {
+ .procname = "magic_tunable",
+ .data = &magic_tunable,
+ .maxlen = sizeof(magic_tunable),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra1 = SYSCTL_LONG_ZERO,
+ .extra2 = SYSCTL_LONG_MAX,
+ },
+
{
.procname = "nr_open",
.data = &sysctl_nr_open,
diff --git a/ipc/msg.c b/ipc/msg.c
index ee6af4fe52bf..fa835ea53e09 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
return err;
}
+extern unsigned long magic_tunable;
+
static int msgctl_info(struct ipc_namespace *ns, int msqid,
int cmd, struct msginfo *msginfo)
{
@@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
msginfo->msgmnb = ns->msg_ctlmnb;
msginfo->msgssz = MSGSSZ;
msginfo->msgseg = MSGSEG;
- down_read(&msg_ids(ns).rwsem);
- if (cmd == MSG_INFO)
- msginfo->msgpool = msg_ids(ns).in_use;
- max_idx = ipc_get_maxidx(&msg_ids(ns));
- up_read(&msg_ids(ns).rwsem);
+ if (!READ_ONCE(magic_tunable)) {
+ down_read(&msg_ids(ns).rwsem);
+ if (cmd == MSG_INFO)
+ msginfo->msgpool = msg_ids(ns).in_use;
+ max_idx = ipc_get_maxidx(&msg_ids(ns));
+ up_read(&msg_ids(ns).rwsem);
+ } else {
+ down_write(&msg_ids(ns).rwsem);
+ if (cmd == MSG_INFO)
+ msginfo->msgpool = msg_ids(ns).in_use;
+ max_idx = ipc_get_maxidx(&msg_ids(ns));
+ up_write(&msg_ids(ns).rwsem);
+ }
if (cmd == MSG_INFO) {
msginfo->msgmap = min_t(int,
percpu_counter_sum(&ns->percpu_msg_hdrs),
diff --git a/ipc/util.c b/ipc/util.c
index cae60f11d9c2..c65c8289a54b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -771,6 +771,7 @@ struct ipc_proc_iter {
struct ipc_namespace *ns;
struct pid_namespace *pid_ns;
struct ipc_proc_iface *iface;
+ bool writelocked;
};
struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
@@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
}
+extern unsigned long magic_tunable;
+
/*
* File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
* SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
@@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
* Take the lock - this will be released by the corresponding
* call to stop().
*/
- down_read(&ids->rwsem);
+ if (!READ_ONCE(magic_tunable)) {
+ down_read(&ids->rwsem);
+ iter->writelocked = false;
+ } else {
+ down_write(&ids->rwsem);
+ iter->writelocked = true;
+ }
/* pos < 0 is invalid */
if (*pos < 0)
@@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
ids = &iter->ns->ids[iface->ids];
/* Release the lock we took in start() */
- up_read(&ids->rwsem);
+ if (!iter->writelocked)
+ up_read(&ids->rwsem);
+ else
+ up_write(&ids->rwsem);
}
static int sysvipc_proc_show(struct seq_file *s, void *it)
So I had a look where the timing difference is coming from and I think I
have the answer: init_ipc_ns does not have a guaranteed cacheline placement
and things get moved around with the patch.

On my kernels (nm vmlinux-newbits | sort -nk 1 | less)

before:
ffffffff839ffb60 T init_ipc_ns
ffffffff83a00020 t event_exit__msgrcv

after:
ffffffff839ffbc0 T init_ipc_ns
ffffffff83a00080 t event_exit__msgrcv

This is the pervasive problem of vars from all .o files placed adjacent to
each other, meaning changes in one .o file result in offsets changing in
other files and then you get performance fluctuations as
not-explicitly-padded variables share (or no longer share) cachelines.

I brought this up a year ago elsewhere:
https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html

maybe i should pick it up again and see it through

as for the thing at hand, someone(tm) will want to make sure the namespace
is cacheline aligned and possibly pad its own internals afterwards.
Personally I can't be bothered.
On Fri, Dec 12, 2025 at 01:24:57PM +0100, Mateusz Guzik wrote:
> So I had a look where the timing difference is coming from and I think
> I have the answer: init_ipc_ns does not have a guaranteed cacheline
> placement and things get moved around with the patch.
>
> On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
>
> before:
> ffffffff839ffb60 T init_ipc_ns
> ffffffff83a00020 t event_exit__msgrcv
>
> after:
> ffffffff839ffbc0 T init_ipc_ns
> ffffffff83a00080 t event_exit__msgrcv
>
> This is the pervasive problem of vars from all .o files placed
> adjacent to each other, meaning changes in one .o file result in
> offsets changing in other files and then you get performance
> fluctuations as not-explicitly-padded variables share (or no longer
> share) cachelines.
>
> I brought this up a year ago elsewhere:
> https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html
>
> maybe i should pick it up again and see it through
>
> as for the thing at hand, someone(tm) will want to make sure the
> namespace is cacheline aligned and possibly pad its own internals
> afterwards. Personally I can't be bothered.
Thanks! Looking at it, it seems we accumulate a bunch of offsets in:
print_fmt_dax_pte_fault_class
print_fmt_dax_pmd_load_hole_class
print_fmt_dax_pmd_fault_class
(entries that the bloat-o-meter confirms changed in size)
That continue on to eventually offset init_ipc_ns.
It actually looked in my testing like performance _improved_ with the change,
and I notice locally I end up aligned on the struct:
ffffffff82be16a0 T init_ipc_ns
->
ffffffff82be1700 T init_ipc_ns
Examining stress-ng before 2b6a3f061f11:
Command is:
$ stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --msg $(nproc)
Results:
stress-ng: metrc: [1662] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [776] msg 964459758 60.00 154.63 632.77 16073484.12 1224879.38 21.17 2132
After 2b6a3f061f11:
stress-ng: metrc: [782] msg 1326214608 60.00 194.19 713.34 22102974.72 1461348.11 24.40 2140
And if I simply do:
ipc/msgutil.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 7a03f6d03de3..df0d7a067bcf 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
* compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
* and not CONFIG_IPC_NS.
*/
-struct ipc_namespace init_ipc_ns = {
+struct ipc_namespace init_ipc_ns ____cacheline_aligned_in_smp = {
.ns.__ns_ref = REFCOUNT_INIT(1),
.user_ns = &init_user_ns,
.ns.inum = ns_init_inum(&init_ipc_ns),
--
2.52.0
We observe:
stress-ng: metrc: [764] msg 1321700290 60.00 196.73 723.82 22028700.93 1435779.72 24.75 2116 aligned
So this really _does_ look like an alignment issue.
So I think I should just submit the above patch, right? Can you see how it behaves for you?
Cheers, Lorenzo
On Fri, Dec 12, 2025 at 4:03 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Dec 12, 2025 at 01:24:57PM +0100, Mateusz Guzik wrote:
> > So I had a look where the timing difference is coming from and I think
> > I have the answer: init_ipc_ns does not have a guaranteed cacheline
> > placement and things get moved around with the patch.
> >
> > On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
> >
> > before:
> > ffffffff839ffb60 T init_ipc_ns
> > ffffffff83a00020 t event_exit__msgrcv
> >
> > after:
> > ffffffff839ffbc0 T init_ipc_ns
> > ffffffff83a00080 t event_exit__msgrcv
> >
> > This is the pervasive problem of vars from all .o files placed
> > adjacent to each other, meaning changes in one .o file result in
> > offsets changing in other files and then you get performance
> > fluctuations as not-explicitly-padded variables share (or no longer
> > share) cachelines.
> >
> > I brought this up a year ago elsewhere:
> > https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html
> >
> > maybe i should pick it up again and see it through
> >
> > as for the thing at hand, someone(tm) will want to make sure the
> > namespace is cacheline aligned and possibly pad its own internals
> > afterwards. Personally I can't be bothered.
>
> Thanks! Looking it seems we accumulate a bunch of offsets in:
>
> print_fmt_dax_pte_fault_class
> print_fmt_dax_pmd_load_hole_class
> print_fmt_dax_pmd_fault_class
>
> (entries that the bloat-o-meter confirms changed in size)
>
> That continue on to eventually offset init_ipc_ns.
>
> It actually looked in my testing like performance _improved_ with the change,
> and I notice locally I end up aligned on the struct:
>
> ffffffff82be16a0 T init_ipc_ns
>
> ->
>
> ffffffff82be1700 T init_ipc_ns
>
> Examining stress-ng before 2b6a3f061f11:
>
> Command is:
> $ stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --msg $(nproc)
>
> Results:
>
> stress-ng: metrc: [1662] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
>
> stress-ng: metrc: [776] msg 964459758 60.00 154.63 632.77 16073484.12 1224879.38 21.17 2132
>
> After 2b6a3f061f11:
>
> stress-ng: metrc: [782] msg 1326214608 60.00 194.19 713.34 22102974.72 1461348.11 24.40 2140
>
> And if I simply do:
>
> ipc/msgutil.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ipc/msgutil.c b/ipc/msgutil.c
> index 7a03f6d03de3..df0d7a067bcf 100644
> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
> @@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
> * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
> * and not CONFIG_IPC_NS.
> */
> -struct ipc_namespace init_ipc_ns = {
> +struct ipc_namespace init_ipc_ns ____cacheline_aligned_in_smp = {
> .ns.__ns_ref = REFCOUNT_INIT(1),
> .user_ns = &init_user_ns,
> .ns.inum = ns_init_inum(&init_ipc_ns),
> --
> 2.52.0
>
> We observe:
>
> stress-ng: metrc: [764] msg 1321700290 60.00 196.73 723.82 22028700.93 1435779.72 24.75 2116 aligned
>
> So this really _does_ look like an alignment issue.
>
> So I think I should just submit the above patch right? Can you see how it behaves for you?
It is my recommendation to drop the matter now that we know the vm
bits patch is a victim of circumstance.
While slapping an alignment requirement on the var is the usual fix
and is a part of a full solution, the situation here is of the fucky
sort and stopping at precisely that might happen to be the bad call.
I state again there is contention on a spin lock and contention on a
rwsem separately, and most importantly that there is a level of load
on the rwsem which triggers pessimal behavior during which it abandons
spinning.
If the cacheline bouncing affects areas used by either of these locks,
you don't know if you got a win because you in fact *slowed it down*
and avoided the pessimal rwsem condition. Then if someone(tm) fixed up
rwsems, the ipc code would be leaving performance on the table with an
annotation specifically putting it in that spot.
So I would refrain from adding it as is.
If I was to work on this (which I wont), the first step would be to
stabilize the rwsem problem. Afterwards annotate the init_ipc_ns var
to stabilize the placement, afterwards rearrange the fields + add
padding as needed to reduce the bouncing. But this can't be done with
rwsem behavior at play.
On Fri, 12 Dec 2025 13:24:57 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> So I had a look where the timing difference is coming from and I think I
> have the answer: init_ipc_ns does not have a guaranteed cacheline placement
> and things get moved around with the patch.
>
> On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
>
> before:
> ffffffff839ffb60 T init_ipc_ns
> ffffffff83a00020 t event_exit__msgrcv
>
> after:
> ffffffff839ffbc0 T init_ipc_ns
> ffffffff83a00080 t event_exit__msgrcv
>
> This is the pervasive problem of vars from all .o files placed adjacent to
> each other, meaning changes in one .o file result in offsets changing in
> other files and then you get performance fluctuations as
> not-explicitly-padded variables share (or no longer share) cachelines.

Those look like text symbols, not data ones.
But moving code about can make the same sort of changes.

> I brought this up a year ago elsewhere:
> https://gcc.gnu.org/pipermail/gcc/2024-October/245004.html

My guess is that all the extra padding increases the cache footprint of
the code and that causes other cache lines to be displaced and then
needing to be re-read from memory.
So while a specific benchmark may improve, overall system performance
goes down.
Excessive loop unrolling has the same effect.

	David

> maybe i should pick it up again and see it through
>
> as for the thing at hand, someone(tm) will want to make sure the namespace
> is cacheline aligned and possibly pad its own internals afterwards.
> Personally I can't be bothered.
On Fri, Dec 12, 2025 at 2:02 PM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Fri, 12 Dec 2025 13:24:57 +0100
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> > So I had a look where the timing difference is coming from and I think I
> > have the answer: init_ipc_ns does not have a guaranteed cacheline placement
> > and things get moved around with the patch.
> >
> > On my kernels (nm vmlinux-newbits | sort -nk 1 | less)
> >
> > before:
> > ffffffff839ffb60 T init_ipc_ns
> > ffffffff83a00020 t event_exit__msgrcv
> >
> > after:
> > ffffffff839ffbc0 T init_ipc_ns
> > ffffffff83a00080 t event_exit__msgrcv
> >
> > This is the pervasive problem of vars from all .o files placed adjacent to
> > each other, meaning changes in one .o file result in offsets changing in
> > other files and then you get performance fluctuations as
> > not-explicitly-padded variables share (or no longer share) cachelines.
>
> Those look like text symbols, not data ones.

I don't know why it is annotated with T. There is only one symbol in the
kernel with that name and it is the var.

You can see the same thing for other vars, for example:
ffffffff83808040 T mmlist_lock
ffffffff83808080 T tasklist_lock
On Tue, Dec 09, 2025 at 10:26:22AM +0100, Mateusz Guzik wrote:
> On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> > As Mateusz pointed out off-list, the profiles look like mutexes are doing
> > less optimistic spinning and more sleeping. Which IMHO isn't something that
> > this change can directly affect.
> >
>
> Not mutexes but rwsems.
>
> The bench at hand has some of the code spinlocked, other parts take
> rwsems for reading *or* writing.
>
> I had a peek at rwsem implementation and to my understanding it can
> degrade to no spinning in a microbenchmark setting like this one,
> provided you are unlucky enough.
So we're just sleep, sleep sleeping? That would explain it...
I mean, is this an issue with the IPC design or with rwsems in general?
>
> In particular you can get unlucky if existing timings get perturbed,
> which I presume is happening after Lorenzo's patch.
>
> To demonstrate I wrote a toy patch which conditionally converts affected
> down_read calls into down_write (inlined at the end).
>
> While the original report is based on a 192-thread box, I was only able
> to test with 80 threads. Even so, the crux of the issue was nicely
> reproduced.
>
> ./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
I wonder if large core count matters here in particular, I was doing this
(albeit in a VM...) with 62 cores
>
> Top says (times vary, idle is growing over time):
> %Cpu(s): 3.3 us, 24.4 sy, 0.0 ni, 72.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> ... but if I flip the switch to down_write:
>
> %Cpu(s): 6.3 us, 80.9 sy, 0.0 ni, 12.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).
>
> In terms of performance I see the following:
>
> stress-ng: metrc: [2546] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
> # sysctl fs.magic_tunable=0 ## down_read
> stress-ng: metrc: [2546] msg 63353488 10.01 28.21 213.26 6331298.95 262362.91 30.16 2016
> # sysctl fs.magic_tunable=1 ## down_write
> stress-ng: metrc: [2036] msg 455014809 10.00 48.79 676.42 45496870.65 627425.68 90.64 2056
>
> That is to say rwsem code is the real culprit and Lorenzo is a random
> (albeit deserving) victim.
;)
Ack interesting.
>
> I see two action items:
> - massage the patch back to a state where things compile to the same asm
> as before as it clearly avoidably regressed regardless of the
> aforementioned issue
I think given the fix is so trivially obviously correct and not egregious to
take, we should do that for the time being.
> - figure out what to do with rwsem code for read vs write spinning
I have no idea how the code generation could have varied here. I did wonder
if the changes somehow pessimised the optimiser for heuristic compiler
reasons, or if somewhere there is something that has an issue with a
certain incantation of cast (but only after a certain flavour of nested
macro expansion?...)
>
> I'm not picking this up for the time being, but I might look at this at
> some point.
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index cd4a3db4659a..de1ef700d144 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
> return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> }
>
> +unsigned long magic_tunable;
> +
> static const struct ctl_table fs_stat_sysctls[] = {
> {
> .procname = "file-nr",
> @@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
> .extra1 = SYSCTL_LONG_ZERO,
> .extra2 = SYSCTL_LONG_MAX,
> },
> + {
> + .procname = "magic_tunable",
> + .data = &magic_tunable,
> + .maxlen = sizeof(magic_tunable),
> + .mode = 0644,
> + .proc_handler = proc_doulongvec_minmax,
> + .extra1 = SYSCTL_LONG_ZERO,
> + .extra2 = SYSCTL_LONG_MAX,
> + },
> +
> {
> .procname = "nr_open",
> .data = &sysctl_nr_open,
> diff --git a/ipc/msg.c b/ipc/msg.c
> index ee6af4fe52bf..fa835ea53e09 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
> return err;
> }
>
> +extern unsigned long magic_tunable;
> +
> static int msgctl_info(struct ipc_namespace *ns, int msqid,
> int cmd, struct msginfo *msginfo)
> {
> @@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
> msginfo->msgmnb = ns->msg_ctlmnb;
> msginfo->msgssz = MSGSSZ;
> msginfo->msgseg = MSGSEG;
> - down_read(&msg_ids(ns).rwsem);
> - if (cmd == MSG_INFO)
> - msginfo->msgpool = msg_ids(ns).in_use;
> - max_idx = ipc_get_maxidx(&msg_ids(ns));
> - up_read(&msg_ids(ns).rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_read(&msg_ids(ns).rwsem);
> + } else {
> + down_write(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_write(&msg_ids(ns).rwsem);
> + }
> if (cmd == MSG_INFO) {
> msginfo->msgmap = min_t(int,
> percpu_counter_sum(&ns->percpu_msg_hdrs),
> diff --git a/ipc/util.c b/ipc/util.c
> index cae60f11d9c2..c65c8289a54b 100644
> --- a/ipc/util.c
> +++ b/ipc/util.c
> @@ -771,6 +771,7 @@ struct ipc_proc_iter {
> struct ipc_namespace *ns;
> struct pid_namespace *pid_ns;
> struct ipc_proc_iface *iface;
> + bool writelocked;
> };
>
> struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
> @@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
> return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
> }
>
> +extern unsigned long magic_tunable;
> +
> /*
> * File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
> * SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
> @@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
> * Take the lock - this will be released by the corresponding
> * call to stop().
> */
> - down_read(&ids->rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&ids->rwsem);
> + iter->writelocked = false;
> + } else {
> + down_write(&ids->rwsem);
> + iter->writelocked = true;
> + }
>
> /* pos < 0 is invalid */
> if (*pos < 0)
> @@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
>
> ids = &iter->ns->ids[iface->ids];
> /* Release the lock we took in start() */
> - up_read(&ids->rwsem);
> + if (!iter->writelocked)
> + up_read(&ids->rwsem);
> + else
> + up_write(&ids->rwsem);
> }
>
> static int sysvipc_proc_show(struct seq_file *s, void *it)
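To condense the locking pattern the toy patch open-codes at each site: take
the rwsem for reading or writing depending on the tunable, and remember which
was taken so the unlock matches even if the tunable changes in between. A
minimal sketch with hypothetical helper names (ipc_ids_lock()/ipc_ids_unlock()
do not exist in the patch):

/*
 * Sketch only: hypothetical helpers condensing the toy patch's pattern.
 * READ_ONCE() samples the tunable exactly once; the result must be
 * carried to the matching unlock, which is why sysvipc_proc_start()
 * records 'writelocked' in the iterator.
 */
static bool ipc_ids_lock(struct ipc_ids *ids)
{
	bool write = READ_ONCE(magic_tunable);

	if (write)
		down_write(&ids->rwsem);
	else
		down_read(&ids->rwsem);
	return write;
}

static void ipc_ids_unlock(struct ipc_ids *ids, bool write)
{
	if (write)
		up_write(&ids->rwsem);
	else
		up_read(&ids->rwsem);
}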
On Wed, Dec 10, 2025 at 5:18 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Dec 09, 2025 at 10:26:22AM +0100, Mateusz Guzik wrote:
> > On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> > > As Mateusz pointed out off-list, the profiles look like mutexes are doing
> > > less optimistic spinning and more sleeping. Which IMHO isn't something that
> > > this change can directly affect.
> >
> > Not mutexes but rwsems.
> >
> > The bench at hand has some of the code spinlocked, other parts take
> > rwsems for reading *or* writing.
> >
> > I had a peek at rwsem implementation and to my understanding it can
> > degrade to no spinning in a microbenchmark setting like this one,
> > provided you are unlucky enough.
>
> So we're just sleep, sleep sleeping? That would explain it...
>
> I mean is this an issue with the IPC design or the rwsems in general?

The ipc code is not doing itself any favors, but is probably not worth
working on either.

The locking code suffers internal problems. While there is no such thing
as the fastest possible lock, let alone a sleepable rw lock, it is pretty
clear the current code leaves perf on the table even with that caveat.

> > In particular you can get unlucky if existing timings get perturbed,
> > which I presume is happening after Lorenzo's patch.
> >
> > To demonstrate I wrote a toy patch which conditionally converts affected
> > down_read calls into down_write (inlined at the end).
> >
> > While the original report is based on a 192-thread box, I was only able
> > to test with 80 threads. Even so, the crux of the issue was nicely
> > reproduced.
> >
> > ./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
>
> I wonder if a large core count matters here in particular, I was doing this
> (albeit in a VM...) with 62 cores

You need a large enough count to generate enough contention in the first
place. How much it takes for this bench I have not tried figuring out,
even for my hw.
On Mon, 8 Dec 2025 16:42:43 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Sat, Dec 06, 2025 at 04:43:57PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Dec 05, 2025 at 09:34:49PM +0000, David Laight wrote:
> > > On Fri, 5 Dec 2025 19:18:56 +0000
> > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > On Fri, Dec 05, 2025 at 06:43:42PM +0000, David Laight wrote:
> > > > > On Fri, 5 Dec 2025 17:50:37 +0000
> > > > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > > Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> > > > > > how VMA flags are declared, utilising an enum of VMA bit values and
> > > > > > ifdef-fery VM_xxx flag declarations via macro.
> > > > > >
> > > > > > As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> > > > > > the newly introduced VMA bit numbers.
> > > > > >
> > > > > > However, use of this macro results in apparently unfortunate macro
> > > > > > expansion and a performance degradation. This appears to be due
> > > > > > to the (__force int), which is required for the sparse typechecking to
> > > > > > work.
> > > > >
> > > > > Does sparse complain if you just add 0? As in:
> > > > > #define INIT_VM_FLAG(name) BIT(VMA_ ## name ## _BIT + 0u)
> > > > >
> > > > > That should change the type without affecting what BIT() expands to.
> > > >
> > > > Thanks, checked that and unfortunately that doesn't satisfy sparse :)
> > > >
> > > > I don't think it's too crazy to use 1UL << here, just very frustrating (TM)
> > > > that this is an issue.
> > >
> > > I might use some of my copious spare time (ha) to see why BIT() fails.
> > > I bet it is just too complex for its own good.
> > > Personally I'm fine with both explicit (1ul << n) and hex constants.
> > > The latter are definitely most useful if you ever look at hexdumps.
> >
> > Thanks :) yeah, I just didn't want to go down that rabbit hole myself, as I
> > seemed to have the answer and wanted to get it fixed, but I'm obviously
> > quite curious as to what on earth is causing that.
>
> I did wonder about _calc_vm_trans(), given the 'interesting' stuff it does.
>
> Maybe I should fiddle with that and see...
Hmmm...
/*
* Optimisation macro. It is equivalent to:
* (x & bit1) ? bit2 : 0
* but this version is faster.
* ("bit1" and "bit2" must be single bits)
*/
#define _calc_vm_trans(x, bit1, bit2) \
((!(bit1) || !(bit2)) ? 0 : \
((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
: ((x) & (bit1)) / ((bit1) / (bit2))))
The comment fails to mention it is only sane for constants.
If nothing else, 9 expansions of BIT() are going to generate a very
long line.
For starters, make it a statement expression and use __auto_type _bit1 = bit1.
Then add a check for both _bit1 and _bit2 being constants.
It is also worth checking the compiler doesn't do it for you.
Looks like gcc 7.1 onwards generate the 'optimised' code.
https://godbolt.org/z/EGGE56E3r
David
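A rough sketch of the rework David describes, with the _bit1/_bit2 locals and
the BUILD_BUG_ON()-based constant check as illustrative, untested choices:

/*
 * Sketch only, untested: evaluate each bit argument exactly once via a
 * statement expression, and require both to be compile-time constants
 * so the multiply/divide folds down to a shift.
 */
#define _calc_vm_trans(x, bit1, bit2)					\
({									\
	__auto_type _bit1 = (bit1);					\
	__auto_type _bit2 = (bit2);					\
									\
	BUILD_BUG_ON(!__builtin_constant_p(_bit1));			\
	BUILD_BUG_ON(!__builtin_constant_p(_bit2));			\
	(!_bit1 || !_bit2) ? 0 :					\
		(_bit1 <= _bit2 ? ((x) & _bit1) * (_bit2 / _bit1)	\
				: ((x) & _bit1) / (_bit1 / _bit2));	\
})

One caveat with this shape: a statement expression cannot appear in file-scope
initialisers, so it only works if every _calc_vm_trans() user sits in function
context.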
On Fri, Dec 05, 2025 at 05:50:37PM +0000, Lorenzo Stoakes wrote:
> Commit 2b6a3f061f11 ("mm: declare VMA flags by bit") significantly changed
> how VMA flags are declared, utilising an enum of VMA bit values and
> ifdef-fery VM_xxx flag declarations via macro.
>
> As part of this change, it uses INIT_VM_FLAG() to define VM_xxx flags from
> the newly introduced VMA bit numbers.
>
> However, use of this macro results in apparently unfortunate macro
> expansion and a performance degradation. This appears to be due
> to the (__force int), which is required for the sparse typechecking to
> work.
>
> Avoid macro expansion issues by simply using 1UL << bitnum.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202512041634.150c7e4f-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Sorry, forgot to add a Fixes tag... :)
Please also update with that when the PR is upstream, apologies :P
> ---
>
> Andrew - note I've referenced the linux-next commit number above, could you
> replace with the upstream commit hash once your PR is taken? Thanks!
>
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2f38fb68840..c4438b30c140 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -395,7 +395,8 @@ enum {
> #undef DECLARE_VMA_BIT
> #undef DECLARE_VMA_BIT_ALIAS
>
> -#define INIT_VM_FLAG(name) BIT((__force int) VMA_ ## name ## _BIT)
> +#define INIT_VM_FLAG(name) (1UL << (__force int)(VMA_ ## name ## _BIT))
> +
> #define VM_READ INIT_VM_FLAG(READ)
> #define VM_WRITE INIT_VM_FLAG(WRITE)
> #define VM_EXEC INIT_VM_FLAG(EXEC)
> --
> 2.52.0
Cheers, Lorenzo