include/uapi/asm-generic/mman-common.h | 6 + mm/madvise.c | 206 +++++++++++++++++++------ 2 files changed, 168 insertions(+), 44 deletions(-)
REVIEWERS NOTES: ================ This is a VERY EARLY version of the idea, it's relatively untested, and I'm 'putting it out there' for feedback. Any serious version of this will add a bunch of self-tests to assert correct behaviour and I will more carefully confirm everything's working. This is based on discussion arising from Usama's series [0], SJ's input on the thread around process_madvise() behaviour [1] (and a subsequent response by me [2]) and prior discussion about a new madvise() interface [3]. [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ ================ Currently, we are rather restricted in how madvise() operations proceed. While effort has been put in to expanding what process_madvise() can do (that is - unrestricted application of advice to the local process alongside recent improvements on the efficiency of TLB operations over these batvches), we are still constrained by existing madvise() limitations and default behaviours. This series makes use of the currently unused flags field in process_madvise() to provide more flexiblity. It introduces four flags: 1. PMADV_SKIP_ERRORS Currently, when an error arises applying advice in any individual VMA (keeping in mind that a range specified to madvise() or as part of the iovec passed to process_madvise()), the operation stops where it is and returns an error. This might not be the desired behaviour of the user, who may wish instead for the operation to be 'best effort'. By setting this flag, that behaviour is obtained. Since process_madvise() would trivially, if skipping errors, simply return the input vector size, we instead return the number of entries in the vector which completed successfully without error. The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. 2. PMADV_NO_ERROR_ON_UNMAPPED Currently madvise() has the peculiar behaviour of, if the range specified to it contains unmapped range(s), completing the full operation, but ultimately returning -ENOMEM. In the case of process_madvise(), this is fatal, as the operation will stop immediately upon this occurring. By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes unmapped areas to simply be entirely ignored. 3. PMADV_SET_FORK_EXEC_DEFAULT It may be desirable for a user to specify that all VMAs mapped in a process address space default to having an madvise() behaviour established by default, in such a fashion as that this persists across fork/exec. Since this is a very powerful option that would make no sense for many advice modes, we explicitly only permit known-safe flags here (currently MADV_HUGEPAGE and MADV_NOHUGEPAGE only). 4. PMADV_ENTIRE_ADDRESS_SPACE It can be annoying, should a user wish to apply madvise() to all VMAs in an address space, to have to add a singular large entry to the input iovec. So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified, we expect the user to pass NULL and -1 to the vec and vlen parameters respectively so they explicitly acknowledge that these will be ignored, e.g.: process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE, PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS); Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may well be the case that incompatible VMAs will be encountered that ought to be skipped. If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the entire address space spans at least some gaps. Lorenzo Stoakes (5): mm: madvise: refactor madvise_populate() mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag include/uapi/asm-generic/mman-common.h | 6 + mm/madvise.c | 206 +++++++++++++++++++------ 2 files changed, 168 insertions(+), 44 deletions(-) -- 2.49.0
(cc'ing linux-api) On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > REVIEWERS NOTES: > ================ > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > 'putting it out there' for feedback. Any serious version of this will add a > bunch of self-tests to assert correct behaviour and I will more carefully > confirm everything's working. > > This is based on discussion arising from Usama's series [0], SJ's input on > the thread around process_madvise() behaviour [1] (and a subsequent > response by me [2]) and prior discussion about a new madvise() interface > [3]. > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > ================ > > Currently, we are rather restricted in how madvise() operations > proceed. While effort has been put in to expanding what process_madvise() > can do (that is - unrestricted application of advice to the local process > alongside recent improvements on the efficiency of TLB operations over > these batvches), we are still constrained by existing madvise() limitations > and default behaviours. > > This series makes use of the currently unused flags field in > process_madvise() to provide more flexiblity. > > It introduces four flags: > > 1. PMADV_SKIP_ERRORS > > Currently, when an error arises applying advice in any individual VMA > (keeping in mind that a range specified to madvise() or as part of the > iovec passed to process_madvise()), the operation stops where it is and > returns an error. > > This might not be the desired behaviour of the user, who may wish instead > for the operation to be 'best effort'. By setting this flag, that behaviour > is obtained. > > Since process_madvise() would trivially, if skipping errors, simply return > the input vector size, we instead return the number of entries in the > vector which completed successfully without error. > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > Currently madvise() has the peculiar behaviour of, if the range specified > to it contains unmapped range(s), completing the full operation, but > ultimately returning -ENOMEM. > > In the case of process_madvise(), this is fatal, as the operation will stop > immediately upon this occurring. > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > unmapped areas to simply be entirely ignored. > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > It may be desirable for a user to specify that all VMAs mapped in a process > address space default to having an madvise() behaviour established by > default, in such a fashion as that this persists across fork/exec. > > Since this is a very powerful option that would make no sense for many > advice modes, we explicitly only permit known-safe flags here (currently > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > 4. PMADV_ENTIRE_ADDRESS_SPACE > > It can be annoying, should a user wish to apply madvise() to all VMAs in an > address space, to have to add a singular large entry to the input iovec. > > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified, > we expect the user to pass NULL and -1 to the vec and vlen parameters > respectively so they explicitly acknowledge that these will be ignored, > e.g.: > > process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE, > PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS); > > Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may > well be the case that incompatible VMAs will be encountered that ought to > be skipped. > > If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise > implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the > entire address space spans at least some gaps. > > Lorenzo Stoakes (5): > mm: madvise: refactor madvise_populate() > mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag > mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag > mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag > mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag > > include/uapi/asm-generic/mman-common.h | 6 + > mm/madvise.c | 206 +++++++++++++++++++------ > 2 files changed, 168 insertions(+), 44 deletions(-) > > -- > 2.49.0 > -- Sincerely yours, Mike.
On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > REVIEWERS NOTES: > ================ > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > 'putting it out there' for feedback. Any serious version of this will add a > bunch of self-tests to assert correct behaviour and I will more carefully > confirm everything's working. > > This is based on discussion arising from Usama's series [0], SJ's input on > the thread around process_madvise() behaviour [1] (and a subsequent > response by me [2]) and prior discussion about a new madvise() interface > [3]. > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > ================ > > Currently, we are rather restricted in how madvise() operations > proceed. While effort has been put in to expanding what process_madvise() > can do (that is - unrestricted application of advice to the local process > alongside recent improvements on the efficiency of TLB operations over > these batvches), we are still constrained by existing madvise() limitations > and default behaviours. > > This series makes use of the currently unused flags field in > process_madvise() to provide more flexiblity. > > It introduces four flags: > > 1. PMADV_SKIP_ERRORS > > Currently, when an error arises applying advice in any individual VMA > (keeping in mind that a range specified to madvise() or as part of the > iovec passed to process_madvise()), the operation stops where it is and > returns an error. > > This might not be the desired behaviour of the user, who may wish instead > for the operation to be 'best effort'. By setting this flag, that behaviour > is obtained. > > Since process_madvise() would trivially, if skipping errors, simply return > the input vector size, we instead return the number of entries in the > vector which completed successfully without error. > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > Currently madvise() has the peculiar behaviour of, if the range specified > to it contains unmapped range(s), completing the full operation, but > ultimately returning -ENOMEM. > > In the case of process_madvise(), this is fatal, as the operation will stop > immediately upon this occurring. > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > unmapped areas to simply be entirely ignored. Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why PMADV_SKIP_ERRORS is not enough? I don't see a need for PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS? > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > It may be desirable for a user to specify that all VMAs mapped in a process > address space default to having an madvise() behaviour established by > default, in such a fashion as that this persists across fork/exec. > > Since this is a very powerful option that would make no sense for many > advice modes, we explicitly only permit known-safe flags here (currently > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). Other flags seems general enough but this one is just weird. This is exactly the scenario for prctl() like interface. You are trying to make process_madvise() like prctl() and I can see process_madvise() would be included in all the hate that prctl() receives. Let me ask in a different way. Eventually we want to be in a state where hugepages works out of the box for all workloads. In that state what would the need for this flag unless you have use-cases other than hugepages. To me, there is a general consensus that prctl is a hacky interface, so having some intermediate solution through prctl until hugepages are good out of the box seems more reasonable. > > 4. PMADV_ENTIRE_ADDRESS_SPACE > > It can be annoying, should a user wish to apply madvise() to all VMAs in an > address space, to have to add a singular large entry to the input iovec. > > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified, > we expect the user to pass NULL and -1 to the vec and vlen parameters > respectively so they explicitly acknowledge that these will be ignored, > e.g.: > > process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE, > PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS); > I still don't see a need for this flag. Why not the following? process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?
On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote: > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > > REVIEWERS NOTES: > > ================ > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > 'putting it out there' for feedback. Any serious version of this will add a > > bunch of self-tests to assert correct behaviour and I will more carefully > > confirm everything's working. > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > the thread around process_madvise() behaviour [1] (and a subsequent > > response by me [2]) and prior discussion about a new madvise() interface > > [3]. > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > ================ > > > > Currently, we are rather restricted in how madvise() operations > > proceed. While effort has been put in to expanding what process_madvise() > > can do (that is - unrestricted application of advice to the local process > > alongside recent improvements on the efficiency of TLB operations over > > these batvches), we are still constrained by existing madvise() limitations > > and default behaviours. > > > > This series makes use of the currently unused flags field in > > process_madvise() to provide more flexiblity. > > > > It introduces four flags: > > > > 1. PMADV_SKIP_ERRORS > > > > Currently, when an error arises applying advice in any individual VMA > > (keeping in mind that a range specified to madvise() or as part of the > > iovec passed to process_madvise()), the operation stops where it is and > > returns an error. > > > > This might not be the desired behaviour of the user, who may wish instead > > for the operation to be 'best effort'. By setting this flag, that behaviour > > is obtained. > > > > Since process_madvise() would trivially, if skipping errors, simply return > > the input vector size, we instead return the number of entries in the > > vector which completed successfully without error. > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > > > Currently madvise() has the peculiar behaviour of, if the range specified > > to it contains unmapped range(s), completing the full operation, but > > ultimately returning -ENOMEM. > > > > In the case of process_madvise(), this is fatal, as the operation will stop > > immediately upon this occurring. > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > > unmapped areas to simply be entirely ignored. > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why > PMADV_SKIP_ERRORS is not enough? I don't see a need for > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS? I thought I already explained this above: "In the case of process_madvise(), this is fatal, as the operation will stop immediately upon this occurring." This is somewhat bizarre behaviour. You specify multiple vector entries spanning different ranges, but one spans some unmapped space and now the whole process_madvise() operation grinds to a halt, except the vector entry containing ranges including unmapped space is completed. This is strange behaviour, and it makes sense to me to optionally disable this. If you were looping around doing an madvise(), this would _not_ occur, you could filter out the -ENOMEM's. It's a really weird peculiarity in process_madvise(). Moreover, you might not want an error reported, that possibly duplicates _real_ -ENOMEM errors, when you simply encounter unmapped addresses. Finally, if you perform an operation across the entire address space as proposed you may wish to stop on actual error but not on the (inevitable at least in 64-bit space) gaps you'll encounter. > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > address space default to having an madvise() behaviour established by > > default, in such a fashion as that this persists across fork/exec. > > > > Since this is a very powerful option that would make no sense for many > > advice modes, we explicitly only permit known-safe flags here (currently > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > Other flags seems general enough but this one is just weird. This is > exactly the scenario for prctl() like interface. You are trying to make > process_madvise() like prctl() and I can see process_madvise() would be > included in all the hate that prctl() receives. I'm really not sure what you mean. prctl() has no rhyme nor reason, so not sure what a 'prctl() like interface' means here, and you're not explaining what you mean by that. Presumably you mean you find this odd as you feel it sits outside the realm of madvise() behaviour. But I'd suggest it does not - the idea is to align _everything_ with madvise(). Rather than adding an entirely arbitrary function in prctl(), we are going the other way - keeping everything relating to madvise()-like modification of memory _in mm_ and _in madvise()_, rather than bitrotting away in kernel/sys.c. So we get alignment in the fact that we're saying 'we establish a _default_ madvise flag for a process'. We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what you guys at meta want while also opening the door to doing so in future if it makes sense to. This couldn't be more different than putting some arbitrary code relating to mm in the 'netherrealm' of prctl(). > > Let me ask in a different way. Eventually we want to be in a state where > hugepages works out of the box for all workloads. In that state what > would the need for this flag unless you have use-cases other than > hugepages. To me, there is a general consensus that prctl is a hacky > interface, so having some intermediate solution through prctl until > hugepages are good out of the box seems more reasonable. No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will have to be supported. In a future where things are automatic, these may be no-ops in 'auto' mode. But the requirement to have these flags will still exist, the requirement to do madvise() operations will still exist, we're simply expanding this functionality. The problem arises the other way around when we shovel mm stuff in kernel/sys.c. > > > > > 4. PMADV_ENTIRE_ADDRESS_SPACE > > > > It can be annoying, should a user wish to apply madvise() to all VMAs in an > > address space, to have to add a singular large entry to the input iovec. > > > > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified, > > we expect the user to pass NULL and -1 to the vec and vlen parameters > > respectively so they explicitly acknowledge that these will be ignored, > > e.g.: > > > > process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE, > > PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS); > > > > I still don't see a need for this flag. Why not the following? > > process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)? I don't think iovec=NULL, vlen=-1 means 'everything' in any other interface anywhere else in the kernel? Why would we reasonably expect anybody to know/assume that here? It's surely far better to very explicitly say as much? The idea with NULL, -1 is that you're making sure the user knows any provided iovec, vlen will be ignored, and better to just disallow the user providing those at all (similar to mmap requiring -1 for fd for anon mappings).
On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote: > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > > > REVIEWERS NOTES: > > > ================ > > > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > > 'putting it out there' for feedback. Any serious version of this will add a > > > bunch of self-tests to assert correct behaviour and I will more carefully > > > confirm everything's working. > > > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > > the thread around process_madvise() behaviour [1] (and a subsequent > > > response by me [2]) and prior discussion about a new madvise() interface > > > [3]. > > > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > > > ================ > > > > > > Currently, we are rather restricted in how madvise() operations > > > proceed. While effort has been put in to expanding what process_madvise() > > > can do (that is - unrestricted application of advice to the local process > > > alongside recent improvements on the efficiency of TLB operations over > > > these batvches), we are still constrained by existing madvise() limitations > > > and default behaviours. > > > > > > This series makes use of the currently unused flags field in > > > process_madvise() to provide more flexiblity. > > > > > > It introduces four flags: > > > > > > 1. PMADV_SKIP_ERRORS > > > > > > Currently, when an error arises applying advice in any individual VMA > > > (keeping in mind that a range specified to madvise() or as part of the > > > iovec passed to process_madvise()), the operation stops where it is and > > > returns an error. > > > > > > This might not be the desired behaviour of the user, who may wish instead > > > for the operation to be 'best effort'. By setting this flag, that behaviour > > > is obtained. > > > > > > Since process_madvise() would trivially, if skipping errors, simply return > > > the input vector size, we instead return the number of entries in the > > > vector which completed successfully without error. > > > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > > > > > Currently madvise() has the peculiar behaviour of, if the range specified > > > to it contains unmapped range(s), completing the full operation, but > > > ultimately returning -ENOMEM. > > > > > > In the case of process_madvise(), this is fatal, as the operation will stop > > > immediately upon this occurring. > > > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > > > unmapped areas to simply be entirely ignored. > > > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why > > PMADV_SKIP_ERRORS is not enough? I don't see a need for > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS? > > I thought I already explained this above: > > "In the case of process_madvise(), this is fatal, as the operation > will stop immediately upon this occurring." > > This is somewhat bizarre behaviour. You specify multiple vector entries > spanning different ranges, but one spans some unmapped space and now the > whole process_madvise() operation grinds to a halt, except the vector entry > containing ranges including unmapped space is completed. > > This is strange behaviour, and it makes sense to me to optionally disable > this. > > If you were looping around doing an madvise(), this would _not_ occur, you > could filter out the -ENOMEM's. It's a really weird peculiarity in > process_madvise(). > > Moreover, you might not want an error reported, that possibly duplicates > _real_ -ENOMEM errors, when you simply encounter unmapped addresses. > > Finally, if you perform an operation across the entire address space as > proposed you may wish to stop on actual error but not on the (inevitable at > least in 64-bit space) gaps you'll encounter. So, we *may* wish to stop on actual error, do you have a more concrete example? We should not add an API on a case which may be needed. We can always add stuff later when the actual concrete use-cases come up. > > > > > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > > address space default to having an madvise() behaviour established by > > > default, in such a fashion as that this persists across fork/exec. > > > > > > Since this is a very powerful option that would make no sense for many > > > advice modes, we explicitly only permit known-safe flags here (currently > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > > > Other flags seems general enough but this one is just weird. This is > > exactly the scenario for prctl() like interface. You are trying to make > > process_madvise() like prctl() and I can see process_madvise() would be > > included in all the hate that prctl() receives. > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not > sure what a 'prctl() like interface' means here, and you're not explaining > what you mean by that. I meant it applies a property at the task or process level and has examples where those properties are inherited to children. > > Presumably you mean you find this odd as you feel it sits outside the realm > of madvise() behaviour. The persistence across exec seems weird. > > But I'd suggest it does not - the idea is to align _everything_ with > madvise(). Rather than adding an entirely arbitrary function in prctl(), we > are going the other way - keeping everything relating to madvise()-like > modification of memory _in mm_ and _in madvise()_, rather than bitrotting > away in kernel/sys.c. The above para seems like you are talking about code which can be moved to mm. > > So we get alignment in the fact that we're saying 'we establish a _default_ > madvise flag for a process'. I think this is an important point. So, we want to introduce a way to set a process level property which can be inherited through fork/exec. With that in mind, is process_madvise() (or even madvise()) really a good interface for it? There is no need for address ranges for such use-case. > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what > you guys at meta want while also opening the door to doing so in future if > it makes sense to. Please drop the "you guys at meta". We should be aiming for what is good for all (or most) linux users. Whatever is done here will be incorporated in systemd which will be used very widely. > > This couldn't be more different than putting some arbitrary code relating > to mm in the 'netherrealm' of prctl(). > > > > > > Let me ask in a different way. Eventually we want to be in a state where > > hugepages works out of the box for all workloads. In that state what > > would the need for this flag unless you have use-cases other than > > hugepages. To me, there is a general consensus that prctl is a hacky > > interface, so having some intermediate solution through prctl until > > hugepages are good out of the box seems more reasonable. > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will > have to be supported. In a future where things are automatic, these may be > no-ops in 'auto' mode. > > But the requirement to have these flags will still exist, the requirement > to do madvise() operations will still exist, we're simply expanding this > functionality. > > The problem arises the other way around when we shovel mm stuff in > kernel/sys.c. I think you mixing the location of the code and the interface which will remain relevant long term. I don't see process_madvise (or madvise) good interface for this specific functionality (i.e. process level property that gets inherited through fork/exec). Now we can add a new syscall for this but to me, particularly for hugepage, this functionality is needed temporarily until hugepages are good out of the box. However if there is a use-case other than hugepages then yes, why not a new syscall.
* Shakeel Butt <shakeel.butt@linux.dev> [250520 15:49]: > On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote: > > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote: > > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > > > > REVIEWERS NOTES: > > > > ================ > > > > > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > > > 'putting it out there' for feedback. Any serious version of this will add a > > > > bunch of self-tests to assert correct behaviour and I will more carefully > > > > confirm everything's working. > > > > > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > > > the thread around process_madvise() behaviour [1] (and a subsequent > > > > response by me [2]) and prior discussion about a new madvise() interface > > > > [3]. > > > > > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > > > > > ================ ... > > > > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > > > address space default to having an madvise() behaviour established by > > > > default, in such a fashion as that this persists across fork/exec. > > > > > > > > Since this is a very powerful option that would make no sense for many > > > > advice modes, we explicitly only permit known-safe flags here (currently > > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > > > > > Other flags seems general enough but this one is just weird. This is > > > exactly the scenario for prctl() like interface. You are trying to make > > > process_madvise() like prctl() and I can see process_madvise() would be > > > included in all the hate that prctl() receives. > > > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not > > sure what a 'prctl() like interface' means here, and you're not explaining > > what you mean by that. > > I meant it applies a property at the task or process level and has > examples where those properties are inherited to children. Okay, I don't think we should have any illusions of how clear cut either of these things are. prctl has PR_SET_VMA, which isn't to to do with the process. madvise() can be called with a range covering the entire address space and may or may not fail on gaps (depending on the call). process_madvise() was to madvise on a process range - including all of it, so I don't see either as the wrong place. Or, really, the best place. > > > > > Presumably you mean you find this odd as you feel it sits outside the realm > > of madvise() behaviour. > > The persistence across exec seems weird. > > > > > But I'd suggest it does not - the idea is to align _everything_ with > > madvise(). Rather than adding an entirely arbitrary function in prctl(), we > > are going the other way - keeping everything relating to madvise()-like > > modification of memory _in mm_ and _in madvise()_, rather than bitrotting > > away in kernel/sys.c. > > The above para seems like you are talking about code which can be moved > to mm. > > > > > So we get alignment in the fact that we're saying 'we establish a _default_ > > madvise flag for a process'. > > I think this is an important point. So, we want to introduce a way to > set a process level property which can be inherited through fork/exec. > With that in mind, is process_madvise() (or even madvise()) really a > good interface for it? There is no need for address ranges for such > use-case. > > > > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what > > you guys at meta want while also opening the door to doing so in future if > > it makes sense to. > > Please drop the "you guys at meta". We should be aiming for what is good > for all (or most) linux users. Whatever is done here will be > incorporated in systemd which will be used very widely. Agreed, the aim is to provide the most flexible interface for the majority of the users. In my view, there isn't a clear way forward yet: I don't want to make process_madvise() a dumping ground and I don't want to make prctl() a bigger mess. But right now, it seems there is more than one user arguing for this particular implementation, but in fact there two employees at the same company and that isn't clear in this conversation to the casual reader. And the opinions are being cross-referenced as evidence of how others view it. I have no idea if you two know each other, or even if you know that you are both at meta - but I think if I chimed in from some other email address saying how crazy prctl() is over this other way, then you'd probably feel compelled to say something about where I work (I hope you would tbh)? You could add a (meta) to your email address or signature so that it is implicit that there may be other influences on opinion. I'm not saying there is an influence on opinion or that this opinion is wrong, but it is not fully transparent and complicates the conversation. I don't think it is unreasonable for people to point out that everyone arguing for one solution is at the same company when it isn't obvious, do you? > > > > > This couldn't be more different than putting some arbitrary code relating > > to mm in the 'netherrealm' of prctl(). > > > > > > > > > > Let me ask in a different way. Eventually we want to be in a state where > > > hugepages works out of the box for all workloads. In that state what > > > would the need for this flag unless you have use-cases other than > > > hugepages. To me, there is a general consensus that prctl is a hacky > > > interface, so having some intermediate solution through prctl until > > > hugepages are good out of the box seems more reasonable. (oh my, I'm going to sound like a jerk here.. sorry) This is a terrible argument. This intermediate solution may outlive us all.. it may even shorten some of our lives due to stress dealing with it. Will the removal of the prctl() call coincide with the year of the Linux desktop? Or maybe the removal of the mmap lock... wait. The hacky interface of prctl() is one of my main gripes with that proposal, it's not a reason to go with it. > > > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will > > have to be supported. In a future where things are automatic, these may be > > no-ops in 'auto' mode. > > > > But the requirement to have these flags will still exist, the requirement > > to do madvise() operations will still exist, we're simply expanding this > > functionality. > > > > The problem arises the other way around when we shovel mm stuff in > > kernel/sys.c. > > I think you mixing the location of the code and the interface which will > remain relevant long term. I don't see process_madvise (or madvise) good > interface for this specific functionality (i.e. process level property > that gets inherited through fork/exec). Process is literally in the name and we are applying an madvise flag across the entire range. The inherited through fork/exec is a much stronger reason that I see for not doing it in process_madvise(). > Now we can add a new syscall for > this but to me, particularly for hugepage, this functionality is needed > temporarily until hugepages are good out of the box. I'm not holding my breath. > However if there is > a use-case other than hugepages then yes, why not a new syscall. > We also have Barry looking at prctl() for avoiding LRU updates on exit [1], which I guess is process related.. or page related.. really it is mm flags related in his implementation, sort of like this feature. But it will probably only be used by android, so... [1]. https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/ Thanks, Liam
On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote: > On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote: > > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote: > > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > > > > REVIEWERS NOTES: > > > > ================ > > > > > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > > > 'putting it out there' for feedback. Any serious version of this will add a > > > > bunch of self-tests to assert correct behaviour and I will more carefully > > > > confirm everything's working. > > > > > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > > > the thread around process_madvise() behaviour [1] (and a subsequent > > > > response by me [2]) and prior discussion about a new madvise() interface > > > > [3]. > > > > > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > > > > > ================ > > > > > > > > Currently, we are rather restricted in how madvise() operations > > > > proceed. While effort has been put in to expanding what process_madvise() > > > > can do (that is - unrestricted application of advice to the local process > > > > alongside recent improvements on the efficiency of TLB operations over > > > > these batvches), we are still constrained by existing madvise() limitations > > > > and default behaviours. > > > > > > > > This series makes use of the currently unused flags field in > > > > process_madvise() to provide more flexiblity. > > > > > > > > It introduces four flags: > > > > > > > > 1. PMADV_SKIP_ERRORS > > > > > > > > Currently, when an error arises applying advice in any individual VMA > > > > (keeping in mind that a range specified to madvise() or as part of the > > > > iovec passed to process_madvise()), the operation stops where it is and > > > > returns an error. > > > > > > > > This might not be the desired behaviour of the user, who may wish instead > > > > for the operation to be 'best effort'. By setting this flag, that behaviour > > > > is obtained. > > > > > > > > Since process_madvise() would trivially, if skipping errors, simply return > > > > the input vector size, we instead return the number of entries in the > > > > vector which completed successfully without error. > > > > > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > > > > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > > > > > > > Currently madvise() has the peculiar behaviour of, if the range specified > > > > to it contains unmapped range(s), completing the full operation, but > > > > ultimately returning -ENOMEM. > > > > > > > > In the case of process_madvise(), this is fatal, as the operation will stop > > > > immediately upon this occurring. > > > > > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > > > > unmapped areas to simply be entirely ignored. > > > > > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why > > > PMADV_SKIP_ERRORS is not enough? I don't see a need for > > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where > > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS? > > > > I thought I already explained this above: > > > > "In the case of process_madvise(), this is fatal, as the operation > > will stop immediately upon this occurring." > > > > This is somewhat bizarre behaviour. You specify multiple vector entries > > spanning different ranges, but one spans some unmapped space and now the > > whole process_madvise() operation grinds to a halt, except the vector entry > > containing ranges including unmapped space is completed. > > > > This is strange behaviour, and it makes sense to me to optionally disable > > this. > > > > If you were looping around doing an madvise(), this would _not_ occur, you > > could filter out the -ENOMEM's. It's a really weird peculiarity in > > process_madvise(). > > > > Moreover, you might not want an error reported, that possibly duplicates > > _real_ -ENOMEM errors, when you simply encounter unmapped addresses. > > > > Finally, if you perform an operation across the entire address space as > > proposed you may wish to stop on actual error but not on the (inevitable at > > least in 64-bit space) gaps you'll encounter. > > So, we *may* wish to stop on actual error, do you have a more concrete > example? We should not add an API on a case which may be needed. We can > always add stuff later when the actual concrete use-cases come up. I feel like I just gave a concrete example? It's useful to not have to be absolutely sure that the range specified includes no unmapped ranges. It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically applying the operation to the entire address space. But I think it might make this a lot more concrete when I write tests - as these will 'concretise' the interface and provide examples. We can also choose to 'hide' this from users and add it back in as you say. Perhaps worth just waiting for a respin with tests to see this in action. > > > > > > > > > > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > > > address space default to having an madvise() behaviour established by > > > > default, in such a fashion as that this persists across fork/exec. > > > > > > > > Since this is a very powerful option that would make no sense for many > > > > advice modes, we explicitly only permit known-safe flags here (currently > > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > > > > > Other flags seems general enough but this one is just weird. This is > > > exactly the scenario for prctl() like interface. You are trying to make > > > process_madvise() like prctl() and I can see process_madvise() would be > > > included in all the hate that prctl() receives. > > > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not > > sure what a 'prctl() like interface' means here, and you're not explaining > > what you mean by that. > > I meant it applies a property at the task or process level and has > examples where those properties are inherited to children. But de-facto several prctl() interfaces do not do this, and mm interfaces like mlockall() for instance do do this? _In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot. > > > > > Presumably you mean you find this odd as you feel it sits outside the realm > > of madvise() behaviour. > > The persistence across exec seems weird. OK I'm not quite sure how to quantify 'weird'? As I argue below, the idea here is we're doing 'madvise by default'. So you can either have prctl() invoke madvise() for some stuff, and then establish some 'madvise by default' logic, or we can do it the other way, by doing _as much as possible_ madvise() stuff in madvise, and add the default-across-exec there as a highly controlled, very clear flag. I continue to believe the latter is cleaner, more maintainable, and less subject to bitrot. And I would argue invoking madvise() from prctl() is similarly odd (though pretty much everything that happens in prctl() is, by de-facto definition, sort of odd :) > > > > > But I'd suggest it does not - the idea is to align _everything_ with > > madvise(). Rather than adding an entirely arbitrary function in prctl(), we > > are going the other way - keeping everything relating to madvise()-like > > modification of memory _in mm_ and _in madvise()_, rather than bitrotting > > away in kernel/sys.c. > > The above para seems like you are talking about code which can be moved > to mm. > > > > > So we get alignment in the fact that we're saying 'we establish a _default_ > > madvise flag for a process'. > > I think this is an important point. So, we want to introduce a way to > set a process level property which can be inherited through fork/exec. > With that in mind, is process_madvise() (or even madvise()) really a > good interface for it? There is no need for address ranges for such > use-case. > > > > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what > > you guys at meta want while also opening the door to doing so in future if > > it makes sense to. > > Please drop the "you guys at meta". We should be aiming for what is good > for all (or most) linux users. Whatever is done here will be > incorporated in systemd which will be used very widely. ...! I've spent several hours doing review of Usama's series and proposing this idea precisely to serve the community. I would ask you to please respect that. The point I'm making here re: you guys, is that we can both serve the community and solve your problem - because aiming at both is _the only way this change will get merged_. I am doing my absolute best to try to reach this end. Re: systemd, I'm not sure what you mean - has there been an indication that they will using this? I'm not sure they make use of every prctl() interface do they? > > > > > This couldn't be more different than putting some arbitrary code relating > > to mm in the 'netherrealm' of prctl(). > > > > > > > > > > Let me ask in a different way. Eventually we want to be in a state where > > > hugepages works out of the box for all workloads. In that state what > > > would the need for this flag unless you have use-cases other than > > > hugepages. To me, there is a general consensus that prctl is a hacky > > > interface, so having some intermediate solution through prctl until > > > hugepages are good out of the box seems more reasonable. > > > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will > > have to be supported. In a future where things are automatic, these may be > > no-ops in 'auto' mode. > > > > But the requirement to have these flags will still exist, the requirement > > to do madvise() operations will still exist, we're simply expanding this > > functionality. > > > > The problem arises the other way around when we shovel mm stuff in > > kernel/sys.c. > > I think you mixing the location of the code and the interface which will > remain relevant long term. I don't see process_madvise (or madvise) good > interface for this specific functionality (i.e. process level property > that gets inherited through fork/exec). Now we can add a new syscall for > this but to me, particularly for hugepage, this functionality is needed > temporarily until hugepages are good out of the box. However if there is > a use-case other than hugepages then yes, why not a new syscall. > I disagree on several levels. Firstly of course 'process' is a vague term here, you mean address space, rather mm_struct. And really you're saying 'it's ok to associate mm_strut and madvise specific stuff with prctl() but not ok to associate mm_struct stuff with madvise code' which not quite compelling to me. Overall you can view this approach as - 'we are making an madvise() flag a default, optionally'. And this justifies us therefore doing this via an madvise() API call. It also further allows us to expand the capabilities of madvise() for free - we can address long-standing issues around what is possible with these system calls while also providing this interface. The idea is it's win/win. Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level feature! What's that doing in prctl() :) - where we have a bunch of mm code living over there, and we have to export mm-internal stuff to kernel/sys.c and it's a mess. As to the temporary nature of this stuff, you seem to have disregarded what I've said here rather in relation to the persistence of the MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must remain, and thus I don't find the argument compelling re: prctl() being better suited. It seems more likely things will bitrot there?
On Tue, May 20, 2025 at 09:39:37PM +0100, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote: > > On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote: > > > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote: > > > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote: > > > > > REVIEWERS NOTES: > > > > > ================ > > > > > > > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > > > > 'putting it out there' for feedback. Any serious version of this will add a > > > > > bunch of self-tests to assert correct behaviour and I will more carefully > > > > > confirm everything's working. > > > > > > > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > > > > the thread around process_madvise() behaviour [1] (and a subsequent > > > > > response by me [2]) and prior discussion about a new madvise() interface > > > > > [3]. > > > > > > > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > > > > > > > ================ > > > > > > > > > > Currently, we are rather restricted in how madvise() operations > > > > > proceed. While effort has been put in to expanding what process_madvise() > > > > > can do (that is - unrestricted application of advice to the local process > > > > > alongside recent improvements on the efficiency of TLB operations over > > > > > these batvches), we are still constrained by existing madvise() limitations > > > > > and default behaviours. > > > > > > > > > > This series makes use of the currently unused flags field in > > > > > process_madvise() to provide more flexiblity. > > > > > > > > > > It introduces four flags: > > > > > > > > > > 1. PMADV_SKIP_ERRORS > > > > > > > > > > Currently, when an error arises applying advice in any individual VMA > > > > > (keeping in mind that a range specified to madvise() or as part of the > > > > > iovec passed to process_madvise()), the operation stops where it is and > > > > > returns an error. > > > > > > > > > > This might not be the desired behaviour of the user, who may wish instead > > > > > for the operation to be 'best effort'. By setting this flag, that behaviour > > > > > is obtained. > > > > > > > > > > Since process_madvise() would trivially, if skipping errors, simply return > > > > > the input vector size, we instead return the number of entries in the > > > > > vector which completed successfully without error. > > > > > > > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > > > > > > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > > > > > > > > > Currently madvise() has the peculiar behaviour of, if the range specified > > > > > to it contains unmapped range(s), completing the full operation, but > > > > > ultimately returning -ENOMEM. > > > > > > > > > > In the case of process_madvise(), this is fatal, as the operation will stop > > > > > immediately upon this occurring. > > > > > > > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > > > > > unmapped areas to simply be entirely ignored. > > > > > > > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why > > > > PMADV_SKIP_ERRORS is not enough? I don't see a need for > > > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where > > > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS? > > > > > > I thought I already explained this above: > > > > > > "In the case of process_madvise(), this is fatal, as the operation > > > will stop immediately upon this occurring." > > > > > > This is somewhat bizarre behaviour. You specify multiple vector entries > > > spanning different ranges, but one spans some unmapped space and now the > > > whole process_madvise() operation grinds to a halt, except the vector entry > > > containing ranges including unmapped space is completed. > > > > > > This is strange behaviour, and it makes sense to me to optionally disable > > > this. > > > > > > If you were looping around doing an madvise(), this would _not_ occur, you > > > could filter out the -ENOMEM's. It's a really weird peculiarity in > > > process_madvise(). > > > > > > Moreover, you might not want an error reported, that possibly duplicates > > > _real_ -ENOMEM errors, when you simply encounter unmapped addresses. > > > > > > Finally, if you perform an operation across the entire address space as > > > proposed you may wish to stop on actual error but not on the (inevitable at > > > least in 64-bit space) gaps you'll encounter. > > > > So, we *may* wish to stop on actual error, do you have a more concrete > > example? We should not add an API on a case which may be needed. We can > > always add stuff later when the actual concrete use-cases come up. > > I feel like I just gave a concrete example? > > It's useful to not have to be absolutely sure that the range specified > includes no unmapped ranges. > > It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically > applying the operation to the entire address space. > > But I think it might make this a lot more concrete when I write tests - as > these will 'concretise' the interface and provide examples. > > We can also choose to 'hide' this from users and add it back in as you say. > > Perhaps worth just waiting for a respin with tests to see this in action. Yes, that makes sense. > > > > > > > > > > > > > > > > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > > > > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > > > > address space default to having an madvise() behaviour established by > > > > > default, in such a fashion as that this persists across fork/exec. > > > > > > > > > > Since this is a very powerful option that would make no sense for many > > > > > advice modes, we explicitly only permit known-safe flags here (currently > > > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > > > > > > > Other flags seems general enough but this one is just weird. This is > > > > exactly the scenario for prctl() like interface. You are trying to make > > > > process_madvise() like prctl() and I can see process_madvise() would be > > > > included in all the hate that prctl() receives. > > > > > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not > > > sure what a 'prctl() like interface' means here, and you're not explaining > > > what you mean by that. > > > > I meant it applies a property at the task or process level and has > > examples where those properties are inherited to children. > > But de-facto several prctl() interfaces do not do this, and mm interfaces > like mlockall() for instance do do this? > > _In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot. > I don't disagree. > > > > > > > > Presumably you mean you find this odd as you feel it sits outside the realm > > > of madvise() behaviour. > > > > The persistence across exec seems weird. > > OK I'm not quite sure how to quantify 'weird'? > > As I argue below, the idea here is we're doing 'madvise by default'. So you > can either have prctl() invoke madvise() for some stuff, and then establish > some 'madvise by default' logic, or we can do it the other way, by doing > _as much as possible_ madvise() stuff in madvise, and add the > default-across-exec there as a highly controlled, very clear flag. > > I continue to believe the latter is cleaner, more maintainable, and less > subject to bitrot. > > And I would argue invoking madvise() from prctl() is similarly odd (though > pretty much everything that happens in prctl() is, by de-facto definition, > sort of odd :) > Yes, prctl() is weird as well but see my point at the end. > > > > > > > > But I'd suggest it does not - the idea is to align _everything_ with > > > madvise(). Rather than adding an entirely arbitrary function in prctl(), we > > > are going the other way - keeping everything relating to madvise()-like > > > modification of memory _in mm_ and _in madvise()_, rather than bitrotting > > > away in kernel/sys.c. > > > > The above para seems like you are talking about code which can be moved > > to mm. > > > > > > > > So we get alignment in the fact that we're saying 'we establish a _default_ > > > madvise flag for a process'. > > > > I think this is an important point. So, we want to introduce a way to > > set a process level property which can be inherited through fork/exec. > > With that in mind, is process_madvise() (or even madvise()) really a > > good interface for it? There is no need for address ranges for such > > use-case. > > > > > > > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what > > > you guys at meta want while also opening the door to doing so in future if > > > it makes sense to. > > > > Please drop the "you guys at meta". We should be aiming for what is good > > for all (or most) linux users. Whatever is done here will be > > incorporated in systemd which will be used very widely. > > ...! > > I've spent several hours doing review of Usama's series and proposing this > idea precisely to serve the community. I would ask you to please respect > that. > > The point I'm making here re: you guys, is that we can both serve the > community and solve your problem - because aiming at both is _the only way > this change will get merged_. > > I am doing my absolute best to try to reach this end. Thanks for doing that. My point was having a robust long term solution is preferrable over any specific company's preferred solution. So, it is ok to push back if you feel that we are pushing a solution which will benefit only a single or small set of users. > > Re: systemd, I'm not sure what you mean - has there been an indication that > they will using this? I'm not sure they make use of every prctl() interface > do they? We plan to add this in systemd which already has similar support for ksm. > > > > > > > > > This couldn't be more different than putting some arbitrary code relating > > > to mm in the 'netherrealm' of prctl(). > > > > > > > > > > > > > > Let me ask in a different way. Eventually we want to be in a state where > > > > hugepages works out of the box for all workloads. In that state what > > > > would the need for this flag unless you have use-cases other than > > > > hugepages. To me, there is a general consensus that prctl is a hacky > > > > interface, so having some intermediate solution through prctl until > > > > hugepages are good out of the box seems more reasonable. > > > > > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will > > > have to be supported. In a future where things are automatic, these may be > > > no-ops in 'auto' mode. > > > > > > But the requirement to have these flags will still exist, the requirement > > > to do madvise() operations will still exist, we're simply expanding this > > > functionality. > > > > > > The problem arises the other way around when we shovel mm stuff in > > > kernel/sys.c. > > > > I think you mixing the location of the code and the interface which will > > remain relevant long term. I don't see process_madvise (or madvise) good > > interface for this specific functionality (i.e. process level property > > that gets inherited through fork/exec). Now we can add a new syscall for > > this but to me, particularly for hugepage, this functionality is needed > > temporarily until hugepages are good out of the box. However if there is > > a use-case other than hugepages then yes, why not a new syscall. > > > > I disagree on several levels. Firstly of course 'process' is a vague term > here, you mean address space, rather mm_struct. > > And really you're saying 'it's ok to associate mm_strut and madvise > specific stuff with prctl() but not ok to associate mm_struct stuff with > madvise code' which not quite compelling to me. > > Overall you can view this approach as - 'we are making an madvise() flag a > default, optionally'. > > And this justifies us therefore doing this via an madvise() API call. > > It also further allows us to expand the capabilities of madvise() for free > - we can address long-standing issues around what is possible with these > system calls while also providing this interface. The idea is it's win/win. > > Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the > same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level > feature! What's that doing in prctl() :) - where we have a bunch of mm code > living over there, and we have to export mm-internal stuff to kernel/sys.c > and it's a mess. > > As to the temporary nature of this stuff, you seem to have disregarded what > I've said here rather in relation to the persistence of the > MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must > remain, and thus I don't find the argument compelling re: prctl() being > better suited. It seems more likely things will bitrot there? I think we are talking past each other (and I blame myself for that). Let me try again. (Please keep aside prctl or process_madvise). We need a way to change the policy of a process (job) and at the moment we are aligned that the job loader (like systemd) will set that policy and load the target (fork/exec), so the policy persist across fork/exec. (If someone has a better way to set the policy without race, please let us know). My argument is that process_madvise() is not a good interface to set that policy because of its address range like interface. So, if not process_madvise() then what? Should we add a new syscall? (BTW we had very similar discussion on process_madvise(DONTNEED) on a remote process vs a new syscall i.e. process_mrelease()). Adding a new syscall requires that it should be generally useful and hopefully have more use-cases. Now going back to the current specific use-case where we want to override the hugepage related policy of a job, do we expect to use this override forever? I believe this is temporary because the only reason we need this is because hugepages are not yet ready for prime time (many applications do not handle them well). In future when hugepages will be awesome, do we still need this "override the hugepage policy" syscall? Now if we can show that this specific functionality is useful more than hugepages then I think new syscall seems like the best way forward. However if we think this functionality is only needed temporarily then shoving it in prctl() seems reasonable to me. If we really don't want prctl() based solution, I would recommend to discuss the new syscall approach and see if we can comeup with a more general solution.
On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: [snip for clarity] > I think we are talking past each other (and I blame myself for that). Let > me try again. (Please keep aside prctl or process_madvise). We need a > way to change the policy of a process (job) and at the moment we are > aligned that the job loader (like systemd) will set that policy and load > the target (fork/exec), so the policy persist across fork/exec. (If > someone has a better way to set the policy without race, please let us > know). Ack, totally agree the kernel currently lacks a cohesive story for this 'adjust POLICY of a process and descendents', we have cgroups, but they're more general than a process, we have namespaces, but that's for restricting resources... I think we all want the same thing here, ultimately. > > My argument is that process_madvise() is not a good interface to set > that policy because of its address range like interface. So, if not > process_madvise() then what? Should we add a new syscall? (BTW we had > very similar discussion on process_madvise(DONTNEED) on a remote process > vs a new syscall i.e. process_mrelease()). Sure, and generally both proposed interfaces are at least _awkward_, for me prctl() is a no-go unless we have no other choice, I won't go over my objections to it yet again (and Liam has also raised his of course). > > Adding a new syscall requires that it should be generally useful and > hopefully have more use-cases. Now going back to the current specific > use-case where we want to override the hugepage related policy of a job, > do we expect to use this override forever? I believe this is temporary > because the only reason we need this is because hugepages are not yet > ready for prime time (many applications do not handle them well). In > future when hugepages will be awesome, do we still need this "override > the hugepage policy" syscall? As argued previously, I am not so sure it'll be temporary, given the proposed future 'auto' mode will be a _mode_ and we will need to support VM_[NO]HUGEPAGE scenarios forever (deep, deep sigh). Also if you add it into systemd it definitely won't be right? There's no 'throwaway' here, and scouring through prctl() (what is actually documented :), I am not sure anything ever is, frankly. So the idea is to try to make this as generic as possible and to have it sit with code it makes sense to sit with. > > Now if we can show that this specific functionality is useful more than > hugepages then I think new syscall seems like the best way forward. > However if we think this functionality is only needed temporarily then > shoving it in prctl() seems reasonable to me. If we really don't want > prctl() based solution, I would recommend to discuss the new syscall > approach and see if we can comeup with a more general solution. > So, something Liam mentioned off-list was the beautifully named 'mmadvise()'. Idea being that we have a system call _explicitly for_ mm-wide modifications. With Barry's series doing a prctl() for something similar, and a whole host of mm->flags existing for modifying behaviour, it would seem a natural fit. I could do a respin that does something like this instead. What's a pity to me re: going away from process_madvise() is losing the opportunity to be able to modify the, frankly broken, gaps handling and also being able to do 'best effort' madvise ranges. But I suppose those can always be separate series... :) I guess let me work that up so we can see how that looks? Cheers, Lorenzo
On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > So, something Liam mentioned off-list was the beautifully named > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > mm-wide modifications. > > With Barry's series doing a prctl() for something similar, and a whole host > of mm->flags existing for modifying behaviour, it would seem a natural fit. That's an interesting idea. So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it might also be a good fit for the coredump stuff and ksm if we wanted to incorporate them into that (although it would duplicate the existing proc/prctl knobs). The other MMF_s are internal AFAICS. I think my main concern would be making something very generic and versatile without having sufficiently broad/popular usecases for it. But no strong feelings either way. Like I said, I don't have a strong dislike for prctl(), but this idea would obviously be cleaner if we think there is enough of a demand for a new syscall. > I guess let me work that up so we can see how that looks? I think it's worth exploring!
On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote: > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > > So, something Liam mentioned off-list was the beautifully named > > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > > mm-wide modifications. > > > > With Barry's series doing a prctl() for something similar, and a whole host > > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > That's an interesting idea. > > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it > might also be a good fit for the coredump stuff and ksm if we wanted > to incorporate them into that (although it would duplicate the > existing proc/prctl knobs). The other MMF_s are internal AFAICS. > > I think my main concern would be making something very generic and > versatile without having sufficiently broad/popular usecases for it. > > But no strong feelings either way. Like I said, I don't have a strong > dislike for prctl(), but this idea would obviously be cleaner if we > think there is enough of a demand for a new syscall. To me it seems like having a "global mm control" system call makes much more sense that adding more arms to prctl or overloading process_madvise(). With a dedicated syscall it's much clearer that the operation targets an mm and it works for the entire mm. And two usescase seem enough to me to justify a new syscall. And it's easier to reason about a dedicated syscall designed for a certain operation that for multiplexed ioctl() style controls. > > I guess let me work that up so we can see how that looks? > > I think it's worth exploring! > -- Sincerely yours, Mike.
On Thu, May 22, 2025 at 06:32:11PM +0300, Mike Rapoport wrote: > On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote: > > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > > > So, something Liam mentioned off-list was the beautifully named > > > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > > > mm-wide modifications. > > > > > > With Barry's series doing a prctl() for something similar, and a whole host > > > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > > > That's an interesting idea. > > > > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it > > might also be a good fit for the coredump stuff and ksm if we wanted > > to incorporate them into that (although it would duplicate the > > existing proc/prctl knobs). The other MMF_s are internal AFAICS. > > > > I think my main concern would be making something very generic and > > versatile without having sufficiently broad/popular usecases for it. > > > > But no strong feelings either way. Like I said, I don't have a strong > > dislike for prctl(), but this idea would obviously be cleaner if we > > think there is enough of a demand for a new syscall. > > To me it seems like having a "global mm control" system call makes much > more sense that adding more arms to prctl or overloading process_madvise(). Agreed yeah. > > With a dedicated syscall it's much clearer that the operation targets an mm > and it works for the entire mm. > And two usescase seem enough to me to justify a new syscall. Yes I think so! > > And it's easier to reason about a dedicated syscall designed for a certain > operation that for multiplexed ioctl() style controls. Yes! > > > > I guess let me work that up so we can see how that looks? > > > > I think it's worth exploring! > > > > -- > Sincerely yours, > Mike. Thanks, I will distribute a proposed mctl() prototype API to people here (+ cc you + linux-api also) so we can use that as a basis to assess this approach. Thanks!
On 21.05.25 19:32, Johannes Weiner wrote: > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: >> So, something Liam mentioned off-list was the beautifully named >> 'mmadvise()'. Idea being that we have a system call _explicitly for_ >> mm-wide modifications. As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like that might be better. ... or anything else that doesn't (ab)use the "advise" terminology in an interface that will not only consume advises. >> >> With Barry's series doing a prctl() for something similar, and a whole host >> of mm->flags existing for modifying behaviour, it would seem a natural fit. > > That's an interesting idea. > > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it > might also be a good fit for the coredump stuff and ksm if we wanted > to incorporate them into that (although it would duplicate the > existing proc/prctl knobs). The other MMF_s are internal AFAICS. > > I think my main concern would be making something very generic and > versatile without having sufficiently broad/popular usecases for it. > > But no strong feelings either way. Like I said, I don't have a strong > dislike for prctl(), but this idea would obviously be cleaner if we > think there is enough of a demand for a new syscall. Same here. I am not 100% sure process_madvise() is really the right thing to extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also not a big fan of prctl() ... -- Cheers, David / dhildenb
On Thu, May 22, 2025 at 02:45:30PM +0200, David Hildenbrand wrote: > On 21.05.25 19:32, Johannes Weiner wrote: > > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > > > So, something Liam mentioned off-list was the beautifully named > > > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > > > mm-wide modifications. > > As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like that > might be better. > > ... or anything else that doesn't (ab)use the "advise" terminology in an > interface that will not only consume advises. Ack, as per my other reply, will work up some pseudocode/API exploration (not yet code) and mail it round to participants here so we can explore this idea. > > > > > > > With Barry's series doing a prctl() for something similar, and a whole host > > > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > > > That's an interesting idea. > > > > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it > > might also be a good fit for the coredump stuff and ksm if we wanted > > to incorporate them into that (although it would duplicate the > > existing proc/prctl knobs). The other MMF_s are internal AFAICS. > > > > I think my main concern would be making something very generic and > > versatile without having sufficiently broad/popular usecases for it. > > > > But no strong feelings either way. Like I said, I don't have a strong > > dislike for prctl(), but this idea would obviously be cleaner if we > > think there is enough of a demand for a new syscall. > > Same here. I am not 100% sure process_madvise() is really the right thing to > extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also not a big fan > of prctl() ... Yeah agreed on both actually! :) > > -- > Cheers, > > David / dhildenb >
On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote: > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > > So, something Liam mentioned off-list was the beautifully named > > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > > mm-wide modifications. > > > > With Barry's series doing a prctl() for something similar, and a whole host > > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > That's an interesting idea. Thanks! > > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it > might also be a good fit for the coredump stuff and ksm if we wanted > to incorporate them into that (although it would duplicate the > existing proc/prctl knobs). The other MMF_s are internal AFAICS. > > I think my main concern would be making something very generic and > versatile without having sufficiently broad/popular usecases for it. Ack, the main argument here is to keep mm stuff together without bitrot. The process_madvise() proposal advocated for the addition of flags that would be useful in other circumstances such as eliminating the broken behaviour around gaps etc which fulfilled this requirement naturally. However it's a fair point that the fork/exec mm-scope stuff is awkwardly placed (but equally so in prctl()). It's an absolutely fair point as to specificity - but I would argue that it's a _general_ thing to want to have mm-level state changes, and while this might be specific _now_, in future the next specific thing can go here, and the next etc. Things that would each have been their own sort of special case in prctl() can now live somewhere maintained by mm people, using core mm code and avoiding bitrot. I realise I (ugh mea culpa) missed a fairly eventful THP meeting, I think David suggested 'mcontrol()' as a name for this? :) In any event absolutely love to hear input from there from anybody on that also! > > But no strong feelings either way. Like I said, I don't have a strong > dislike for prctl(), but this idea would obviously be cleaner if we > think there is enough of a demand for a new syscall. I won't belabour the point by repeating the arguments in this area :) generally I worry about seeing mm code proliferate in non-mm places. > > > I guess let me work that up so we can see how that looks? > > I think it's worth exploring! Thanks! If time permits and there isn't too much push back I"ll try spinning up an RFC. Cheers, Lorenzo
On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: > [...] > > So, something Liam mentioned off-list was the beautifully named > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > mm-wide modifications. > > With Barry's series doing a prctl() for something similar, and a whole host > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > I could do a respin that does something like this instead. > Please let's first get consensus on this before starting the work. Usama, David, Johannes and others, WDYT?
On 21/05/2025 17:28, Shakeel Butt wrote: > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: >> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: >> > [...] >> >> So, something Liam mentioned off-list was the beautifully named >> 'mmadvise()'. Idea being that we have a system call _explicitly for_ >> mm-wide modifications. >> >> With Barry's series doing a prctl() for something similar, and a whole host >> of mm->flags existing for modifying behaviour, it would seem a natural fit. >> >> I could do a respin that does something like this instead. >> > > Please let's first get consensus on this before starting the work. > > Usama, David, Johannes and others, WDYT? > I would like that. Introducing another method might make the conversation a lot more complex than it already is? I have addressed the feedback from Lorenzo for the prctl series, but am holding back sending it as RFC v4. The v3 has a NACK on it so I would imagine it would discourage people from reviewing it. If we are still progressing with sending patches, would it make sense for me to wait a couple of days to see if there are any more comments on it and send the RFC v4?
On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote: > > > On 21/05/2025 17:28, Shakeel Butt wrote: > > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > >> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: > >> > > [...] > >> > >> So, something Liam mentioned off-list was the beautifully named > >> 'mmadvise()'. Idea being that we have a system call _explicitly for_ > >> mm-wide modifications. > >> > >> With Barry's series doing a prctl() for something similar, and a whole host > >> of mm->flags existing for modifying behaviour, it would seem a natural fit. > >> > >> I could do a respin that does something like this instead. > >> > > > > Please let's first get consensus on this before starting the work. > > > > Usama, David, Johannes and others, WDYT? > > > > I would like that. Introducing another method might make the conversation a > lot more complex than it already is? The argument is about prctl() vs. another method. Another method was proposed, arguments were made that it didn't seem suitable, so an alternative was proposed. I'm really not sure what complexity this adds? And what better means of comparing approaches than actual code? Isn't an entirely theoretical discussion less helpful? This feels a little like dismissing my input (and I note, Liam's points remain unanswered), and I have to admit, that is a little upsetting. But I suppose one has to have a thick skin in the kernel. > > I have addressed the feedback from Lorenzo for the prctl series, but am > holding back sending it as RFC v4. > The v3 has a NACK on it so I would imagine it would discourage people > from reviewing it. If we are still progressing with sending patches, would it > make sense for me to wait a couple of days to see if there are any more comments > on it and send the RFC v4? I've no problem with you respinning RFC"s, as I've said more than once. The NACK has been explained to you, it's regrettable, but necessary I feel when explicit agreed-upon review has not been actioned (obviously I realise it was a mistake - but this doesn't make it less important to be clear like this). As to stopping and having a conversation about which way forward makes most sense - I feel like that's what I've been trying to do the whole time? I would like to think my input is of value, it is a pity that it seems that it apparently is not. I am obviously happy to hear other people's input. This is what I've been working very hard to try to establish, partly by providing _actual code_ so we can see how things compare. It seems generally people don't have much of a strong opinion about the interface, other than Liam (forgive me if I am misinterpreting you Liam and please correct if so) and myself very strongly disliking prctl() for numerous aforementioned reasons. I would suggest that in that case, and since we maintain madvise() interfaces, it would make sense for us therefore to pursue an alternative approach. But for the absolute best means of determining a way forward I'd suggest an RFC is best. Thanks.
On 21/05/2025 18:39, Lorenzo Stoakes wrote: > On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote: >> >> >> On 21/05/2025 17:28, Shakeel Butt wrote: >>> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: >>>> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: >>>> >>> [...] >>>> >>>> So, something Liam mentioned off-list was the beautifully named >>>> 'mmadvise()'. Idea being that we have a system call _explicitly for_ >>>> mm-wide modifications. >>>> >>>> With Barry's series doing a prctl() for something similar, and a whole host >>>> of mm->flags existing for modifying behaviour, it would seem a natural fit. >>>> >>>> I could do a respin that does something like this instead. >>>> >>> >>> Please let's first get consensus on this before starting the work. >>> >>> Usama, David, Johannes and others, WDYT? >>> >> >> I would like that. Introducing another method might make the conversation a >> lot more complex than it already is? > > The argument is about prctl() vs. another method. > So there are a few things that have been on the mailing list prctl, process_madvise, bpf, cgroup based (although that was shot down). I meant adding another one to all of them. But please see below. > Another method was proposed, arguments were made that it didn't seem > suitable, so an alternative was proposed. > > I'm really not sure what complexity this adds? > > And what better means of comparing approaches than actual code? Isn't an > entirely theoretical discussion less helpful? Sounds good! > This feels a little like dismissing my input (and I note, Liam's points > remain unanswered), and I have to admit, that is a little upsetting. > The current prctl version is a lot lot better only because of your input. The previous version had flags2, MMF flags, was breaking vm merge. All of it was fixed only because of your valuable input. So I have never dismissed your input, and try my best to reply quickly to your and everyone elses reviews. We disagree on the best way forward (whether prctl is appropriate) and I think disagreeing is definitely part of good engineering. Please always assume I am acting in good faith. As to not replying to Liams review, if its https://lore.kernel.org/all/ass5hz6l26jc6xhbtybmgdjf55hmb3v4vvhrxhqampv6ohl67u@qi6iacwzwfk5/ I had already done it several hours ago. If there was a email bashing prctl on how it is a bitrot and where everything goes to die, I really dont know how to reply to that anymore. Otherwise I might have missed it. I really hope one thing is clear from our interactions over the past week, that I always try and address feedback. And for the upsetting part, from all the WTF reactions on my v2, and all the rest of discussions we have had, I feel you have always been upset with whatever I have sent (patches/replies), which I really really don't understand. We will have disagreements, its part of the dev process. The current version wasnt going to get any acks, and was not going to get merged in any form, so I really dont get it. Again, please always assume I am acting in good faith. > But I suppose one has to have a thick skin in the kernel. > >> >> I have addressed the feedback from Lorenzo for the prctl series, but am >> holding back sending it as RFC v4. >> The v3 has a NACK on it so I would imagine it would discourage people >> from reviewing it. If we are still progressing with sending patches, would it >> make sense for me to wait a couple of days to see if there are any more comments >> on it and send the RFC v4? > > I've no problem with you respinning RFC"s, as I've said more than once. The > NACK has been explained to you, it's regrettable, but necessary I feel when > explicit agreed-upon review has not been actioned (obviously I realise it > was a mistake - but this doesn't make it less important to be clear like > this). > > As to stopping and having a conversation about which way forward makes most > sense - I feel like that's what I've been trying to do the whole time? I > would like to think my input is of value, it is a pity that it seems that > it apparently is not. > > I am obviously happy to hear other people's input. This is what I've been > working very hard to try to establish, partly by providing _actual code_ so > we can see how things compare. > > It seems generally people don't have much of a strong opinion about the > interface, other than Liam (forgive me if I am misinterpreting you Liam and > please correct if so) and myself very strongly disliking prctl() for > numerous aforementioned reasons. > > I would suggest that in that case, and since we maintain madvise() > interfaces, it would make sense for us therefore to pursue an alternative > approach. > > But for the absolute best means of determining a way forward I'd suggest an > RFC is best. Sounds good to me. > > Thanks.
I feel this will get circular. So let's not get lost in the weeds here. Let's see what others think, and if not too much push-back I'll put out another RFC for the mcontrol() concept and we can compare to your RFC and use that to reach consensus if that works for you? Thanks, Lorenzo
On 21/05/2025 19:40, Lorenzo Stoakes wrote: > I feel this will get circular. So let's not get lost in the weeds here. > > Let's see what others think, and if not too much push-back I'll put out > another RFC for the mcontrol() concept and we can compare to your RFC and > use that to reach consensus if that works for you? > > Thanks, Lorenzo Sure sounds good, Thanks.
On Wed, May 21, 2025 at 09:28:43AM -0700, Shakeel Butt wrote: > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote: > > On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote: > > > [...] > > > > So, something Liam mentioned off-list was the beautifully named > > 'mmadvise()'. Idea being that we have a system call _explicitly for_ > > mm-wide modifications. > > > > With Barry's series doing a prctl() for something similar, and a whole host > > of mm->flags existing for modifying behaviour, it would seem a natural fit. > > > > I could do a respin that does something like this instead. > > > > Please let's first get consensus on this before starting the work. With respect Shakeel, I'll work on whatever I want, whenever I want. These are RFC's, that is, Requests for Comment, intended to explore new ideas and to show how they might look in practice so we can make the right choice. I feel like I've said this more than once now...! > > Usama, David, Johannes and others, WDYT? > But obviously it would be good to get some input from others, more so from the actual maintainer (David) and reviewers though obviously everybody's opinion matters. Overall my view is prctl() is really something we should avoid here unless we have absolutely no other choice. I've gone into detail as to why already so won't belabour the point. Thanks.
On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote: [...] > > > > Please let's first get consensus on this before starting the work. > > With respect Shakeel, I'll work on whatever I want, whenever I want. I fail to understand why you would respond like that.
On 21.05.25 19:39, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> [...]
>>>
>>> Please let's first get consensus on this before starting the work.
>>
>> With respect Shakeel, I'll work on whatever I want, whenever I want.
>
> I fail to understand why you would respond like that.
Relax guys ... :) Really nothing to be fighting about.
Lorenzo has a lot of energy to play with things, to see how it would
look. I wish I would have that much energy, but I have no idea where it
went ... (well, okay, I have a suspicion) :P
At the same time, I hope (and assume :) ) that Lorenzo will get Usama
involved in the development once we know what we want.
To summarize my current view:
1) ebpf: most people are are not a fan of that, and I agree, at least
for this purpose. If we were talking about making better *placement*
decisions using epbf, it would be a different story.
2) prctl(): the unloved child, and I can understand why. Maybe now is
the right time to stop adding new MM things that feel weird in there.
Maybe we should already have done that with the KSM toggle (guess who
was involved in that ;) ).
3) process_madvise(): I think it's an interesting extension, but
probably we should just have something that applies to the whole
address space naturally. At least my take for now.
4) new syscall: worth exploring how it would look. I'm especially
interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
make them only apply to selected controls.
An API prototype of 4), not necessarily with the code yet, might be
valuable.
In general, the "always/madvise/never" policies are really horrible. We
should instead be prioritizing who gets THPs -- and only disable them
for selected workloads.
Because splitting THPs up because a process is not allowed to use them,
thereby increasing memory fragmentation, is really absolutely suboptimal.
But we don't have anything better right now.
So I would hope that we can at least turn the "always/VM_HUGEPAGE" into
a "prioritize for largest (m)THPs possible" in a distant future.
If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize
for THPs" couldn't break userfaultfd ... :( But maybe that can be worked
around in the future somehow (e.g., when we detect userfaultfd usage,
not sure, ...).
--
Cheers,
David / dhildenb
On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote: > On 21.05.25 19:39, Shakeel Butt wrote: > > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote: > > [...] > > > > > > > > Please let's first get consensus on this before starting the work. > > > > > > With respect Shakeel, I'll work on whatever I want, whenever I want. > > > > I fail to understand why you would respond like that. > > Relax guys ... :) Really nothing to be fighting about. Agreed. [...] > > > To summarize my current view: > > 1) ebpf: most people are are not a fan of that, and I agree, at least > for this purpose. If we were talking about making better *placement* > decisions using epbf, it would be a different story. From placement decisions, do you mean placement between memory tiers/nodes or something else? > > 2) prctl(): the unloved child, and I can understand why. Maybe now is > the right time to stop adding new MM things that feel weird in there. > Maybe we should already have done that with the KSM toggle (guess who > was involved in that ;) ). At the moment systemd is the user I know of and I think it would very easy to migrate it to whatever new thing we decide here. > > 3) process_madvise(): I think it's an interesting extension, but > probably we should just have something that applies to the whole > address space naturally. At least my take for now. > > 4) new syscall: worth exploring how it would look. I'm especially > interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could > make them only apply to selected controls. Were there any previous discussion on SET_DEFAULT_EXEC? First time I am hearing about it. Overall I agree with your assessment and thus I was requesting to at least discuss the new syscall option as well.
>> >> To summarize my current view: >> >> 1) ebpf: most people are are not a fan of that, and I agree, at least >> for this purpose. If we were talking about making better *placement* >> decisions using epbf, it would be a different story. > > From placement decisions, do you mean placement between memory > tiers/nodes or something else? More like: which size to place, but it could be extended to other policies, maybe. Assume we have a page fault and have to decide which size to place. For a process that we really want to use THPs (VM_HUEPAGE?), we could use the largest free folio possible. For a process that we don't want to spend valuable THPs on (VM_HUEPAGE not set?), we could use the smallest free folio possible. Such a possibly might be encoded in an ebpf program I assume. The hints (prioritize regions/processes, deprioritize regions/processes), such as VM_HUGEPAGE, inputs into such a program. > >> >> 2) prctl(): the unloved child, and I can understand why. Maybe now is >> the right time to stop adding new MM things that feel weird in there. >> Maybe we should already have done that with the KSM toggle (guess who >> was involved in that ;) ). > > At the moment systemd is the user I know of and I think it would very > easy to migrate it to whatever new thing we decide here. Agreed. > >> >> 3) process_madvise(): I think it's an interesting extension, but >> probably we should just have something that applies to the whole >> address space naturally. At least my take for now. >> >> 4) new syscall: worth exploring how it would look. I'm especially >> interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could >> make them only apply to selected controls. > > Were there any previous discussion on SET_DEFAULT_EXEC? First time I am > hearing about it. I think it evolved in the discussion here from PMADV_SET_FORK_EXEC_DEFAULT. > > Overall I agree with your assessment and thus I was requesting to at > least discuss the new syscall option as well. Yes. I am still not sure if having a new "process" [1] mode would be a reasonable alternative to setting the VM_HUGEPAGE/VM_NOHUGEPAGE default. Assuming we would have a "process" mode, we could (a) set the policy per-process using the new syscall we discuss here, and options to (B) set the policy to use for the exec child and (c) maybe an option to seal the policy (depending on who is allowed to set the policy in the first place). On the + side, we don't lose hints/instructions from the app (VM_HUGEPAGE/VM_NOHUGEPAGE) when changing the policy on an already running process. The problem I see with the "process" policy is that people might want different "default" policies for processes, which means that we will have to add yet another toggle. How I hate THP toggles. :) [1] https://lore.kernel.org/all/CALOAHbB-KQ4+z-Lupv7RcxArfjX7qtWcrboMDdT4LdpoTXOMyw@mail.gmail.com/ -- Cheers, David / dhildenb
TL;DR - action item on below is I'll put together a proposed API (without code) and cc people here when I've done so, so we can take a look at how mctl() or mmadvise() or whatever we call it might look :) On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote: > On 21.05.25 19:39, Shakeel Butt wrote: > > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote: > > [...] > > > > > > > > Please let's first get consensus on this before starting the work. > > > > > > With respect Shakeel, I'll work on whatever I want, whenever I want. > > > > I fail to understand why you would respond like that. > > Relax guys ... :) Really nothing to be fighting about. Agreed...! > > Lorenzo has a lot of energy to play with things, to see how it would look. I > wish I would have that much energy, but I have no idea where it went ... > (well, okay, I have a suspicion) :P We have cats rather than kids which might explain a bit ;) > > At the same time, I hope (and assume :) ) that Lorenzo will get Usama > involved in the development once we know what we want. > > > To summarize my current view: > > 1) ebpf: most people are are not a fan of that, and I agree, at least > for this purpose. If we were talking about making better *placement* > decisions using epbf, it would be a different story. Yeah, I think overall we have a situation that is _bad_ in terms of interface. We need something more fine-grained, but it's chicken and egg, and there are genuine needs users have _now_. So the whole discussion is about this. > > 2) prctl(): the unloved child, and I can understand why. Maybe now is > the right time to stop adding new MM things that feel weird in there. > Maybe we should already have done that with the KSM toggle (guess who > was involved in that ;) ). I won't belabour this point, at this point I might get a reputation as prctl()'s biggest hater otherwise :P But one thing I will say is - systemd makes these things permanent (hey that KSM thing that breaks VMA merging is literally an option in systemd, wasn't aware :) > > 3) process_madvise(): I think it's an interesting extension, but > probably we should just have something that applies to the whole > address space naturally. At least my take for now. Yeah that's the point of view I've come to, I mean the point was to try to make this more generic in a way that _also_ got us improved control over madvise() - sort of win/win. But the 'default the process' thing is, as Shakeel and Liam rightly say, just really out of band or doesn't quite fit this interface. I may still put forward a patch to add flags for e.g. not breaking on gaps but as a separate thing I think, I still think that'd be valuable (but I'll provide solid at least self tests to make the point). > > 4) new syscall: worth exploring how it would look. I'm especially > interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could > make them only apply to selected controls. Yeah, this is exactly what I want to play with. > > > An API prototype of 4), not necessarily with the code yet, might be > valuable. ACK, though I really find it valuable to code things up because so often you figure out what works by trying to make it work in practice. This is how guard regions happened for instance, we had a ton of conversation like this, loads of back and forth, nobody quite knew, then I wrote some prototype code and it became apparent that this thing was doable. I never intend the RFC to be _the work_ rather it's a 'proof of concept' for discussion. However, as we're still fairly vague on the API bit, I think in this case it'll be valuable to do exactly what you suggest and simply prototype an API around this without code. So I'll do that and come up with something as a separate mail, cc'ing people here. > > In general, the "always/madvise/never" policies are really horrible. We > should instead be prioritizing who gets THPs -- and only disable them for > selected workloads. I couldn't agree more. > > Because splitting THPs up because a process is not allowed to use them, > thereby increasing memory fragmentation, is really absolutely suboptimal. Yes, there's a disconnect here between - a global resource (-ish :P) - and process requirements. > > But we don't have anything better right now. I feel like all this turmoil brings us closer to longer term solutions, if perhaps via pain-inspired development (a new programming philosophy I intend to trademark ;) > > So I would hope that we can at least turn the "always/VM_HUGEPAGE" into a > "prioritize for largest (m)THPs possible" in a distant future. I suspect we might still require some legacy settings so people don't panic. Aren't uAPIs fun? > > If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize for > THPs" couldn't break userfaultfd ... :( But maybe that can be worked around > in the future somehow (e.g., when we detect userfaultfd usage, not sure, > ...). I hate how uffd is implemented (I like the concept of what it provides though!) on multiple levels. It's crept into so much and the idea it's putting restrictions on core stuff is just horrid. I do feel though we may want to introduce something new for this though, as 'never' or 'no' suddenly not being no but 'deprioritise' could be pretty concerning for people. But on the other hand, this is a resource for the kernel to determine how to manage as it sees fit so, perhaps we shouldn't care... > > -- > Cheers, > > David / dhildenb > Thanks!
On 19.05.25 22:52, Lorenzo Stoakes wrote: > REVIEWERS NOTES: > ================ > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > 'putting it out there' for feedback. Any serious version of this will add a > bunch of self-tests to assert correct behaviour and I will more carefully > confirm everything's working. > > This is based on discussion arising from Usama's series [0], SJ's input on > the thread around process_madvise() behaviour [1] (and a subsequent > response by me [2]) and prior discussion about a new madvise() interface > [3]. > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > ================ > > Currently, we are rather restricted in how madvise() operations > proceed. While effort has been put in to expanding what process_madvise() > can do (that is - unrestricted application of advice to the local process > alongside recent improvements on the efficiency of TLB operations over > these batvches), we are still constrained by existing madvise() limitations > and default behaviours. > > This series makes use of the currently unused flags field in > process_madvise() to provide more flexiblity. > In general, sounds like an interesting approach. > It introduces four flags: > > 1. PMADV_SKIP_ERRORS > > Currently, when an error arises applying advice in any individual VMA > (keeping in mind that a range specified to madvise() or as part of the > iovec passed to process_madvise()), the operation stops where it is and > returns an error. > > This might not be the desired behaviour of the user, who may wish instead > for the operation to be 'best effort'. By setting this flag, that behaviour > is obtained. > > Since process_madvise() would trivially, if skipping errors, simply return > the input vector size, we instead return the number of entries in the > vector which completed successfully without error. I would focus only on adding flags that we absolutely need to make the use case we have in mind work. We can always add other flags as we see fit for them (IOW, when really required ;) ). Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required, right? > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > Currently madvise() has the peculiar behaviour of, if the range specified > to it contains unmapped range(s), completing the full operation, but > ultimately returning -ENOMEM. > > In the case of process_madvise(), this is fatal, as the operation will stop > immediately upon this occurring. > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > unmapped areas to simply be entirely ignored. Again, is this really required? Couldn't we glue that to PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect anybody to have something mapped into *the entire address space*. Well, okay, maybe on 32bit, but still ... :) > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > It may be desirable for a user to specify that all VMAs mapped in a process > address space default to having an madvise() behaviour established by > default, in such a fashion as that this persists across fork/exec. This is very specific for MADV_HUGEPAGE only, so I wonder how we could either avoid that flag or just make it clear that it shall stick around ... Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :( -- Cheers, David / dhildenb
On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: > On 19.05.25 22:52, Lorenzo Stoakes wrote: > > REVIEWERS NOTES: > > ================ > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm > > 'putting it out there' for feedback. Any serious version of this will add a > > bunch of self-tests to assert correct behaviour and I will more carefully > > confirm everything's working. > > > > This is based on discussion arising from Usama's series [0], SJ's input on > > the thread around process_madvise() behaviour [1] (and a subsequent > > response by me [2]) and prior discussion about a new madvise() interface > > [3]. > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > > > > ================ > > > > Currently, we are rather restricted in how madvise() operations > > proceed. While effort has been put in to expanding what process_madvise() > > can do (that is - unrestricted application of advice to the local process > > alongside recent improvements on the efficiency of TLB operations over > > these batvches), we are still constrained by existing madvise() limitations > > and default behaviours. > > > > This series makes use of the currently unused flags field in > > process_madvise() to provide more flexiblity. > > > > In general, sounds like an interesting approach. Thanks! If you agree this is workable, then I'll go ahead and put some more effort into writing tests etc. on the next respin. > > > It introduces four flags: > > > > 1. PMADV_SKIP_ERRORS > > > > Currently, when an error arises applying advice in any individual VMA > > (keeping in mind that a range specified to madvise() or as part of the > > iovec passed to process_madvise()), the operation stops where it is and > > returns an error. > > > > This might not be the desired behaviour of the user, who may wish instead > > for the operation to be 'best effort'. By setting this flag, that behaviour > > is obtained. > > > > Since process_madvise() would trivially, if skipping errors, simply return > > the input vector size, we instead return the number of entries in the > > vector which completed successfully without error. > > I would focus only on adding flags that we absolutely need to make the use > case we have in mind work. We can always add other flags as we see fit for > them (IOW, when really required ;) ). > > Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required, > right? Sure, we can restrict this to only supported flags to be conservative. The idea was though that somebody might want to simply do a 'best effort' change. However at the same time it's possibly a wee bit dangerous... > > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED. > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED > > > > Currently madvise() has the peculiar behaviour of, if the range specified > > to it contains unmapped range(s), completing the full operation, but > > ultimately returning -ENOMEM. > > > > In the case of process_madvise(), this is fatal, as the operation will stop > > immediately upon this occurring. > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes > > unmapped areas to simply be entirely ignored. > > Again, is this really required? Couldn't we glue that to > PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect > anybody to have something mapped into *the entire address space*. Well, I think it's an ongoing issue that unmapped entries cause the whole thing to break, I do think it makes sense to make this _generally_ available, actually. Obviously we should probably make PMADV_ENTIRE_MAPPING imply PMADV_NO_ERROR_ON_UNMAPPED in the same way that PMADV_SKIP_ERRORS implies PMADV_NO_ERROR_ON_UNMAPPED. And yes I don't think any sane person would map the entirety of the 64-bit address space :P > > Well, okay, maybe on 32bit, but still ... :) 32 what? :P I deny its existence... (ugh ok I guess I have to ack it, but even in that case it's not very likely either :) > > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > address space default to having an madvise() behaviour established by > > default, in such a fashion as that this persists across fork/exec. > > This is very specific for MADV_HUGEPAGE only, so I wonder how we could > either avoid that flag or just make it clear that it shall stick around ... > > Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :( Yeah it's horrid, see Pedro's suggestions, e.g. PMADV_SET_DEFAULT | PMADV_INHERIT_EXEC. I even wonder about PMADV_, but that's probably fine. PRMADV sucks, PADV sort of loses the mm bit, PMADV is probably best we can do! > > -- > Cheers, > > David / dhildenb >
On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: [snip] > > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > > address space default to having an madvise() behaviour established by > > > default, in such a fashion as that this persists across fork/exec. > > > > This is very specific for MADV_HUGEPAGE only, so I wonder how we could > > either avoid that flag or just make it clear that it shall stick around ... > > Sorry missed this bit. I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE) defaulting to doing this, I think that's far less clear than a user explicitly asking for it, plus most users using process_madvise() wouldn't expect it. So restricting this to these flags only and rejecting all others should cover this off fine I think. [snip]
On 20.05.25 20:25, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote: >> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: > [snip] >>>> 3. PMADV_SET_FORK_EXEC_DEFAULT >>>> >>>> It may be desirable for a user to specify that all VMAs mapped in a process >>>> address space default to having an madvise() behaviour established by >>>> default, in such a fashion as that this persists across fork/exec. >>> >>> This is very specific for MADV_HUGEPAGE only, so I wonder how we could >>> either avoid that flag or just make it clear that it shall stick around ... >>> > > Sorry missed this bit. > > I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE) > defaulting to doing this, I think that's far less clear than a user explicitly > asking for it, plus most users using process_madvise() wouldn't expect it. The thing is that PMADV_ENTIRE_ADDRESS_SPACE already adds something unexpected: suddenly also *new* mappings will be affected. But maybe that'd be covered by the PMADV_SET_DEFAULT idea. Maybe PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SET_DEFAULT | PMADV_SET_EXEC_DEFAULT ... hm, but "PMADV_ENTIRE_ADDRESS_SPACE" can then simply be "pass in the entire address space as a range" ... and ignore errors. I'd probably have to see a revised proposal after the current discussions. Anyhow, limiting the flags to a bare minimum for now usually makes sense, as long as it is expandable. -- Cheers, David / dhildenb
On 20/05/2025 18:47, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: >> On 19.05.25 22:52, Lorenzo Stoakes wrote: >>> REVIEWERS NOTES: >>> ================ >>> >>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm >>> 'putting it out there' for feedback. Any serious version of this will add a >>> bunch of self-tests to assert correct behaviour and I will more carefully >>> confirm everything's working. >>> >>> This is based on discussion arising from Usama's series [0], SJ's input on >>> the thread around process_madvise() behaviour [1] (and a subsequent >>> response by me [2]) and prior discussion about a new madvise() interface >>> [3]. >>> >>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ >>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ >>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ >>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ >>> >>> ================ >>> >>> Currently, we are rather restricted in how madvise() operations >>> proceed. While effort has been put in to expanding what process_madvise() >>> can do (that is - unrestricted application of advice to the local process >>> alongside recent improvements on the efficiency of TLB operations over >>> these batvches), we are still constrained by existing madvise() limitations >>> and default behaviours. >>> >>> This series makes use of the currently unused flags field in >>> process_madvise() to provide more flexiblity. >>> >> >> In general, sounds like an interesting approach. > > Thanks! > > If you agree this is workable, then I'll go ahead and put some more effort > into writing tests etc. on the next respin. > So the prctl and process_madvise patches both are trying to accomplish a similar end goal. Would it make sense to discuss what would be the best way forward before we continue developing the solutions? If we are not at that stage and a clear picture has not formed yet, happy to continue refining the solutions. But just thought I would check. I feel like changing process_madvise which was designed to work on an array of iovec structures to have flags to skip errors and ignore the iovec makes it function similar to a prctl call is not the right approach. IMHO, prctl is a more direct solution to this. I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise. But if in the end, we want to have a THP auto way which is truly transparent, would it not be better to just have this as prctl and not change the madvise structure? Maybe in a few years we wont need any of this, and it will be lost in prctl and madvise wouldn't have changed for this? Again, this is just to have a discussion (and not an aggressive argument :)), and would love to get feedback from everyone in the community. If its too early to have this discussion, its completely fine and we can still keep developing the RFCs :) Thanks! Usama
On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote: > > > On 20/05/2025 18:47, Lorenzo Stoakes wrote: > > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: > >> On 19.05.25 22:52, Lorenzo Stoakes wrote: > >>> REVIEWERS NOTES: > >>> ================ > >>> > >>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm > >>> 'putting it out there' for feedback. Any serious version of this will add a > >>> bunch of self-tests to assert correct behaviour and I will more carefully > >>> confirm everything's working. > >>> > >>> This is based on discussion arising from Usama's series [0], SJ's input on > >>> the thread around process_madvise() behaviour [1] (and a subsequent > >>> response by me [2]) and prior discussion about a new madvise() interface > >>> [3]. > >>> > >>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > >>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > >>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > >>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > >>> > >>> ================ > >>> > >>> Currently, we are rather restricted in how madvise() operations > >>> proceed. While effort has been put in to expanding what process_madvise() > >>> can do (that is - unrestricted application of advice to the local process > >>> alongside recent improvements on the efficiency of TLB operations over > >>> these batvches), we are still constrained by existing madvise() limitations > >>> and default behaviours. > >>> > >>> This series makes use of the currently unused flags field in > >>> process_madvise() to provide more flexiblity. > >>> > >> > >> In general, sounds like an interesting approach. > > > > Thanks! > > > > If you agree this is workable, then I'll go ahead and put some more effort > > into writing tests etc. on the next respin. > > > > So the prctl and process_madvise patches both are trying to accomplish a > similar end goal. > > Would it make sense to discuss what would be the best way forward before we > continue developing the solutions? If we are not at that stage and a clear > picture has not formed yet, happy to continue refining the solutions. > But just thought I would check. As stated in the thread (I went out of my way to link everything above), and stated with you cc'd in every discussion (I think!), this idea arose as a concept that came out of my brainstorming whether we could better align this concept with madvise(). This arose because of discussions had in-thread where we agreed it made most sense to have this feature in effect perform a 'default madvise()'. At this point, we are essentially _duplicating what madvise does_ in the prctl() with your approach, only now all of the code that does that is bitrotting in kernel/sys.c. This approach was an attempt to avoid that. It started as a 'crazy idea' I was throwing out there, as an RFC. The idea was we could compare and contrast with the prctl() RFC. Obviously this is gaining some traction now as a concept and your respin was a little rushed out so needs rework. So, apologies if it seems this is an attempt to supercede your series or such, it truly wasn't intended that way, rather I have been working 12 hour days trying to find the best way possible to _make what you want happen_ while also doing what's best for mm and madvise (among many other tasks of course :) So the idea is we can: a. solve long-standing problems with madvise() that prevent it being used for certain things (e.g. the error on gaps which breaks process_madvise() and hides real errors) b. Also provide this functionality in a way that the absolute most minimal delta from existing logic... c. ...While keeping all this logic in mm and avoiding bitrot in kernel/sys.c. So obviously my view is that this approach is superior to the prctl() one. However the intent was that you would probably take a little longer in spinning up an RFC, and then we could compare the two approaches, if people didn't reject my 'crazy idea' :) Obviously you respan your (kinda ;) RFC way quicker than expected, and then there were a ton of bugs, which added workload and made it perhaps a little more pressing as to deciding which should be pursued. I would suggest holding off on your series until we see whether this one works on this basis. But obviously this is simply my point of view. To be clear however, this was not a planned series of events... I mean equally, we are seeing several other series from Yafang, Nico and Dev in relation to [m]THP and even a obliquely-related series from Barry, so it seems we are in contention across many planes here :) > > I feel like changing process_madvise which was designed to work on an array > of iovec structures to have flags to skip errors and ignore the iovec > makes it function similar to a prctl call is not the right approach. > IMHO, prctl is a more direct solution to this. I don't know what you mean by 'function similar to a prctl call', or what you mean by it being a more direct solution. The problem with prctl() is there is no pattern, it's a dumping ground for arbitrary stuff and a prime place for bitrot. It also means we give up maintainership of mm-specific code. It encourages misalignment with other interfaces and APIs. What is being discussed here is an effort to _better align what you want with an existing interface_ - if we treat this as a 'default madvise()' plus having additional madvise functionality (apply to all, ignoring errors e.g.) - and put all this code _alongside existing madvise code_ - this makes this vastly more maintainable, futureproof and robust. And keep in mind the proposed flags are _flags_ not default behaviours, ones which we will carefully restrict to known flags, starting with the flags you guys need. > > I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise. > But if in the end, we want to have a THP auto way which is truly transparent, > would it not be better to just have this as prctl and not change the madvise > structure? Maybe in a few years we wont need any of this, and it will be lost > in prctl and madvise wouldn't have changed for this? This really sounds a lot like your colleague Shakeel's objection, so I don't want to duplicate my response too much, but as I said to him, at this stage we would set THP mode to 'auto', but still have to support MADV_[NO]HUGEPAGE. We may then wish to set these as no-ops in auto mode right? But they'd still have to exist for uAPI reasons. So would it be better to do this in mm/madvise.c alongside literally all other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or would it be better to throw it in kernel/sys.c and hope people remember to update it? I think it's pretty clear what the answer to that is. > > Again, this is just to have a discussion (and not an aggressive argument :)), > and would love to get feedback from everyone in the community. > If its too early to have this discussion, its completely fine and we can > still keep developing the RFCs :) I try hard never to be aggressive, I am firm when I feel it is appropriate to be so (it's important to push back when necessarily I feel), but I always try to maintain civility as well as I can. Of course I am imperfect, so apologies if it comes across any other way, I really do try very hard to maintain a high standard of professionalism on-list. Again as I said, I really did not intend to supercede or interfere with your series, this was just unfortunate timing and throwing out ideas to reach the best solution. My objection to prctl() is due to bit-rot, mm code in inappropriate places, the fact it by nature lends itself to violating conventions and practices that exist in other mm APIs, not some subjective sense of evil. It is historically a place where 'things that people don't really care about but don't quite want to NACK' go also, and to me suggests that the problem hasn't necessarily been thought through to see how it might be implemented by extending existing APIs or finding ways to achieve the same thing that better align with existing intefaces. To be clear - this concept is _all_ ultimately a product of your series and your ideas leading the discussion within the community to this point where a potential alternative has arisen - without which this would not have occurred. So the idea here is to simply explore what the best solution is that aligns with what best serves the community. > > Thanks! > Usama Thanks, Lorenzo
On 20/05/2025 20:21, Lorenzo Stoakes wrote: > On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote: >> >> >> On 20/05/2025 18:47, Lorenzo Stoakes wrote: >>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: >>>> On 19.05.25 22:52, Lorenzo Stoakes wrote: >>>>> REVIEWERS NOTES: >>>>> ================ >>>>> >>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm >>>>> 'putting it out there' for feedback. Any serious version of this will add a >>>>> bunch of self-tests to assert correct behaviour and I will more carefully >>>>> confirm everything's working. >>>>> >>>>> This is based on discussion arising from Usama's series [0], SJ's input on >>>>> the thread around process_madvise() behaviour [1] (and a subsequent >>>>> response by me [2]) and prior discussion about a new madvise() interface >>>>> [3]. >>>>> >>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ >>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ >>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ >>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ >>>>> >>>>> ================ >>>>> >>>>> Currently, we are rather restricted in how madvise() operations >>>>> proceed. While effort has been put in to expanding what process_madvise() >>>>> can do (that is - unrestricted application of advice to the local process >>>>> alongside recent improvements on the efficiency of TLB operations over >>>>> these batvches), we are still constrained by existing madvise() limitations >>>>> and default behaviours. >>>>> >>>>> This series makes use of the currently unused flags field in >>>>> process_madvise() to provide more flexiblity. >>>>> >>>> >>>> In general, sounds like an interesting approach. >>> >>> Thanks! >>> >>> If you agree this is workable, then I'll go ahead and put some more effort >>> into writing tests etc. on the next respin. >>> >> >> So the prctl and process_madvise patches both are trying to accomplish a >> similar end goal. >> >> Would it make sense to discuss what would be the best way forward before we >> continue developing the solutions? If we are not at that stage and a clear >> picture has not formed yet, happy to continue refining the solutions. >> But just thought I would check. > > As stated in the thread (I went out of my way to link everything above), > and stated with you cc'd in every discussion (I think!), this idea arose as > a concept that came out of my brainstorming whether we could better align > this concept with madvise(). > > This arose because of discussions had in-thread where we agreed it made > most sense to have this feature in effect perform a 'default madvise()'. > > At this point, we are essentially _duplicating what madvise does_ in the > prctl() with your approach, only now all of the code that does that is > bitrotting in kernel/sys.c. > > This approach was an attempt to avoid that. > > It started as a 'crazy idea' I was throwing out there, as an RFC. The idea > was we could compare and contrast with the prctl() RFC. > > Obviously this is gaining some traction now as a concept and your respin > was a little rushed out so needs rework. > > So, apologies if it seems this is an attempt to supercede your series or > such, it truly wasn't intended that way, rather I have been working 12 hour > days trying to find the best way possible to _make what you want happen_ > while also doing what's best for mm and madvise (among many other tasks of > course :) Please don't burn out spending 12 hour days on this! Your feedback is very amazing! (Thanks again!) It will take time to come up with a solution right for the kernel! > > So the idea is we can: > > a. solve long-standing problems with madvise() that prevent it being used > for certain things (e.g. the error on gaps which breaks > process_madvise() and hides real errors) > > b. Also provide this functionality in a way that the absolute most minimal > delta from existing logic... > > c. ...While keeping all this logic in mm and avoiding bitrot in > kernel/sys.c. > > So obviously my view is that this approach is superior to the prctl() one. > > However the intent was that you would probably take a little longer in > spinning up an RFC, and then we could compare the two approaches, if people > didn't reject my 'crazy idea' :) > > Obviously you respan your (kinda ;) RFC way quicker than expected, and then > there were a ton of bugs, which added workload and made it perhaps a little > more pressing as to deciding which should be pursued. > I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple of days to see how things develop and send a prctl RFC (will remember the tag this time) and we can have a better comparison of what this would look like. > I would suggest holding off on your series until we see whether this one > works on this basis. But obviously this is simply my point of view. > > To be clear however, this was not a planned series of events... > > I mean equally, we are seeing several other series from Yafang, Nico and > Dev in relation to [m]THP and even a obliquely-related series from Barry, > so it seems we are in contention across many planes here :) > >> >> I feel like changing process_madvise which was designed to work on an array >> of iovec structures to have flags to skip errors and ignore the iovec >> makes it function similar to a prctl call is not the right approach. >> IMHO, prctl is a more direct solution to this. > > I don't know what you mean by 'function similar to a prctl call', or what > you mean by it being a more direct solution. So I thought I was going a bit crazy, but I am glad someone else raised this as well. If we look at the man page for prctl it says it manipulates various aspects of the behavior of the calling thread or process. If we look at the man page for prcocess_madvise, it was for providing advice for the address ranges described by iovec and n What I mean (and I assume Shakeel meant as well) is this is changing what process_madvise was designed for into changing the behaviour of a process (i.e. prctl). This is what I mean by 'function similar to a prctl call'. > > The problem with prctl() is there is no pattern, it's a dumping ground for > arbitrary stuff and a prime place for bitrot. It also means we give up > maintainership of mm-specific code. > > It encourages misalignment with other interfaces and APIs. > > What is being discussed here is an effort to _better align what you want > with an existing interface_ - if we treat this as a 'default madvise()' > plus having additional madvise functionality (apply to all, ignoring errors > e.g.) - and put all this code _alongside existing madvise code_ - this > makes this vastly more maintainable, futureproof and robust. > > And keep in mind the proposed flags are _flags_ not default behaviours, > ones which we will carefully restrict to known flags, starting with the > flags you guys need. > >> >> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise. >> But if in the end, we want to have a THP auto way which is truly transparent, >> would it not be better to just have this as prctl and not change the madvise >> structure? Maybe in a few years we wont need any of this, and it will be lost >> in prctl and madvise wouldn't have changed for this? > > This really sounds a lot like your colleague Shakeel's objection, so I > don't want to duplicate my response too much, but as I said to him, at this > stage we would set THP mode to 'auto', but still have to support > MADV_[NO]HUGEPAGE. > > We may then wish to set these as no-ops in auto mode right? But they'd > still have to exist for uAPI reasons. > > So would it be better to do this in mm/madvise.c alongside literally all > other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or > would it be better to throw it in kernel/sys.c and hope people remember to > update it? > > I think it's pretty clear what the answer to that is. > >> >> Again, this is just to have a discussion (and not an aggressive argument :)), >> and would love to get feedback from everyone in the community. >> If its too early to have this discussion, its completely fine and we can >> still keep developing the RFCs :) > > I try hard never to be aggressive, I am firm when I feel it is appropriate > to be so (it's important to push back when necessarily I feel), but I > always try to maintain civility as well as I can. > > Of course I am imperfect, so apologies if it comes across any other way, I > really do try very hard to maintain a high standard of professionalism > on-list. > > Again as I said, I really did not intend to supercede or interfere with > your series, this was just unfortunate timing and throwing out ideas to > reach the best solution. > > My objection to prctl() is due to bit-rot, mm code in inappropriate places, > the fact it by nature lends itself to violating conventions and practices > that exist in other mm APIs, not some subjective sense of evil. > > It is historically a place where 'things that people don't really care > about but don't quite want to NACK' go also, and to me suggests that the > problem hasn't necessarily been thought through to see how it might be > implemented by extending existing APIs or finding ways to achieve the same > thing that better align with existing intefaces. > > To be clear - this concept is _all_ ultimately a product of your series and > your ideas leading the discussion within the community to this point where > a potential alternative has arisen - without which this would not have > occurred. > > So the idea here is to simply explore what the best solution is that aligns > with what best serves the community. > >> >> Thanks! >> Usama > > Thanks, Lorenzo
On Tue, May 20, 2025 at 08:42:50PM +0100, Usama Arif wrote: > > > On 20/05/2025 20:21, Lorenzo Stoakes wrote: > > On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote: > >> > >> > >> On 20/05/2025 18:47, Lorenzo Stoakes wrote: > >>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote: > >>>> On 19.05.25 22:52, Lorenzo Stoakes wrote: > >>>>> REVIEWERS NOTES: > >>>>> ================ > >>>>> > >>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm > >>>>> 'putting it out there' for feedback. Any serious version of this will add a > >>>>> bunch of self-tests to assert correct behaviour and I will more carefully > >>>>> confirm everything's working. > >>>>> > >>>>> This is based on discussion arising from Usama's series [0], SJ's input on > >>>>> the thread around process_madvise() behaviour [1] (and a subsequent > >>>>> response by me [2]) and prior discussion about a new madvise() interface > >>>>> [3]. > >>>>> > >>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/ > >>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/ > >>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/ > >>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ > >>>>> > >>>>> ================ > >>>>> > >>>>> Currently, we are rather restricted in how madvise() operations > >>>>> proceed. While effort has been put in to expanding what process_madvise() > >>>>> can do (that is - unrestricted application of advice to the local process > >>>>> alongside recent improvements on the efficiency of TLB operations over > >>>>> these batvches), we are still constrained by existing madvise() limitations > >>>>> and default behaviours. > >>>>> > >>>>> This series makes use of the currently unused flags field in > >>>>> process_madvise() to provide more flexiblity. > >>>>> > >>>> > >>>> In general, sounds like an interesting approach. > >>> > >>> Thanks! > >>> > >>> If you agree this is workable, then I'll go ahead and put some more effort > >>> into writing tests etc. on the next respin. > >>> > >> > >> So the prctl and process_madvise patches both are trying to accomplish a > >> similar end goal. > >> > >> Would it make sense to discuss what would be the best way forward before we > >> continue developing the solutions? If we are not at that stage and a clear > >> picture has not formed yet, happy to continue refining the solutions. > >> But just thought I would check. > > > > As stated in the thread (I went out of my way to link everything above), > > and stated with you cc'd in every discussion (I think!), this idea arose as > > a concept that came out of my brainstorming whether we could better align > > this concept with madvise(). > > > > This arose because of discussions had in-thread where we agreed it made > > most sense to have this feature in effect perform a 'default madvise()'. > > > > At this point, we are essentially _duplicating what madvise does_ in the > > prctl() with your approach, only now all of the code that does that is > > bitrotting in kernel/sys.c. > > > > This approach was an attempt to avoid that. > > > > It started as a 'crazy idea' I was throwing out there, as an RFC. The idea > > was we could compare and contrast with the prctl() RFC. > > > > Obviously this is gaining some traction now as a concept and your respin > > was a little rushed out so needs rework. > > > > So, apologies if it seems this is an attempt to supercede your series or > > such, it truly wasn't intended that way, rather I have been working 12 hour > > days trying to find the best way possible to _make what you want happen_ > > while also doing what's best for mm and madvise (among many other tasks of > > course :) > > Please don't burn out spending 12 hour days on this! Your feedback is very > amazing! (Thanks again!) It will take time to come up with a solution right > for the kernel! Thanks I do try, the point is that everything is in good faith here... I'm just trying to do what's right for the kernel. > > > > > So the idea is we can: > > > > a. solve long-standing problems with madvise() that prevent it being used > > for certain things (e.g. the error on gaps which breaks > > process_madvise() and hides real errors) > > > > b. Also provide this functionality in a way that the absolute most minimal > > delta from existing logic... > > > > c. ...While keeping all this logic in mm and avoiding bitrot in > > kernel/sys.c. > > > > So obviously my view is that this approach is superior to the prctl() one. > > > > However the intent was that you would probably take a little longer in > > spinning up an RFC, and then we could compare the two approaches, if people > > didn't reject my 'crazy idea' :) > > > > Obviously you respan your (kinda ;) RFC way quicker than expected, and then > > there were a ton of bugs, which added workload and made it perhaps a little > > more pressing as to deciding which should be pursued. > > > > I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs > in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple > of days to see how things develop and send a prctl RFC (will remember the tag this > time) and we can have a better comparison of what this would look like. Figure of speech ;) I didn't mean to disparage it, rather it felt rushed. Sorry maybe should have more carefully phrased that. Text is a bad medium etc. > > > I would suggest holding off on your series until we see whether this one > > works on this basis. But obviously this is simply my point of view. > > > > To be clear however, this was not a planned series of events... > > > > I mean equally, we are seeing several other series from Yafang, Nico and > > Dev in relation to [m]THP and even a obliquely-related series from Barry, > > so it seems we are in contention across many planes here :) > > > >> > >> I feel like changing process_madvise which was designed to work on an array > >> of iovec structures to have flags to skip errors and ignore the iovec > >> makes it function similar to a prctl call is not the right approach. > >> IMHO, prctl is a more direct solution to this. > > > > I don't know what you mean by 'function similar to a prctl call', or what > > you mean by it being a more direct solution. > > So I thought I was going a bit crazy, but I am glad someone else raised this as well. > > If we look at the man page for prctl it says it manipulates various aspects > of the behavior of the calling thread or process. > > If we look at the man page for prcocess_madvise, it was for providing advice for the > address ranges described by iovec and n And your proposal is to do precisely the same only also setting the default. > > What I mean (and I assume Shakeel meant as well) is this is changing what > process_madvise was designed for into changing the behaviour of a process (i.e. prctl). > This is what I mean by 'function similar to a prctl call'. Yeah I'm aware of the definition in the man page, it's really nebulous. I mean even 'process' is unclear, do you mean task, do you mean thread leader? Do you mean the shared virtual address space? prctl() already does VMA-specific stuff. The overlap is enormous, and in practice this is not a distinction that is in any real sense maintained. prctl() is, de facto, a 'place where random stuff is put'. And unavoidably you'll end up with mm-specific code (which now _must be exported_) in kernel/sys.c. At any rate, clearly there is still debate to be had here, so let's work on both these series and see where things end up. Obviously I'm open to feedback from ohers, but clearly I have a fairly strong view on this :) > > > > > > The problem with prctl() is there is no pattern, it's a dumping ground for > > arbitrary stuff and a prime place for bitrot. It also means we give up > > maintainership of mm-specific code. > > > > It encourages misalignment with other interfaces and APIs. > > > > What is being discussed here is an effort to _better align what you want > > with an existing interface_ - if we treat this as a 'default madvise()' > > plus having additional madvise functionality (apply to all, ignoring errors > > e.g.) - and put all this code _alongside existing madvise code_ - this > > makes this vastly more maintainable, futureproof and robust. > > > > And keep in mind the proposed flags are _flags_ not default behaviours, > > ones which we will carefully restrict to known flags, starting with the > > flags you guys need. > > > >> > >> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise. > >> But if in the end, we want to have a THP auto way which is truly transparent, > >> would it not be better to just have this as prctl and not change the madvise > >> structure? Maybe in a few years we wont need any of this, and it will be lost > >> in prctl and madvise wouldn't have changed for this? > > > > This really sounds a lot like your colleague Shakeel's objection, so I > > don't want to duplicate my response too much, but as I said to him, at this > > stage we would set THP mode to 'auto', but still have to support > > MADV_[NO]HUGEPAGE. > > > > We may then wish to set these as no-ops in auto mode right? But they'd > > still have to exist for uAPI reasons. > > > > So would it be better to do this in mm/madvise.c alongside literally all > > other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or > > would it be better to throw it in kernel/sys.c and hope people remember to > > update it? > > > > I think it's pretty clear what the answer to that is. > > > >> > >> Again, this is just to have a discussion (and not an aggressive argument :)), > >> and would love to get feedback from everyone in the community. > >> If its too early to have this discussion, its completely fine and we can > >> still keep developing the RFCs :) > > > > I try hard never to be aggressive, I am firm when I feel it is appropriate > > to be so (it's important to push back when necessarily I feel), but I > > always try to maintain civility as well as I can. > > > > Of course I am imperfect, so apologies if it comes across any other way, I > > really do try very hard to maintain a high standard of professionalism > > on-list. > > > > Again as I said, I really did not intend to supercede or interfere with > > your series, this was just unfortunate timing and throwing out ideas to > > reach the best solution. > > > > My objection to prctl() is due to bit-rot, mm code in inappropriate places, > > the fact it by nature lends itself to violating conventions and practices > > that exist in other mm APIs, not some subjective sense of evil. > > > > It is historically a place where 'things that people don't really care > > about but don't quite want to NACK' go also, and to me suggests that the > > problem hasn't necessarily been thought through to see how it might be > > implemented by extending existing APIs or finding ways to achieve the same > > thing that better align with existing intefaces. > > > > To be clear - this concept is _all_ ultimately a product of your series and > > your ideas leading the discussion within the community to this point where > > a potential alternative has arisen - without which this would not have > > occurred. > > > > So the idea here is to simply explore what the best solution is that aligns > > with what best serves the community. > > > >> > >> Thanks! > >> Usama > > > > Thanks, Lorenzo >
On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > 3. PMADV_SET_FORK_EXEC_DEFAULT > > It may be desirable for a user to specify that all VMAs mapped in a process > address space default to having an madvise() behaviour established by > default, in such a fashion as that this persists across fork/exec. Settings that persist across exec are generally a bit dodgy and you have to ask questions like "what happens on setuid execution, could this somehow influence the behavior of a setuid binary in a security-relevant way", and I'm not sure whether that is the case with hugepage flags but I guess it could maybe influence the alignment with which mappings are created or something like that? And if you add more flags to this list later, you'll again have to think about annoying setuid considerations each time. For comparison, personality flags are explicitly supposed to persist across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets cleared only if the execution is privileged. (Annoyingly, the PER_CLEAR_ON_SETID handling is currently implemented separately for each type of privileged execution we can have [setuid/setgid/fscaps/selinux transition/apparmor transition/smack transition], but I guess you could probably gate it on bprm->secureexec instead...). It would be nice if you could either make this a property of the mm_struct that does not persist across exec, or if that would break your intended usecase, alternatively wipe it on privileged execution. > Since this is a very powerful option that would make no sense for many > advice modes, we explicitly only permit known-safe flags here (currently > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE, flag, 0, 0, 0) that also allows opt-in on top of opt-out.
On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote: > On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > 3. PMADV_SET_FORK_EXEC_DEFAULT > > > > It may be desirable for a user to specify that all VMAs mapped in a process > > address space default to having an madvise() behaviour established by > > default, in such a fashion as that this persists across fork/exec. > > Settings that persist across exec are generally a bit dodgy and you > have to ask questions like "what happens on setuid execution, could > this somehow influence the behavior of a setuid binary in a > security-relevant way", and I'm not sure whether that is the case with > hugepage flags but I guess it could maybe influence the alignment with > which mappings are created or something like that? And if you add more > flags to this list later, you'll again have to think about annoying > setuid considerations each time. I do absolutely agree they're dodgy, which is why this is significantly restricted. The intent is that we'd _only ever_ add anything to this list after such careful consideration had been made. And obviously then you can assert whatever checks make sense for each permitted madvise() flag. Obviously this change is motivated by Usama's series, looking at an alternative approach (as well as - at the same time - expanding what we can do with [process_]madvise() - an ongoing bugbear for some time). So this provides a less 'tacked on' (and thus bit rotting) version of this change, to enforce that behaviour is consistent, aligned with madvise() generally, and also importantly, maintained by mm :) > > For comparison, personality flags are explicitly supposed to persist > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets > cleared only if the execution is privileged. (Annoyingly, the > PER_CLEAR_ON_SETID handling is currently implemented separately for > each type of privileged execution we can have > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack > transition], but I guess you could probably gate it on > bprm->secureexec instead...). > > It would be nice if you could either make this a property of the > mm_struct that does not persist across exec, or if that would break > your intended usecase, alternatively wipe it on privileged execution. The use case specifically requires persistence, unfortunately (we are still determining whether this makes sense however - it is by no means a 'done deal' that we're accepting this as a thing). I suppose wiping on privileged execution could be achieved by storing a mask of these permitted flags and clearing that mask in mm->def_flags at this point? Mightn't a better solution be to just require ns_capable(CAP_SYS_ADMIN)? Usama - would wiping these flags on privileged execution, or requiring a privileged process to be able to do this break your use case? > > > Since this is a very powerful option that would make no sense for many > > advice modes, we explicitly only permit known-safe flags here (currently > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only). > > Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE, > flag, 0, 0, 0) that also allows opt-in on top of opt-out. Right. Thanks for taking a look at this! Your input on the security side is absolutely invaluable :)
On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote: > > For comparison, personality flags are explicitly supposed to persist > > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC > > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets > > cleared only if the execution is privileged. (Annoyingly, the > > PER_CLEAR_ON_SETID handling is currently implemented separately for > > each type of privileged execution we can have > > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack > > transition], but I guess you could probably gate it on > > bprm->secureexec instead...). > > > > It would be nice if you could either make this a property of the > > mm_struct that does not persist across exec, or if that would break > > your intended usecase, alternatively wipe it on privileged execution. > > The use case specifically requires persistence, unfortunately (we are still > determining whether this makes sense however - it is by no means a 'done > deal' that we're accepting this as a thing). > > I suppose wiping on privileged execution could be achieved by storing a > mask of these permitted flags and clearing that mask in mm->def_flags at > this point? Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is inheritable? Hmmmm... I guess turning hugepages _off_ should be fine... Yeah I guess I'd do this by adding another bitmask VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying that bitmask on setuid execution.
On Tue, May 20, 2025 at 06:04:47PM +0200, Jann Horn wrote: > On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote: > > > For comparison, personality flags are explicitly supposed to persist > > > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC > > > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets > > > cleared only if the execution is privileged. (Annoyingly, the > > > PER_CLEAR_ON_SETID handling is currently implemented separately for > > > each type of privileged execution we can have > > > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack > > > transition], but I guess you could probably gate it on > > > bprm->secureexec instead...). > > > > > > It would be nice if you could either make this a property of the > > > mm_struct that does not persist across exec, or if that would break > > > your intended usecase, alternatively wipe it on privileged execution. > > > > The use case specifically requires persistence, unfortunately (we are still > > determining whether this makes sense however - it is by no means a 'done > > deal' that we're accepting this as a thing). > > > > I suppose wiping on privileged execution could be achieved by storing a > > mask of these permitted flags and clearing that mask in mm->def_flags at > > this point? > > Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through > mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is > inheritable? Hmmmm... I guess turning hugepages _off_ should be > fine... > > Yeah I guess I'd do this by adding another bitmask > VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying > that bitmask on setuid execution. I guess we could do it this way, as it would only otherwise limit a non-sys admin user, and we should try to keep things as flexible as possible. Let me do this for v2 and see how it works. As it seems there's some general traction here I can also write some tests...
© 2016 - 2025 Red Hat, Inc.