The following commit has been merged into the x86/mm branch of tip:
Commit-ID: d75a256b6a64132fc7aab57ad4c96218e3ae383b
Gitweb: https://git.kernel.org/tip/d75a256b6a64132fc7aab57ad4c96218e3ae383b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
AuthorDate: Tue, 25 Feb 2025 19:37:32
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 25 Feb 2025 20:59:32 +01:00
x86/mm: Clear _PAGE_DIRTY when we clear _PAGE_RW
The bit pattern of _PAGE_DIRTY set and _PAGE_RW clear is used to
mark shadow stacks. This is currently checked for in mk_pte() but
not pfn_pte(). If we add the check to pfn_pte(), it catches vfree()
calling set_direct_map_invalid_noflush() which calls __change_page_attr()
which loads the old protection bits from the PTE, clears the specified
bits and uses pfn_pte() to construct the new PTE.
We should, therefore, clear the _PAGE_DIRTY bit whenever we clear
_PAGE_RW. I opted to do it in the callers in case we want to use
__change_page_attr() to create shadow stacks inside the kernel at some
point in the future. Arguably, we might also want to clear _PAGE_ACCESSED
here.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/oe-lkp/202502241646.719f4651-lkp@intel.com
---
arch/x86/mm/pat/set_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 84d0bca..d174015 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2628,7 +2628,7 @@ static int __set_pages_np(struct page *page, int numpages)
.pgd = NULL,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW),
+ .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY),
.flags = CPA_NO_CHECK_ALIAS };
/*
@@ -2715,7 +2715,7 @@ int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
.pgd = pgd,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(~page_flags & (_PAGE_NX|_PAGE_RW)),
+ .mask_clr = __pgprot(~page_flags & (_PAGE_NX|_PAGE_RW|_PAGE_DIRTY)),
.flags = CPA_NO_CHECK_ALIAS,
};
@@ -2758,7 +2758,7 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
.pgd = pgd,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW),
+ .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY),
.flags = CPA_NO_CHECK_ALIAS,
};
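For context, the shadow-stack encoding referred to above is Write=0, Dirty=1. A sketch of the check, modeled on the pte_shstk() helper in arch/x86/include/asm/pgtable.h (names taken from the mainline headers; treat this as illustrative rather than a quote of any particular kernel version):

	static inline bool pte_shstk(pte_t pte)
	{
		/* Shadow-stack PTE: hardware Dirty=1 with Write=0 */
		return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
		       (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
	}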
On Tue, 25 Feb 2025 at 12:10, tip-bot2 for Matthew Wilcox (Oracle)
<tip-bot2@linutronix.de> wrote:
>
> We should, therefore, clear the _PAGE_DIRTY bit whenever we clear
> _PAGE_RW. I opted to do it in the callers in case we want to use
> __change_page_attr() to create shadow stacks inside the kernel at some
> point in the future. Arguably, we might also want to clear _PAGE_ACCESSED
> here.
This explanation makes ZERO sense, and screams "this is a major bug" to me.
If a page is dirty, it doesn't magically turn clean just because it
becomes read-only. The dirty data remains and may need to be written
back to memory.
Imagine writing to a shared memory area, and then marking it all
read-only after you're done. It's still dirty, even if it's read-only.
Now, I don't actually expect this patch to be wrong, I'm literally
just complaining about the explanation. Because the explanation is
very lacking. That's particularly true for the __set_pages_np() case
which also clears _PAGE_PRESENT, because then the whole shadow stacks
explanation flies right out the window: the shadow stack rules simply
do NOT APPLY to non-present pte's in the first place.
So honestly, I think this wants an explanation for why it's actually a
safe change, and how the dirty bit has been saved before the
operation.
Linus
On Tue, Feb 25, 2025 at 01:07:08PM -0800, Linus Torvalds wrote:
> On Tue, 25 Feb 2025 at 12:10, tip-bot2 for Matthew Wilcox (Oracle)
> <tip-bot2@linutronix.de> wrote:
> >
> > We should, therefore, clear the _PAGE_DIRTY bit whenever we clear
> > _PAGE_RW. I opted to do it in the callers in case we want to use
> > __change_page_attr() to create shadow stacks inside the kernel at some
> > point in the future. Arguably, we might also want to clear _PAGE_ACCESSED
> > here.
>
> This explanation makes ZERO sense, and screams "this is a major bug" to me.
>
> If a page is dirty, it doesn't magically turn clean just because it
> becomes read-only. The dirty data remains and may need to be written
> back to memory.
Are you saying that the PTE dirty bit controls whether the CPU flushes
cache back to memory? That isn't how I understand CPUs to work.
> Imagine writing to a shared memory area, and then marking it all
> read-only after you're done. It's still dirty, even if it's read-only.
>
> Now, I don't actually expect this patch to be wrong, I'm literally
> just complaining about the explanation. Because the explanation is
> very lacking. That's particularly true for the __set_pages_np() case
> which also clears _PAGE_PRESENT, because then the whole shadow stacks
> explanation flies right out the window: the shadow stack rules simply
> do NOT APPLY to non-present pte's in the first place.
Dave and I talked about that case. We were concerned not that _this_
manipulation would lead to a shadow stack entry appearing (since the
present bit is being cleared), but that the next manipulation would just
set the present bit without setting the RW bit and we'd accidentally end
up with one.
> So honestly, I think this wants an explanation for why it's actually a
> safe change, and how the dirty bit has been saved before the
> operation.
I don't understand why the dirty bit needs to be saved. At least in
the vfree() case, we're freeing the memory so any unflushed writes to
it may as well disappear. But I don't know all the circumstances in
which these functions are called.
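To make that two-step hazard concrete, here is a flag-level sketch (illustrative only; it uses the real _PAGE_* macros but is not kernel code) of how a later remap could accidentally produce the shadow-stack encoding if _PAGE_DIRTY were left behind:

	/* A live kernel PTE that has been written to (CPU set Dirty): */
	pteval_t flags = _PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY;

	/* Step 1: set_direct_map_invalid_noflush() clears Present and RW.
	 * Without this patch, Dirty survives: flags == _PAGE_DIRTY. */
	flags &= ~(_PAGE_PRESENT | _PAGE_RW);

	/* Step 2: a later remap sets Present back without requesting RW. */
	flags |= _PAGE_PRESENT;

	/* Result: RW=0, Dirty=1 -- exactly the shadow-stack encoding. */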
On Tue, 25 Feb 2025 at 18:45, Matthew Wilcox <willy@infradead.org> wrote:
>
> Are you saying that the PTE dirty bit controls whether the CPU flushes
> cache back to memory? That isn't how I understand CPUs to work.
No, I'm saying that dropping the dirty bit drops information.
Yes, yes, we have the SW-dirty bit, and hopefully that is indeed
always set, but the way we test for "is this page dirty and needs
writeback" is literally to test *both*.
return pte_flags(pte) & _PAGE_DIRTY_BITS;
> I don't understand why the dirty bit needs to be saved. At least in
> the vfree() case, we're freeing the memory so any unflushed writes to
> it may as well disappear. But I don't know all the circumstances in
> which these functions are called.
I'm not saying that it needs to be saved.
But I *am* saying that it needs to be explained why dropping it is fine.
And I am also saying that your explanation for why it should be
cleared makes no sense, since two out of three cases also cleared
_PAGE_PRESENT, at which point the whole shadow stack issue is
completely irrelevant.
So please explain *why* clearing PAGE_DIRTY is ok. Don't bring up
issues like the shadow stack setting that is irrelevant for at least
two of the cases that you changed.
If all of these are purely used for vmalloc() or direct mappings, then
*that* is a valid explanation ("we don't care about dirty bits on
kernel mappings"), for example.
Linus
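For reference, the test Linus quotes is, roughly, the x86 pte_dirty() helper: on shadow-stack-capable kernels it checks both the hardware Dirty bit and the software SavedDirty bit. A sketch based on arch/x86/include/asm/pgtable_types.h and pgtable.h (exact names may differ by kernel version):

	#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)

	static inline bool pte_dirty(pte_t pte)
	{
		/* Dirty if either the hardware Dirty bit or the software
		 * SavedDirty bit (used to preserve dirtiness while RW is
		 * transiently cleared) is set. */
		return pte_flags(pte) & _PAGE_DIRTY_BITS;
	}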
On Tue, Feb 25, 2025 at 07:31:01PM -0800, Linus Torvalds wrote:
> > I don't understand why the dirty bit needs to be saved. At least in
> > the vfree() case, we're freeing the memory so any unflushed writes to
> > it may as well disappear. But I don't know all the circumstances in
> > which these functions are called.
>
> I'm not saying that it needs to be saved.
>
> But I *am* saying that it needs to be explained why dropping it is fine.
>
> And I am also saying that your explanation for why it should be
> cleared makes no sense, since two out of three cases also cleared
> _PAGE_PRESENT, at which point the whole shadow stack issue is
> completely irrelevant.
>
> So please explain *why* clearing PAGE_DIRTY is ok. Don't bring up
> issues like the shadow stack setting that is irrelevant for at least
> two of the cases that you changed.
>
> If all of these are purely used for vmalloc() or direct mappings, then
> *that* is a valid explanation ("we don't care about dirty bits on
> kernel mappings"), for example.
I think the entire point of this file is to manipulate kernel mappings.
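A minimal illustrative caller of this file's API, using the real set_memory_*() helpers from asm/set_memory.h on a vmalloc mapping; note the addresses are always kernel virtual addresses, never user ones:

	#include <linux/vmalloc.h>
	#include <asm/set_memory.h>

	void *buf = vmalloc(PAGE_SIZE);

	/* Make the kernel mapping read-only, then writable again. */
	set_memory_ro((unsigned long)buf, 1);
	set_memory_rw((unsigned long)buf, 1);
	vfree(buf);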
On Tue, 25 Feb 2025 at 20:00, Matthew Wilcox <willy@infradead.org> wrote:
>
> I think the entire point of this file is to manipulate kernel mappings.
Very likely. But just looking at the patch, it was very non-obvious.
Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 25 Feb 2025 at 20:00, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > I think the entire point of this file is to manipulate kernel mappings.
>
> Very likely. But just looking at the patch, it was very non-obvious.
Yeah, agreed - and I've extended the changelog the following way:
Subject: [PATCH] x86/mm: Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
The bit pattern of _PAGE_DIRTY set and _PAGE_RW clear is used to mark
shadow stacks. This is currently checked for in mk_pte() but not
pfn_pte(). If we add the check to pfn_pte(), it catches vfree()
calling set_direct_map_invalid_noflush() which calls
__change_page_attr() which loads the old protection bits from the
PTE, clears the specified bits and uses pfn_pte() to construct the
new PTE.
We should, therefore, for kernel mappings, clear the _PAGE_DIRTY bit
consistently whenever we clear _PAGE_RW. I opted to do it in the
callers in case we want to use __change_page_attr() to create shadow
stacks inside the kernel at some point in the future. Arguably, we
might also want to clear _PAGE_ACCESSED here.
Note that the 3 functions involved:
__set_pages_np()
kernel_map_pages_in_pgd()
kernel_unmap_pages_in_pgd()
only ever manipulate non-swappable kernel mappings, so the DIRTY:1|RW:0
special pattern for shadow stacks and the DIRTY:0 pattern for
non-shadow-stack entries can be maintained consistently, without
unintended clearing of a live dirty bit that could corrupt (destroy)
dirty bit information for user mappings.
Is this explanation better?
Thanks,
Ingo
On Wed, 26 Feb 2025 at 02:56, Ingo Molnar <mingo@kernel.org> wrote:
>
> Is this explanation better?
Ack. Just making it clear that it's always kernel pages, never a user
address range makes it fine in my book,
Linus
The following commit has been merged into the x86/mm branch of tip:
Commit-ID: c1fcf41cf37f7a3fd3bbf6f0c04aba3ea4258888
Gitweb: https://git.kernel.org/tip/c1fcf41cf37f7a3fd3bbf6f0c04aba3ea4258888
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
AuthorDate: Tue, 25 Feb 2025 19:37:32
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 27 Feb 2025 09:58:17 +01:00
x86/mm: Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
The bit pattern of _PAGE_DIRTY set and _PAGE_RW clear is used to mark
shadow stacks. This is currently checked for in mk_pte() but not
pfn_pte(). If we add the check to pfn_pte(), it catches vfree()
calling set_direct_map_invalid_noflush() which calls
__change_page_attr() which loads the old protection bits from the
PTE, clears the specified bits and uses pfn_pte() to construct the
new PTE.
We should, therefore, for kernel mappings, clear the _PAGE_DIRTY bit
consistently whenever we clear _PAGE_RW. I opted to do it in the
callers in case we want to use __change_page_attr() to create shadow
stacks inside the kernel at some point in the future. Arguably, we
might also want to clear _PAGE_ACCESSED here.
Note that the 3 functions involved:
__set_pages_np()
kernel_map_pages_in_pgd()
kernel_unmap_pages_in_pgd()
only ever manipulate non-swappable kernel mappings, so the DIRTY:1|RW:0
special pattern for shadow stacks and the DIRTY:0 pattern for
non-shadow-stack entries can be maintained consistently, without
unintended clearing of a live dirty bit that could corrupt (destroy)
dirty bit information for user mappings.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/174051422675.10177.13226545170101706336.tip-bot2@tip-bot2
Closes: https://lore.kernel.org/oe-lkp/202502241646.719f4651-lkp@intel.com
---
arch/x86/mm/pat/set_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 84d0bca..d174015 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2628,7 +2628,7 @@ static int __set_pages_np(struct page *page, int numpages)
.pgd = NULL,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW),
+ .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY),
.flags = CPA_NO_CHECK_ALIAS };
/*
@@ -2715,7 +2715,7 @@ int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
.pgd = pgd,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(~page_flags & (_PAGE_NX|_PAGE_RW)),
+ .mask_clr = __pgprot(~page_flags & (_PAGE_NX|_PAGE_RW|_PAGE_DIRTY)),
.flags = CPA_NO_CHECK_ALIAS,
};
@@ -2758,7 +2758,7 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
.pgd = pgd,
.numpages = numpages,
.mask_set = __pgprot(0),
- .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW),
+ .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY),
.flags = CPA_NO_CHECK_ALIAS,
};
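To see what the kernel_map_pages_in_pgd() hunk changes in practice, here is a worked example (illustrative arithmetic only): suppose a caller passes page_flags = _PAGE_NX, i.e. NX requested and RW deliberately not requested. Then:

	/* before the patch */
	mask_clr = ~_PAGE_NX & (_PAGE_NX | _PAGE_RW)
	         = _PAGE_RW;

	/* after the patch */
	mask_clr = ~_PAGE_NX & (_PAGE_NX | _PAGE_RW | _PAGE_DIRTY)
	         = _PAGE_RW | _PAGE_DIRTY;

so any stale hardware Dirty bit is now cleared together with RW, and the resulting PTE can no longer carry the RW=0, Dirty=1 shadow-stack pattern.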