If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll cleanup the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.
Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.
The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole approach of "simply"
clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
and performed a reservation, but copying the page tables fails, we'll
simply clear the VM_PAT flag, not properly undoing the reservation ...
which is also wrong.
So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.
Note that any copied page table entries will get zapped when the VMA gets
removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.
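The ordering described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: struct vma, reserve(),
unreserve(), track_copy(), untrack_copy(), copy_range() and the two
failure-injection switches are hypothetical stand-ins, and pfn is
initialised only so the model has defined behaviour. It shows the flag
being set on the dst VMA only after the reservation succeeded, and the
reservation being dropped (and the flag cleared) when the later copy
step fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define VM_PAT 0x1u	/* stand-in for the kernel flag */

struct vma { unsigned int flags; };

static int reservations;	/* outstanding (modelled) reservations */
static bool fail_reserve;	/* inject a reservation failure */
static bool fail_copy;		/* inject a page table copy failure */

static int reserve(void)
{
	if (fail_reserve)
		return -EINVAL;
	reservations++;
	return 0;
}

static void unreserve(void)
{
	assert(reservations > 0);
	reservations--;
}

/* Models track_pfn_copy(): dst gets VM_PAT only after reserving. */
static int track_copy(struct vma *dst, const struct vma *src,
		      unsigned long *pfn)
{
	int ret;

	if (!(src->flags & VM_PAT))
		return 0;
	ret = reserve();
	if (ret)
		return ret;
	dst->flags |= VM_PAT;
	*pfn = 0x1000;	/* arbitrary example value */
	return 0;
}

/* Models untrack_pfn_copy(): a no-op unless dst carries VM_PAT. */
static void untrack_copy(struct vma *dst, unsigned long pfn)
{
	(void)pfn;	/* the model does not need the value itself */
	if (!(dst->flags & VM_PAT))
		return;
	unreserve();
	dst->flags &= ~VM_PAT;
}

/* Models the error handling in copy_page_range(). */
static int copy_range(struct vma *dst, const struct vma *src)
{
	unsigned long pfn = 0;	/* initialised only for the model */
	int ret = track_copy(dst, src, &pfn);

	if (ret)
		return ret;
	if (fail_copy)
		ret = -ENOMEM;	/* pretend copying page tables failed */
	if (ret)
		untrack_copy(dst, pfn);
	return ret;
}
```

In this model, a failed copy leaves neither a reservation nor the flag
behind, so a later teardown of the dst VMA has nothing PAT-related left
to undo.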
A reproducer usually triggers this on the first try:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
[ 45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
[ 45.241082] Modules linked in: ...
[ 45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
[ 45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 45.252181] RIP: 0010:get_pat_info+0xf6/0x110
...
[ 45.268513] Call Trace:
[ 45.269003] <TASK>
[ 45.269425] ? __warn.cold+0xb7/0x14d
[ 45.270131] ? get_pat_info+0xf6/0x110
[ 45.270846] ? report_bug+0xff/0x140
[ 45.271519] ? handle_bug+0x58/0x90
[ 45.272192] ? exc_invalid_op+0x17/0x70
[ 45.272935] ? asm_exc_invalid_op+0x1a/0x20
[ 45.273717] ? get_pat_info+0xf6/0x110
[ 45.274438] ? get_pat_info+0x71/0x110
[ 45.275165] untrack_pfn+0x52/0x110
[ 45.275835] unmap_single_vma+0xa6/0xe0
[ 45.276549] unmap_vmas+0x105/0x1f0
[ 45.277256] exit_mmap+0xf6/0x460
[ 45.277913] __mmput+0x4b/0x120
[ 45.278512] copy_process+0x1bf6/0x2aa0
[ 45.279264] kernel_clone+0xab/0x440
[ 45.279959] __do_sys_clone+0x66/0x90
[ 45.280650] do_syscall_64+0x95/0x180
Likely this case was missed in commit d155df53f310 ("x86/mm/pat: clear
VM_PAT if copy_p4d_range failed")
... and instead of undoing the reservation we simply cleared the VM_PAT flag.
Keep the documentation of these functions in include/linux/pgtable.h,
one place is more than sufficient -- we should clean that up for the other
functions like track_pfn_remap/untrack_pfn separately.
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
v2 -> v3:
* Make some !MMU configs happy by just moving the code into memtype.c
v1 -> v2:
* Avoid a second get_pat_info() [and thereby fix the error checking]
by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
* Simplify untrack_pfn_copy() by calling untrack_pfn().
* Retested
Not sure if we want to CC stable ... it's really hard to trigger in
sane environments.
---
arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
include/linux/pgtable.h | 28 ++++++++++++++++-----
kernel/fork.c | 4 +++
mm/memory.c | 11 +++------
4 files changed, 58 insertions(+), 37 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6a12bf2..d721cc19addbd 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
return -EINVAL;
}
-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn)
{
+ const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
resource_size_t paddr;
- unsigned long vma_size = vma->vm_end - vma->vm_start;
pgprot_t pgprot;
+ int rc;
- if (vma->vm_flags & VM_PAT) {
- if (get_pat_info(vma, &paddr, &pgprot))
- return -EINVAL;
- /* reserve the whole chunk covered by vma. */
- return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
- }
+ if (!(src_vma->vm_flags & VM_PAT))
+ return 0;
+
+ /*
+ * Duplicate the PAT information for the dst VMA based on the src
+ * VMA.
+ */
+ if (get_pat_info(src_vma, &paddr, &pgprot))
+ return -EINVAL;
+ rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+ if (rc)
+ return rc;
+ /* Reservation for the destination VMA succeeded. */
+ vm_flags_set(dst_vma, VM_PAT);
+ *pfn = PHYS_PFN(paddr);
return 0;
}
+void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
+{
+ untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
+ /*
+ * Reservation was freed, any copied page tables will get cleaned
+ * up later, but without getting PAT involved again.
+ */
+}
+
/*
* prot is passed in as a parameter for the new mapping. If the vma has
* a linear pfn mapping for the entire range, or no vma is provided,
@@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
}
}
-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region, with the old vma after
- * its pfnmap page table has been removed. The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */
void untrack_pfn_clear(struct vm_area_struct *vma)
{
vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372e..4c107e17c547e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
}
/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range(). On success, stores the pfn to be
+ * passed to untrack_pfn_copy().
*/
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn)
{
return 0;
}
+/*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ unsigned long pfn)
+{
+}
+
/*
* untrack_pfn is called while unmapping a pfnmap for a region.
* untrack can be called for a specific region indicated by pfn and size or
@@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
}
/*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VMA after the page tables were moved.
+ * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
*/
static inline void untrack_pfn_clear(struct vm_area_struct *vma)
{
@@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long size);
extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ unsigned long pfn);
extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
unsigned long size, bool mm_wr_locked);
extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..ca2ca3884f763 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
+ /* track_pfn_copy() will later take care of copying internal state. */
+ if (unlikely(new->vm_flags & VM_PFNMAP))
+ untrack_pfn_clear(new);
+
return new;
}
diff --git a/mm/memory.c b/mm/memory.c
index fb7b8dc751679..dc8efa1358e94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1362,12 +1362,12 @@ int
copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
pgd_t *src_pgd, *dst_pgd;
- unsigned long next;
unsigned long addr = src_vma->vm_start;
unsigned long end = src_vma->vm_end;
struct mm_struct *dst_mm = dst_vma->vm_mm;
struct mm_struct *src_mm = src_vma->vm_mm;
struct mmu_notifier_range range;
+ unsigned long next, pfn;
bool is_cow;
int ret;
@@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
- /*
- * We do not free on error cases below as remove_vma
- * gets called on error from higher level routine
- */
- ret = track_pfn_copy(src_vma);
+ ret = track_pfn_copy(dst_vma, src_vma, &pfn);
if (ret)
return ret;
}
@@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
continue;
if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
addr, next))) {
- untrack_pfn_clear(dst_vma);
ret = -ENOMEM;
break;
}
@@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
raw_write_seqcount_end(&src_mm->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
}
+ if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+ untrack_pfn_copy(dst_vma, pfn);
return ret;
}
base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
--
2.48.1
+cc Liam
On Tue, Mar 25, 2025 at 08:19:51PM +0100, David Hildenbrand wrote:
> If track_pfn_copy() fails, we already added the dst VMA to the maple
> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
> the dst VMA for which we neither performed any reservation nor copied
> any page tables.
Ugh god.
This code path seriously worries me (and also Liam :), we have some very
weird cases that can occur here.
>
> Consequently untrack_pfn() will see VM_PAT and try obtaining the
> PAT information from the page table -- which fails because the page
> table was not copied.
Good lord. How have you found such a satanic combination of hellish
factors... :) I'm guessing some terrible splat...
>
> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
> if track_pfn_copy() fails. However, the whole thing is about "simply"
> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
> and performed a reservation, but copying the page tables fails, we'll
> simply clear the VM_PAT flag, not properly undoing the reservation ...
> which is also wrong.
>
> So let's fix it properly: set the VM_PAT flag only if the reservation
> succeeded (leaving it clear initially), and undo the reservation if
> anything goes wrong while copying the page tables: clearing the VM_PAT
> flag after undoing the reservation.
This sounds sensible.
>
> Note that any copied page table entries will get zapped when the VMA will
> get removed later, after copy_page_range() succeeded; as VM_PAT is not set
> then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
> happy. Note that leaving these page tables in place without a reservation
> is not a problem, as we are aborting fork(); this process will never run.
>
> A reproducer can trigger this usually at the first try:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
>
> [ 45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
> [ 45.241082] Modules linked in: ...
> [ 45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
> [ 45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [ 45.252181] RIP: 0010:get_pat_info+0xf6/0x110
> ...
> [ 45.268513] Call Trace:
> [ 45.269003] <TASK>
> [ 45.269425] ? __warn.cold+0xb7/0x14d
> [ 45.270131] ? get_pat_info+0xf6/0x110
> [ 45.270846] ? report_bug+0xff/0x140
> [ 45.271519] ? handle_bug+0x58/0x90
> [ 45.272192] ? exc_invalid_op+0x17/0x70
> [ 45.272935] ? asm_exc_invalid_op+0x1a/0x20
> [ 45.273717] ? get_pat_info+0xf6/0x110
> [ 45.274438] ? get_pat_info+0x71/0x110
> [ 45.275165] untrack_pfn+0x52/0x110
> [ 45.275835] unmap_single_vma+0xa6/0xe0
> [ 45.276549] unmap_vmas+0x105/0x1f0
> [ 45.277256] exit_mmap+0xf6/0x460
> [ 45.277913] __mmput+0x4b/0x120
> [ 45.278512] copy_process+0x1bf6/0x2aa0
> [ 45.279264] kernel_clone+0xab/0x440
> [ 45.279959] __do_sys_clone+0x66/0x90
> [ 45.280650] do_syscall_64+0x95/0x180
>
> Likely this case was missed in commit d155df53f310 ("x86/mm/pat: clear
> VM_PAT if copy_p4d_range failed")
>
> ... and instead of undoing the reservation we simply cleared the VM_PAT flag.
>
> Keep the documentation of these functions in include/linux/pgtable.h,
> one place is more than sufficient -- we should clean that up for the other
> functions like track_pfn_remap/untrack_pfn separately.
>
> Reported-by: xingwei lee <xrivendell7@gmail.com>
> Reported-by: yuxin wang <wang1315768607@163.com>
> Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
> Reported-by: Marius Fleischer <fleischermarius@gmail.com>
> Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Oh OK I see it was reported previously.
> Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
> Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dan Carpenter <dan.carpenter@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>
> v2 -> v3:
> * Make some !MMU configs happy by just moving the code into memtype.c
Obviously we need to make the bots happy once again, re the issue at [0]...
[0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/
Which by the way you... didn't seem to be cc'd on, unless I missed it? I
had to manually add you in which is... weird.
>
> v1 -> v2:
> * Avoid a second get_pat_info() [and thereby fix the error checking]
> by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
> * Simplify untrack_pfn_copy() by calling untrack_pfn().
> * Retested
>
> Not sure if we want to CC stable ... it's really hard to trigger in
> sane environments.
This kind of code path is probably in reality... theoretical. So I'm fine
with this.
>
> ---
> arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
> include/linux/pgtable.h | 28 ++++++++++++++++-----
> kernel/fork.c | 4 +++
> mm/memory.c | 11 +++------
> 4 files changed, 58 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index feb8cc6a12bf2..d721cc19addbd 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
> return -EINVAL;
> }
>
> -/*
> - * track_pfn_copy is called when vma that is covering the pfnmap gets
> - * copied through copy_page_range().
> - *
> - * If the vma has a linear pfn mapping for the entire range, we get the prot
> - * from pte and reserve the entire vma range with single reserve_pfn_range call.
> - */
> -int track_pfn_copy(struct vm_area_struct *vma)
> +int track_pfn_copy(struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma, unsigned long *pfn)
I think we need an additional 'tracked' parameter so we know whether or not
this pfn is valid.
It's kind of icky... see the bot report for context, but we sort of need
to differentiate between 'error' and 'nothing to do'. Of course PFN can
conceivably be 0 so we can't just return that or an error (plus return
values that can be both errors and values are fraught anyway).
So I guess -maybe- the least horrid thing is:
int track_pfn_copy(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, unsigned long *pfn,
bool *pfn_tracked);
Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
do an if (pfn_tracked) untrack_pfn_copy(...).
I'm really not in favour of just initialising PFN to 0 because there are
code paths where this might actually get passed around and used
incorrectly.
But on the other hand - I get that this is disgusting.
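For illustration, a minimal userspace sketch of that out-parameter
pattern follows. It is not the kernel function: struct vma, the
simplified track_copy() signature (no dst VMA) and needs_untrack() are
hypothetical, and the pfn written is just an example value. The point is
only that a separate boolean lets the caller tell "nothing to do" apart
from "tracked, and the pfn happens to be 0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VM_PAT 0x1u	/* stand-in for the kernel flag */

struct vma { unsigned int flags; };

/*
 * Returns 0 on success or a negative value on error. *tracked reports
 * whether *pfn was written, so pfn 0 remains a legitimate value.
 */
static int track_copy(const struct vma *src, unsigned long *pfn,
		      bool *tracked)
{
	assert(pfn != NULL && tracked != NULL);

	*tracked = false;
	if (!(src->flags & VM_PAT))
		return 0;	/* nothing to do, *pfn left untouched */

	*pfn = 0;		/* example: a valid pfn that is zero */
	*tracked = true;
	return 0;
}

/* Caller-side decision: only undo what was actually tracked. */
static bool needs_untrack(int ret, bool tracked)
{
	return ret != 0 && tracked;
}
```

The cost is an extra argument at every call site, which is the ugliness
being weighed here.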
> {
> + const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
> resource_size_t paddr;
> - unsigned long vma_size = vma->vm_end - vma->vm_start;
> pgprot_t pgprot;
> + int rc;
>
> - if (vma->vm_flags & VM_PAT) {
> - if (get_pat_info(vma, &paddr, &pgprot))
> - return -EINVAL;
> - /* reserve the whole chunk covered by vma. */
> - return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> - }
> + if (!(src_vma->vm_flags & VM_PAT))
> + return 0;
I do always like the use of the guard clause pattern :)
But here we have a case where now error = 0, pfn not set, and we will try
to untrack it despite !VM_PAT.
> +
> + /*
> + * Duplicate the PAT information for the dst VMA based on the src
> + * VMA.
> + */
> + if (get_pat_info(src_vma, &paddr, &pgprot))
> + return -EINVAL;
> + rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> + if (rc)
> + return rc;
I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
to use ret in both places.
But also feel free to ignore this.
>
> + /* Reservation for the destination VMA succeeded. */
> + vm_flags_set(dst_vma, VM_PAT);
> + *pfn = PHYS_PFN(paddr);
> return 0;
> }
>
> +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
> +{
> + untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
> + /*
> + * Reservation was freed, any copied page tables will get cleaned
> + * up later, but without getting PAT involved again.
> + */
> +}
> +
> /*
> * prot is passed in as a parameter for the new mapping. If the vma has
> * a linear pfn mapping for the entire range, or no vma is provided,
> @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> }
> }
>
> -/*
> - * untrack_pfn_clear is called if the following situation fits:
> - *
> - * 1) while mremapping a pfnmap for a new region, with the old vma after
> - * its pfnmap page table has been removed. The new vma has a new pfnmap
> - * to the same pfn & cache type with VM_PAT set.
> - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
> - * old vma.
> - */
Is this just wrong now?
> void untrack_pfn_clear(struct vm_area_struct *vma)
> {
> vm_flags_clear(vma, VM_PAT);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 94d267d02372e..4c107e17c547e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> }
>
> /*
> - * track_pfn_copy is called when vma that is covering the pfnmap gets
> - * copied through copy_page_range().
> + * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> + * tables copied during copy_page_range(). On success, stores the pfn to be
> + * passed to untrack_pfn_copy().
> */
> -static inline int track_pfn_copy(struct vm_area_struct *vma)
> +static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma, unsigned long *pfn)
> {
> return 0;
> }
>
> +/*
> + * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
> + * copy_page_range(), but after track_pfn_copy() was already called.
> + */
Do we really care to put a comment like this on a stub function?
> +static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> + unsigned long pfn)
> +{
> +}
> +
> /*
> * untrack_pfn is called while unmapping a pfnmap for a region.
> * untrack can be called for a specific region indicated by pfn and size or
> @@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
> }
>
> /*
> - * untrack_pfn_clear is called while mremapping a pfnmap for a new region
> - * or fails to copy pgtable during duplicate vm area.
> + * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
> + *
> + * 1) During mremap() on the src VMA after the page tables were moved.
> + * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
> */
Can I say as an aside that I hate this kind of hook? Like quite a lot?
I mean I've been looking at mremap() of anon mappings as you know obv. but
the thought of PFN mapping mremap()ing is kind of also a bit ugh.
> static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> {
> @@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> unsigned long size);
> extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> pfn_t pfn);
> -extern int track_pfn_copy(struct vm_area_struct *vma);
> +extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma, unsigned long *pfn);
> +extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> + unsigned long pfn);
> extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> unsigned long size, bool mm_wr_locked);
> extern void untrack_pfn_clear(struct vm_area_struct *vma);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f32..ca2ca3884f763 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
> + /* track_pfn_copy() will later take care of copying internal state. */
> + if (unlikely(new->vm_flags & VM_PFNMAP))
> + untrack_pfn_clear(new);
OK so maybe I'm being stupid here, but - is it the case that
a. We duplicate a VMA that has a PAT-tracked PFN map
b. We must clear any existing tracking so everything is 'reset' to zero
c. track_pfn_copy() will later in fork process set anything up we need here.
Is this correct?
> +
> return new;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fb7b8dc751679..dc8efa1358e94 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1362,12 +1362,12 @@ int
> copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> {
> pgd_t *src_pgd, *dst_pgd;
> - unsigned long next;
> unsigned long addr = src_vma->vm_start;
> unsigned long end = src_vma->vm_end;
> struct mm_struct *dst_mm = dst_vma->vm_mm;
> struct mm_struct *src_mm = src_vma->vm_mm;
> struct mmu_notifier_range range;
> + unsigned long next, pfn;
> bool is_cow;
> int ret;
>
> @@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> - /*
> - * We do not free on error cases below as remove_vma
> - * gets called on error from higher level routine
> - */
> - ret = track_pfn_copy(src_vma);
> + ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> if (ret)
> return ret;
> }
> @@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> continue;
> if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
> addr, next))) {
> - untrack_pfn_clear(dst_vma);
> ret = -ENOMEM;
> break;
> }
> @@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> raw_write_seqcount_end(&src_mm->write_protect_seq);
> mmu_notifier_invalidate_range_end(&range);
> }
> + if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> + untrack_pfn_copy(dst_vma, pfn);
Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
_only_ way we can end up without a valid pfn.
Do we still want to untrack_pfn_copy() even if !VM_PAT?
If not then it seems easier, if a bit gross, to use this 'tracked_pfn'
boolean parameter and then here all we need do is:
if (ret && tracked_pfn) ...
Which then also allows the track_pfn_copy() to assert the fact that we only
care if VM_PFNMAP also... which is maybe some small neatness that comes out
of it?
> return ret;
> }
>
>
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
> --
> 2.48.1
>
>
>>
>> v2 -> v3:
>> * Make some !MMU configs happy by just moving the code into memtype.c
>
> Obviously we need to make the bots happy once again, re the issue at [0]...
>
> [0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/
>
> Which by the way you... didn't seem to be cc'd on, unless I missed it? I
> had to manually add you in which is... weird.
>
>>
>> v1 -> v2:
>> * Avoid a second get_pat_info() [and thereby fix the error checking]
>> by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
>> * Simplify untrack_pfn_copy() by calling untrack_pfn().
>> * Retested
>>
>> Not sure if we want to CC stable ... it's really hard to trigger in
>> sane environments.
>
> This kind of code path is probably in reality... theoretical. So I'm fine
> with this.
>
Thanks a bunch for your review!
>>
>> ---
>> arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
>> include/linux/pgtable.h | 28 ++++++++++++++++-----
>> kernel/fork.c | 4 +++
>> mm/memory.c | 11 +++------
>> 4 files changed, 58 insertions(+), 37 deletions(-)
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index feb8cc6a12bf2..d721cc19addbd 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
>> return -EINVAL;
>> }
>>
>> -/*
>> - * track_pfn_copy is called when vma that is covering the pfnmap gets
>> - * copied through copy_page_range().
>> - *
>> - * If the vma has a linear pfn mapping for the entire range, we get the prot
>> - * from pte and reserve the entire vma range with single reserve_pfn_range call.
>> - */
>> -int track_pfn_copy(struct vm_area_struct *vma)
>> +int track_pfn_copy(struct vm_area_struct *dst_vma,
>> + struct vm_area_struct *src_vma, unsigned long *pfn)
>
> I think we need an additional 'tracked' parameter so we know whether or not
> this pfn is valid.
See below.
>
> It's kind of icky... see the bot report for context, but we we sort of need
> to differentiate between 'error' and 'nothing to do'. Of course PFN can
> conceivably be 0 so we can't just return that or an error (plus return
> values that can be both errors and values are fraught anyway).
>
> So I guess -maybe- least horrid thing is:
>
> int track_pfn_copy(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, unsigned long *pfn,
> bool *pfn_tracked);
>
> Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
> do an if (pfn_tracked) untrack_pfn_copy(...).
>
> I'm really not in favour of just initialising PFN to 0 because there are
> code paths where this might actually get passed around and used
> incorrectly.
>
> But on the other hand - I get that this is disgusting.
I'm in favor of letting VM_PAT take care of that. Observe how
untrack_pfn_copy() -> untrack_pfn() takes care of that by checking for
VM_PAT.
So this should be working as expected? No need to add something on top
that makes it even more ugly in the caller.
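A small userspace sketch of that filtering, under stated assumptions:
struct vma, untrack() and the two bookkeeping variables are hypothetical
stand-ins, not the kernel implementation. It models an untrack helper
that returns early when the flag is not set, so the pfn argument is
never consumed in that case, and that clears the flag once it has freed
the reservation, so a second call does nothing.

```c
#include <assert.h>
#include <stddef.h>

#define VM_PAT 0x1u	/* stand-in for the kernel flag */

struct vma { unsigned int flags; };

static int frees;			/* how often a range was freed */
static unsigned long last_freed_pfn;	/* pfn used by the last free */

/* Flag-gated untrack: the flag decides whether pfn is used at all. */
static void untrack(struct vma *vma, unsigned long pfn)
{
	assert(vma != NULL);

	if (!(vma->flags & VM_PAT))
		return;			/* pfn is ignored on this path */

	last_freed_pfn = pfn;		/* models freeing the range */
	frees++;
	vma->flags &= ~VM_PAT;		/* later calls become no-ops */
}
```

Whether relying on the flag is preferable to an explicit boolean in the
caller is the design question under discussion; the sketch only shows
why the flag check is sufficient in this model.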
>
>
>> {
>> + const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
>> resource_size_t paddr;
>> - unsigned long vma_size = vma->vm_end - vma->vm_start;
>> pgprot_t pgprot;
>> + int rc;
>>
>> - if (vma->vm_flags & VM_PAT) {
>> - if (get_pat_info(vma, &paddr, &pgprot))
>> - return -EINVAL;
>> - /* reserve the whole chunk covered by vma. */
>> - return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
>> - }
>> + if (!(src_vma->vm_flags & VM_PAT))
>> + return 0;
>
> I do always like the use of the guard clause pattern :)
>
> But here we have a case where now error = 0, pfn not set, and we will try
> to untrack it despite !VM_PAT.
Right, and untrack_pfn() is smart enough to filter that out. (just like
for any other invocation)
>
>> +
>> + /*
>> + * Duplicate the PAT information for the dst VMA based on the src
>> + * VMA.
>> + */
>> + if (get_pat_info(src_vma, &paddr, &pgprot))
>> + return -EINVAL;
>> + rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
>> + if (rc)
>> + return rc;
>
> I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
> to use ret in both places.
>
> But also feel free to ignore this.
"int retval;" ? ;)
>
>>
>> + /* Reservation for the destination VMA succeeded. */
>> + vm_flags_set(dst_vma, VM_PAT);
>> + *pfn = PHYS_PFN(paddr);
>> return 0;
>> }
>>
>> +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
>> +{
>> + untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
>> + /*
>> + * Reservation was freed, any copied page tables will get cleaned
>> + * up later, but without getting PAT involved again.
>> + */
>> +}
>> +
>> /*
>> * prot is passed in as a parameter for the new mapping. If the vma has
>> * a linear pfn mapping for the entire range, or no vma is provided,
>> @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>> }
>> }
>>
>> -/*
>> - * untrack_pfn_clear is called if the following situation fits:
>> - *
>> - * 1) while mremapping a pfnmap for a new region, with the old vma after
>> - * its pfnmap page table has been removed. The new vma has a new pfnmap
>> - * to the same pfn & cache type with VM_PAT set.
>> - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
>> - * old vma.
>> - */
>
> This just wrong now?
Note that I'm keeping the doc to a single place -- the stub in the
header. (below)
Or can you elaborate what exactly is "wrong"?
>
>> void untrack_pfn_clear(struct vm_area_struct *vma)
>> {
>> vm_flags_clear(vma, VM_PAT);
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 94d267d02372e..4c107e17c547e 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>> }
>>
>> /*
>> - * track_pfn_copy is called when vma that is covering the pfnmap gets
>> - * copied through copy_page_range().
>> + * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
>> + * tables copied during copy_page_range(). On success, stores the pfn to be
>> + * passed to untrack_pfn_copy().
>> */
>> -static inline int track_pfn_copy(struct vm_area_struct *vma)
>> +static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
>> + struct vm_area_struct *src_vma, unsigned long *pfn)
>> {
>> return 0;
>> }
>>
>> +/*
>> + * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
>> + * copy_page_range(), but after track_pfn_copy() was already called.
>> + */
>
> Do we really care to put a comment like this on a stub function?
Whoever started this beautiful VM_PAT code decided to do it like that:
and I think it's better kept at a single location.
>
>> +static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> + unsigned long pfn)
>> +{
>> +}
>> +
>> /*
>> * untrack_pfn is called while unmapping a pfnmap for a region.
>> * untrack can be called for a specific region indicated by pfn and size or
>> @@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
>> }
>>
>> /*
>> - * untrack_pfn_clear is called while mremapping a pfnmap for a new region
>> - * or fails to copy pgtable during duplicate vm area.
>> + * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
>> + *
>> + * 1) During mremap() on the src VMA after the page tables were moved.
>> + * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
>> */
>
> Can I say as an aside that I hate this kind of hook? Like quite a lot?
>
> I mean I've been looking at mremap() of anon mappings as you know obv. but
> the thought of PFN mapping mremap()ing is kind of also a bit ugh.
I absolutely hate all of that, but I'll have to leave any cleanups to
people with more spare time ;)
>
>> static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>> {
>> @@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>> unsigned long size);
>> extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>> pfn_t pfn);
>> -extern int track_pfn_copy(struct vm_area_struct *vma);
>> +extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>> + struct vm_area_struct *src_vma, unsigned long *pfn);
>> +extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> + unsigned long pfn);
>> extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>> unsigned long size, bool mm_wr_locked);
>> extern void untrack_pfn_clear(struct vm_area_struct *vma);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 735405a9c5f32..ca2ca3884f763 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>> vma_numab_state_init(new);
>> dup_anon_vma_name(orig, new);
>>
>> + /* track_pfn_copy() will later take care of copying internal state. */
>> + if (unlikely(new->vm_flags & VM_PFNMAP))
>> + untrack_pfn_clear(new);
>
> OK so maybe I'm being stupid here, but - is it the case that
>
> a. We duplicate a VMA that has a PAT-tracked PFN map
> b. We must clear any existing tracking so everything is 'reset' to zero
> c. track_pfn_copy() will later in fork process set anything up we need here.
>
> Is this correct?
Right. But b) is actually not "clearing any tracking" (because there is
no tracking/reservation for the copied version yet) but marking it as
"not tracked".
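
The ordering can be sketched in a small userspace model, under loud assumptions: `toy_vma`, `TOY_VM_PFNMAP`, `TOY_VM_PAT`, `toy_vma_dup()` and `toy_track_copy()` are hypothetical stand-ins with arbitrary flag values, and the `reserve_ok` parameter only simulates whether the reservation succeeds. Duplication marks the copy as "not tracked"; only a successful reservation marks it as tracked again:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the VMA flags; the values are arbitrary. */
#define TOY_VM_PFNMAP 0x1UL
#define TOY_VM_PAT    0x2UL

struct toy_vma {
	unsigned long flags;
};

/* Number of live reservations in the toy model. */
static int toy_reservations;

/* Duplicate: copy the flags, then mark the copy as "not tracked". */
static struct toy_vma toy_vma_dup(const struct toy_vma *orig)
{
	struct toy_vma copy = *orig;

	if (copy.flags & TOY_VM_PFNMAP)
		copy.flags &= ~TOY_VM_PAT;
	return copy;
}

/*
 * Track: nothing to do if the source is untracked. Otherwise the copy is
 * marked as tracked only after the (simulated) reservation succeeded.
 */
static int toy_track_copy(struct toy_vma *dst, const struct toy_vma *src,
			  bool reserve_ok)
{
	if (!(src->flags & TOY_VM_PAT))
		return 0;
	if (!reserve_ok)
		return -1;

	toy_reservations++;
	dst->flags |= TOY_VM_PAT;
	return 0;
}
```

In this model a failed reservation leaves the copy without the tracked flag, so later teardown has nothing PAT-related to undo for it.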
>
>> +
>> return new;
>> }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb7b8dc751679..dc8efa1358e94 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1362,12 +1362,12 @@ int
>> copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> {
>> pgd_t *src_pgd, *dst_pgd;
>> - unsigned long next;
>> unsigned long addr = src_vma->vm_start;
>> unsigned long end = src_vma->vm_end;
>> struct mm_struct *dst_mm = dst_vma->vm_mm;
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> struct mmu_notifier_range range;
>> + unsigned long next, pfn;
>> bool is_cow;
>> int ret;
>>
>> @@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>>
>> if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
>> - /*
>> - * We do not free on error cases below as remove_vma
>> - * gets called on error from higher level routine
>> - */
>> - ret = track_pfn_copy(src_vma);
>> + ret = track_pfn_copy(dst_vma, src_vma, &pfn);
>> if (ret)
>> return ret;
>> }
>> @@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> continue;
>> if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
>> addr, next))) {
>> - untrack_pfn_clear(dst_vma);
>> ret = -ENOMEM;
>> break;
>> }
>> @@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> raw_write_seqcount_end(&src_mm->write_protect_seq);
>> mmu_notifier_invalidate_range_end(&range);
>> }
>> + if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
>> + untrack_pfn_copy(dst_vma, pfn);
>
> Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
> _only_ way we can not have a valid pfn.
>
> Do we still want to untrack_pfn_copy() even if !VM_PAT?
Sure, let that be handled internally, where all the ugly VM_PAT handling
resides.
Unless there is very good reason to do it differently.
Thanks!
--
Cheers,
David / dhildenb
TL;DR is I agree with you :P I'm not sure where to put R-b tag given you sent a
fix-patch, as this is obviously smatch/clang-broken as-is so feels wrong to put
on main bit.
I guess I'll put on fix-patch and Andrew? Are you taking this? If so maybe from
there you can propagate?
Thanks!
On Wed, Apr 02, 2025 at 02:20:24PM +0200, David Hildenbrand wrote:
> > >
> > > v2 -> v3:
> > > * Make some !MMU configs happy by just moving the code into memtype.c
> >
> > Obviously we need to make the bots happy once again, re the issue at [0]...
> >
> > [0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/
> >
> > Which by the way you... didn't seem to be cc'd on, unless I missed it? I
> > had to manually add you in which is... weird.
> >
> > >
> > > v1 -> v2:
> > > * Avoid a second get_pat_info() [and thereby fix the error checking]
> > > by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
> > > * Simplify untrack_pfn_copy() by calling untrack_pfn().
> > > * Retested
> > >
> > > Not sure if we want to CC stable ... it's really hard to trigger in
> > > sane environments.
> >
> > This kind of code path is probably in reality... theoretical. So I'm fine
> > with this.
> >
>
> Thanks a bunch for your review!
No probs! :)
>
> > >
> > > ---
> > > arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
> > > include/linux/pgtable.h | 28 ++++++++++++++++-----
> > > kernel/fork.c | 4 +++
> > > mm/memory.c | 11 +++------
> > > 4 files changed, 58 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index feb8cc6a12bf2..d721cc19addbd 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
> > > return -EINVAL;
> > > }
> > >
> > > -/*
> > > - * track_pfn_copy is called when vma that is covering the pfnmap gets
> > > - * copied through copy_page_range().
> > > - *
> > > - * If the vma has a linear pfn mapping for the entire range, we get the prot
> > > - * from pte and reserve the entire vma range with single reserve_pfn_range call.
> > > - */
> > > -int track_pfn_copy(struct vm_area_struct *vma)
> > > +int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn)
> >
> > I think we need an additional 'tracked' parameter so we know whether or not
> > this pfn is valid.
>
> See below.
>
> >
> > > It's kind of icky... see the bot report for context, but we sort of need
> > to differentiate between 'error' and 'nothing to do'. Of course PFN can
> > conceivably be 0 so we can't just return that or an error (plus return
> > values that can be both errors and values are fraught anyway).
> >
> > So I guess -maybe- least horrid thing is:
> >
> > int track_pfn_copy(struct vm_area_struct *dst_vma,
> > struct vm_area_struct *src_vma, unsigned long *pfn,
> > bool *pfn_tracked);
> >
> > Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
> > do an if (pfn_tracked) untrack_pfn_copy(...).
> >
> > I'm really not in favour of just initialising PFN to 0 because there are
> > code paths where this might actually get passed around and used
> > incorrectly.
> >
> > But on the other hand - I get that this is disgusting.
>
> I'm in favor of letting VM_PAT take care of that. Observe how
> untrack_pfn_copy() -> untrack_pfn() takes care of that by checking for
> VM_PAT.
Ahhh ok that makes a big difference.
If that handles it then fine, let's just init to 0.
>
> So this should be working as expected? No need to add something on top that
> makes it even more ugly in the caller.
Yes, agreed, if this is already being handled in the one hideous place let's
make it hideous there only.
But maybe a comment...?
>
> >
> >
> > > {
> > > + const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
> > > resource_size_t paddr;
> > > - unsigned long vma_size = vma->vm_end - vma->vm_start;
> > > pgprot_t pgprot;
> > > + int rc;
> > >
> > > - if (vma->vm_flags & VM_PAT) {
> > > - if (get_pat_info(vma, &paddr, &pgprot))
> > > - return -EINVAL;
> > > - /* reserve the whole chunk covered by vma. */
> > > - return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> > > - }
> > > + if (!(src_vma->vm_flags & VM_PAT))
> > > + return 0;
> >
> > I do always like the use of the guard clause pattern :)
> >
> > But here we have a case where now error = 0, pfn not set, and we will try
> > to untrack it despite !VM_PAT.
>
> Right, and untrack_pfn() is smart enough to filter that out. (just like for
> any other invocation)
Ack.
>
> >
> > > +
> > > + /*
> > > + * Duplicate the PAT information for the dst VMA based on the src
> > > + * VMA.
> > > + */
> > > + if (get_pat_info(src_vma, &paddr, &pgprot))
> > > + return -EINVAL;
> > > + rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> > > + if (rc)
> > > + return rc;
> >
> > I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
> > to use ret in both places.
> >
> > But also feel free to ignore this.
>
> "int retval;" ? ;)
Lol, 'rv'?
Maybe let's leave it as is :P
>
> >
> > >
> > > + /* Reservation for the destination VMA succeeded. */
> > > + vm_flags_set(dst_vma, VM_PAT);
> > > + *pfn = PHYS_PFN(paddr);
> > > return 0;
> > > }
> > >
> > > +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
> > > +{
> > > + untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
> > > + /*
> > > + * Reservation was freed, any copied page tables will get cleaned
> > > + * up later, but without getting PAT involved again.
> > > + */
> > > +}
> > > +
> > > /*
> > > * prot is passed in as a parameter for the new mapping. If the vma has
> > > * a linear pfn mapping for the entire range, or no vma is provided,
> > > @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> > > }
> > > }
> > >
> > > -/*
> > > - * untrack_pfn_clear is called if the following situation fits:
> > > - *
> > > - * 1) while mremapping a pfnmap for a new region, with the old vma after
> > > - * its pfnmap page table has been removed. The new vma has a new pfnmap
> > > - * to the same pfn & cache type with VM_PAT set.
> > > - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
> > > - * old vma.
> > > - */
> >
> > This just wrong now?
>
> Note that I'm keeping the doc to a single place -- the stub in the header.
> (below)
>
> Or can you elaborate what exactly is "wrong"?
Ah ok maybe I just missed this. I was asking whether it was wrong, and this is
why maybe you are removing (perhaps, not very clearly :)
>
> >
> > > void untrack_pfn_clear(struct vm_area_struct *vma)
> > > {
> > > vm_flags_clear(vma, VM_PAT);
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index 94d267d02372e..4c107e17c547e 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > }
> > >
> > > /*
> > > - * track_pfn_copy is called when vma that is covering the pfnmap gets
> > > - * copied through copy_page_range().
> > > + * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> > > + * tables copied during copy_page_range(). On success, stores the pfn to be
> > > + * passed to untrack_pfn_copy().
> > > */
> > > -static inline int track_pfn_copy(struct vm_area_struct *vma)
> > > +static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn)
> > > {
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
> > > + * copy_page_range(), but after track_pfn_copy() was already called.
> > > + */
> >
> > Do we really care to put a comment like this on a stub function?
>
> Whoever started this beautiful VM_PAT code decided to do it like that: and I
> think it's better kept at a single location.
Lol. Fair enough!
>
> >
> > > +static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > + unsigned long pfn)
> > > +{
> > > +}
> > > +
> > > /*
> > > * untrack_pfn is called while unmapping a pfnmap for a region.
> > > * untrack can be called for a specific region indicated by pfn and size or
> > > @@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
> > > }
> > >
> > > /*
> > > - * untrack_pfn_clear is called while mremapping a pfnmap for a new region
> > > - * or fails to copy pgtable during duplicate vm area.
> > > + * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
> > > + *
> > > + * 1) During mremap() on the src VMA after the page tables were moved.
> > > + * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
> > > */
> >
> > Can I say as an aside that I hate this kind of hook? Like quite a lot?
> >
> > I mean I've been looking at mremap() of anon mappings as you know obv. but
> > the thought of PFN mapping mremap()ing is kind of also a bit ugh.
>
> I absolutely hate all of that, but I'll have to leave any cleanups to people
> with more spare time ;)
Lol well... maybe at some point I will find some for this... when things get
ugly enough I find that I make the time in the end ;)
>
> >
> > > static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> > > {
> > > @@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > > unsigned long size);
> > > extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > pfn_t pfn);
> > > -extern int track_pfn_copy(struct vm_area_struct *vma);
> > > +extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn);
> > > +extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > + unsigned long pfn);
> > > extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> > > unsigned long size, bool mm_wr_locked);
> > > extern void untrack_pfn_clear(struct vm_area_struct *vma);
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 735405a9c5f32..ca2ca3884f763 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > vma_numab_state_init(new);
> > > dup_anon_vma_name(orig, new);
> > >
> > > + /* track_pfn_copy() will later take care of copying internal state. */
> > > + if (unlikely(new->vm_flags & VM_PFNMAP))
> > > + untrack_pfn_clear(new);
> >
> > OK so maybe I'm being stupid here, but - is it the case that
> >
> > a. We duplicate a VMA that has a PAT-tracked PFN map
> > > b. We must clear any existing tracking so everything is 'reset' to zero
> > > c. track_pfn_copy() will later in fork process set anything up we need here.
> >
> > Is this correct?
>
> Right. But b) is actually not "clearing any tracking" (because there is no
> tracking/reservation for the copied version yet) but marking it as "not
> tracked".
Ack, thanks!
>
> >
> > > +
> > > return new;
> > > }
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index fb7b8dc751679..dc8efa1358e94 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1362,12 +1362,12 @@ int
> > > copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > {
> > > pgd_t *src_pgd, *dst_pgd;
> > > - unsigned long next;
> > > unsigned long addr = src_vma->vm_start;
> > > unsigned long end = src_vma->vm_end;
> > > struct mm_struct *dst_mm = dst_vma->vm_mm;
> > > struct mm_struct *src_mm = src_vma->vm_mm;
> > > struct mmu_notifier_range range;
> > > + unsigned long next, pfn;
> > > bool is_cow;
> > > int ret;
> > >
> > > @@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
> > >
> > > if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> > > - /*
> > > - * We do not free on error cases below as remove_vma
> > > - * gets called on error from higher level routine
> > > - */
> > > - ret = track_pfn_copy(src_vma);
> > > + ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> > > if (ret)
> > > return ret;
> > > }
> > > @@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > continue;
> > > if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
> > > addr, next))) {
> > > - untrack_pfn_clear(dst_vma);
> > > ret = -ENOMEM;
> > > break;
> > > }
> > > @@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > raw_write_seqcount_end(&src_mm->write_protect_seq);
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > > + if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> > > + untrack_pfn_copy(dst_vma, pfn);
> >
> > Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
> > _only_ way we can not have a valid pfn.
> >
> > Do we still want to untrack_pfn_copy() even if !VM_PAT?
>
> Sure, let that be handled internally, where all the ugly VM_PAT handling
> resides.
>
> Unless there is very good reason to do it differently.
Yeah agreed, missing context for me was that we already handle this.
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>
On 02.04.25 14:31, Lorenzo Stoakes wrote:
> TL;DR is I agree with you :P I'm not sure where to put R-b tag given you sent a
> fix-patch, as this is obviously smatch/clang-broken as-is so feels wrong to put
> on main bit.
I'll respin! :)
>
> I guess I'll put on fix-patch and Andrew? Are you taking this? If so maybe from
> there you can propagate?
[...]
>
> If that handles it then fine, let's just init to 0.
>
>>
>> So this should be working as expected? No need to add something on top that
>> makes it even more ugly in the caller.
>
> Yes, agreed, if this is already being handled in the one hideous place let's
> make it hideous there only.
>
> But maybe a comment...?
I can add that that function handles the need for actual untracking
internally.
[...]
>>>> +
>>>> + /*
>>>> + * Duplicate the PAT information for the dst VMA based on the src
>>>> + * VMA.
>>>> + */
>>>> + if (get_pat_info(src_vma, &paddr, &pgprot))
>>>> + return -EINVAL;
>>>> + rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
>>>> + if (rc)
>>>> + return rc;
>>>
>>> I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
>>> to use ret in both places.
>>>
>>> But also feel free to ignore this.
>>
>> "int retval;" ? ;)
>
> Lol, 'rv'?
>
> Maybe let's leave it as is :P
I think "ret" is used in the file, so I'll use that.
>
>>
>>>
>>>>
>>>> + /* Reservation for the destination VMA succeeded. */
>>>> + vm_flags_set(dst_vma, VM_PAT);
>>>> + *pfn = PHYS_PFN(paddr);
>>>> return 0;
>>>> }
>>>>
>>>> +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
>>>> +{
>>>> + untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
>>>> + /*
>>>> + * Reservation was freed, any copied page tables will get cleaned
>>>> + * up later, but without getting PAT involved again.
>>>> + */
>>>> +}
>>>> +
>>>> /*
>>>> * prot is passed in as a parameter for the new mapping. If the vma has
>>>> * a linear pfn mapping for the entire range, or no vma is provided,
>>>> @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>>>> }
>>>> }
>>>>
>>>> -/*
>>>> - * untrack_pfn_clear is called if the following situation fits:
>>>> - *
>>>> - * 1) while mremapping a pfnmap for a new region, with the old vma after
>>>> - * its pfnmap page table has been removed. The new vma has a new pfnmap
>>>> - * to the same pfn & cache type with VM_PAT set.
>>>> - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
>>>> - * old vma.
>>>> - */
>>>
>>> This just wrong now?
>>
>> Note that I'm keeping the doc to a single place -- the stub in the header.
>> (below)
>>
>> Or can you elaborate what exactly is "wrong"?
>
> Ah ok maybe I just missed this. I was asking whether it was wrong, and this is
> why maybe you are removing (perhaps, not very clearly :)
Ah, sorry. Yes, it's just deduplicated to be adjusted in the other copy :)
[...]
>>> Can I say as an aside that I hate this kind of hook? Like quite a lot?
>>>
>>> I mean I've been looking at mremap() of anon mappings as you know obv. but
>>> the thought of PFN mapping mremap()ing is kind of also a bit ugh.
>>
>> I absolutely hate all of that, but I'll have to leave any cleanups to people
>> with more spare time ;)
>
> Lol well... maybe at some point I will find some for this... when things get
> ugly enough I find that I make the time in the end ;)
We're kind-of attaching metadata to a VMA, that is not directly linked
to the VMA. And the duplication of a VMA cannot handle that, so we defer
copying of that metadata. Hm ...
--
Cheers,
David / dhildenb
On 25.03.25 20:19, David Hildenbrand wrote:
> If track_pfn_copy() fails, we already added the dst VMA to the maple
> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
> the dst VMA for which we neither performed any reservation nor copied
> any page tables.
>
> Consequently untrack_pfn() will see VM_PAT and try obtaining the
> PAT information from the page table -- which fails because the page
> table was not copied.
>
> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
> if track_pfn_copy() fails. However, the whole thing about "simply"
> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
> and performed a reservation, but copying the page tables fails, we'll
> simply clear the VM_PAT flag, not properly undoing the reservation ...
> which is also wrong.
>
> So let's fix it properly: set the VM_PAT flag only if the reservation
> succeeded (leaving it clear initially), and undo the reservation if
> anything goes wrong while copying the page tables: clearing the VM_PAT
> flag after undoing the reservation.
>
> Note that any copied page table entries will get zapped when the VMA will
> get removed later, after copy_page_range() succeeded; as VM_PAT is not set
> then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
> happy. Note that leaving these page tables in place without a reservation
> is not a problem, as we are aborting fork(); this process will never run.
>
> A reproducer can trigger this usually at the first try:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
>
> [ 45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
> [ 45.241082] Modules linked in: ...
> [ 45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
> [ 45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [ 45.252181] RIP: 0010:get_pat_info+0xf6/0x110
> ...
> [ 45.268513] Call Trace:
> [ 45.269003] <TASK>
> [ 45.269425] ? __warn.cold+0xb7/0x14d
> [ 45.270131] ? get_pat_info+0xf6/0x110
> [ 45.270846] ? report_bug+0xff/0x140
> [ 45.271519] ? handle_bug+0x58/0x90
> [ 45.272192] ? exc_invalid_op+0x17/0x70
> [ 45.272935] ? asm_exc_invalid_op+0x1a/0x20
> [ 45.273717] ? get_pat_info+0xf6/0x110
> [ 45.274438] ? get_pat_info+0x71/0x110
> [ 45.275165] untrack_pfn+0x52/0x110
> [ 45.275835] unmap_single_vma+0xa6/0xe0
> [ 45.276549] unmap_vmas+0x105/0x1f0
> [ 45.277256] exit_mmap+0xf6/0x460
> [ 45.277913] __mmput+0x4b/0x120
> [ 45.278512] copy_process+0x1bf6/0x2aa0
> [ 45.279264] kernel_clone+0xab/0x440
> [ 45.279959] __do_sys_clone+0x66/0x90
> [ 45.280650] do_syscall_64+0x95/0x180
>
> Likely this case was missed in commit d155df53f310 ("x86/mm/pat: clear
> VM_PAT if copy_p4d_range failed")
>
> ... and instead of undoing the reservation we simply cleared the VM_PAT flag.
>
> Keep the documentation of these functions in include/linux/pgtable.h,
> one place is more than sufficient -- we should clean that up for the other
> functions like track_pfn_remap/untrack_pfn separately.
>
> Reported-by: xingwei lee <xrivendell7@gmail.com>
> Reported-by: yuxin wang <wang1315768607@163.com>
> Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
> Reported-by: Marius Fleischer <fleischermarius@gmail.com>
> Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
> Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
> Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dan Carpenter <dan.carpenter@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
Apparently smatch is not happy about some scenarios. The following might
make it happy, and make track_pfn_copy() obey the documentation "pfn set
if rc == 0".
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index d721cc19addbd..a51d21d2e5198 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -992,8 +992,10 @@ int track_pfn_copy(struct vm_area_struct *dst_vma,
pgprot_t pgprot;
int rc;
- if (!(src_vma->vm_flags & VM_PAT))
+ if (!(src_vma->vm_flags & VM_PAT)) {
+ *pfn = 0;
return 0;
+ }
/*
* Duplicate the PAT information for the dst VMA based on the src
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 4c107e17c547e..d4b564aacab8f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1515,6 +1515,7 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, unsigned long *pfn)
{
+ *pfn = 0;
return 0;
}
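
The "pfn set if rc == 0" contract from the hunks above can be illustrated with a small userspace model. This is only a sketch: `toy_vma`, `TOY_VM_PAT`, `base_pfn`, `toy_track_copy()` and `toy_copy_range()` are hypothetical names, and the `copy_fails` parameter merely simulates a page table copy failure. The point is that the caller's local `pfn` is written on every path that returns 0, including the "nothing to track" early return:

```c
#include <stdbool.h>

/* Hypothetical stand-in for VM_PAT; the value is arbitrary. */
#define TOY_VM_PAT 0x1UL

struct toy_vma {
	unsigned long flags;
	unsigned long base_pfn;	/* toy replacement for the looked-up pfn */
};

/*
 * Contract being modelled: whenever 0 is returned, *pfn has been written,
 * including on the early return for an untracked source VMA.
 */
static int toy_track_copy(struct toy_vma *dst, const struct toy_vma *src,
			  unsigned long *pfn)
{
	if (!(src->flags & TOY_VM_PAT)) {
		*pfn = 0;
		return 0;
	}

	dst->flags |= TOY_VM_PAT;
	*pfn = src->base_pfn;
	return 0;
}

/*
 * Caller shaped like the error path under discussion: the local pfn is
 * declared without an initializer and is read after a simulated failure.
 * Returns the pfn that would be handed to the untrack helper.
 */
static unsigned long toy_copy_range(struct toy_vma *dst,
				    const struct toy_vma *src,
				    bool copy_fails)
{
	unsigned long pfn;

	if (toy_track_copy(dst, src, &pfn))
		return ~0UL;

	/* Undo the tracking only if it was actually set up. */
	if (copy_fails && (dst->flags & TOY_VM_PAT))
		dst->flags &= ~TOY_VM_PAT;

	return pfn;
}
```

Writing the value inside the tracking helper, rather than initializing the local in the caller, keeps the guarantee in one place for every caller.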
--
Cheers,
David / dhildenb
For the whole thing with this fix-patch:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
On Wed, Apr 02, 2025 at 01:36:32PM +0200, David Hildenbrand wrote:
> On 25.03.25 20:19, David Hildenbrand wrote:
> > If track_pfn_copy() fails, we already added the dst VMA to the maple
> > tree. As fork() fails, we'll cleanup the maple tree, and stumble over
> > the dst VMA for which we neither performed any reservation nor copied
> > any page tables.
> >
> > Consequently untrack_pfn() will see VM_PAT and try obtaining the
> > PAT information from the page table -- which fails because the page
> > table was not copied.
> >
> > The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
> > if track_pfn_copy() fails. However, the whole thing about "simply"
> > clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
> > and performed a reservation, but copying the page tables fails, we'll
> > simply clear the VM_PAT flag, not properly undoing the reservation ...
> > which is also wrong.
> >
> > So let's fix it properly: set the VM_PAT flag only if the reservation
> > succeeded (leaving it clear initially), and undo the reservation if
> > anything goes wrong while copying the page tables: clearing the VM_PAT
> > flag after undoing the reservation.
> >
> > Note that any copied page table entries will get zapped when the VMA will
> > get removed later, after copy_page_range() succeeded; as VM_PAT is not set
> > then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
> > happy. Note that leaving these page tables in place without a reservation
> > is not a problem, as we are aborting fork(); this process will never run.
> >
> > A reproducer can trigger this usually at the first try:
> >
> > https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
> >
> > [ 45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
> > [ 45.241082] Modules linked in: ...
> > [ 45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
> > [ 45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> > [ 45.252181] RIP: 0010:get_pat_info+0xf6/0x110
> > ...
> > [ 45.268513] Call Trace:
> > [ 45.269003] <TASK>
> > [ 45.269425] ? __warn.cold+0xb7/0x14d
> > [ 45.270131] ? get_pat_info+0xf6/0x110
> > [ 45.270846] ? report_bug+0xff/0x140
> > [ 45.271519] ? handle_bug+0x58/0x90
> > [ 45.272192] ? exc_invalid_op+0x17/0x70
> > [ 45.272935] ? asm_exc_invalid_op+0x1a/0x20
> > [ 45.273717] ? get_pat_info+0xf6/0x110
> > [ 45.274438] ? get_pat_info+0x71/0x110
> > [ 45.275165] untrack_pfn+0x52/0x110
> > [ 45.275835] unmap_single_vma+0xa6/0xe0
> > [ 45.276549] unmap_vmas+0x105/0x1f0
> > [ 45.277256] exit_mmap+0xf6/0x460
> > [ 45.277913] __mmput+0x4b/0x120
> > [ 45.278512] copy_process+0x1bf6/0x2aa0
> > [ 45.279264] kernel_clone+0xab/0x440
> > [ 45.279959] __do_sys_clone+0x66/0x90
> > [ 45.280650] do_syscall_64+0x95/0x180
> >
> > Likely this case was missed in commit d155df53f310 ("x86/mm/pat: clear
> > VM_PAT if copy_p4d_range failed")
> >
> > ... and instead of undoing the reservation we simply cleared the VM_PAT flag.
> >
> > Keep the documentation of these functions in include/linux/pgtable.h,
> > one place is more than sufficient -- we should clean that up for the other
> > functions like track_pfn_remap/untrack_pfn separately.
> >
> > Reported-by: xingwei lee <xrivendell7@gmail.com>
> > Reported-by: yuxin wang <wang1315768607@163.com>
> > Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
> > Reported-by: Marius Fleischer <fleischermarius@gmail.com>
> > Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
> > Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
> > Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dan Carpenter <dan.carpenter@linaro.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
>
> Apparently smatch is not happy about some scenarios. The following might
> make it happy, and make track_pfn_copy() obey the documentation "pfn set if
> rc == 0".
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index d721cc19addbd..a51d21d2e5198 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -992,8 +992,10 @@ int track_pfn_copy(struct vm_area_struct *dst_vma,
> pgprot_t pgprot;
> int rc;
>
> - if (!(src_vma->vm_flags & VM_PAT))
> + if (!(src_vma->vm_flags & VM_PAT)) {
> + *pfn = 0;
> return 0;
> + }
>
> /*
> * Duplicate the PAT information for the dst VMA based on the src
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 4c107e17c547e..d4b564aacab8f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1515,6 +1515,7 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, unsigned long *pfn)
> {
> + *pfn = 0;
> return 0;
> }
OK interesting, I would have thought it'd be setting the pfn in the local var,
but this is probably actually better + clearer so we consistently set the value
in track_pfn_copy() (in non-error case).
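
To make the "pfn set if rc == 0" contract concrete, here is a minimal userspace sketch. It is not the kernel code: `sketch_vma`, `SKETCH_VM_PAT`, `sketch_track_pfn_copy()`, the `first_pfn` field and the `reserve_fails` switch are all hypothetical stand-ins, assumed only for illustration. It shows the two points from this thread: the out-parameter is written on every successful return, and the flag is set on the destination only after the (simulated) reservation succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define SKETCH_VM_PAT 0x1UL

struct sketch_vma {
	unsigned long vm_flags;
	unsigned long first_pfn;	/* assumed: first PFN of the mapping */
};

/*
 * Models the documented contract: whenever 0 is returned, *pfn has been
 * written, even if the source VMA carries no PAT tracking at all.
 * reserve_fails simulates a failing reservation.
 */
static int sketch_track_pfn_copy(struct sketch_vma *dst,
				 const struct sketch_vma *src,
				 unsigned long *pfn, bool reserve_fails)
{
	if (!(src->vm_flags & SKETCH_VM_PAT)) {
		*pfn = 0;	/* nothing tracked, but still initialize */
		return 0;
	}

	if (reserve_fails)
		return -1;	/* error path: *pfn is left untouched */

	/* Flag is set only after the reservation succeeded. */
	dst->vm_flags |= SKETCH_VM_PAT;
	*pfn = src->first_pfn;
	return 0;
}

/* Exercises the success, no-PAT and failure paths of the sketch. */
static void sketch_check_contract(void)
{
	struct sketch_vma plain = { .vm_flags = 0, .first_pfn = 7 };
	struct sketch_vma pat = { .vm_flags = SKETCH_VM_PAT, .first_pfn = 7 };
	struct sketch_vma dst = { 0 };
	unsigned long pfn = 99;

	assert(sketch_track_pfn_copy(&dst, &plain, &pfn, false) == 0);
	assert(pfn == 0 && dst.vm_flags == 0);

	assert(sketch_track_pfn_copy(&dst, &pat, &pfn, true) != 0);
	assert(dst.vm_flags == 0);	/* no flag without a reservation */

	assert(sketch_track_pfn_copy(&dst, &pat, &pfn, false) == 0);
	assert(pfn == 7 && (dst.vm_flags & SKETCH_VM_PAT));
}
```

With this shape a caller never reads an uninitialized pfn after a successful return, which is presumably the pattern the static checker was complaining about.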
>
>
>
> --
> Cheers,
>
> David / dhildenb
>
On 02.04.25 14:32, Lorenzo Stoakes wrote:
> For the whole thing with this fix-patch:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
So, we already have now
commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31
Author: David Hildenbrand <david@redhat.com>
Date: Fri Mar 21 12:23:23 2025 +0100
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
So I'll only send a fixup (but will likely keep rc vs. ret unchanged for that).
--
Cheers,
David / dhildenb
On Thu, Apr 03, 2025 at 04:47:47PM +0200, David Hildenbrand wrote:
> On 02.04.25 14:32, Lorenzo Stoakes wrote:
> > For the whole thing with this fix-patch:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
>
> So, we already have now
>
> commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31
> Author: David Hildenbrand <david@redhat.com>
> Date: Fri Mar 21 12:23:23 2025 +0100
>
> x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
>
> So I'll only send a fixup (but will likely keep rc vs. ret unchanged for that).

I'm completely fine with either, as long as the fixup sorts the stupid build
issues, the rest is fine, only nits :)

Feel free to take the tag for things as-is (with fixup!).

>
> --
> Cheers,
>
> David / dhildenb
>