[PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()

David Hildenbrand posted 1 patch 1 year, 3 months ago
arch/x86/mm/pat/memtype.c | 66 +++++++++++++++++++++++++--------------
include/linux/pgtable.h   | 27 ++++++++++++----
kernel/fork.c             |  4 +++
mm/memory.c               |  9 ++----
4 files changed, 70 insertions(+), 36 deletions(-)
[PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 1 year, 3 months ago
If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll clean up the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.

Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.

The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole thing about "simply"
clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
and performed a reservation, but copying the page tables fails, we'll
simply clear the VM_PAT flag, not properly undoing the reservation ...
which is also wrong.

So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.

Note that any copied page table entries will get zapped when the VMA
gets removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.
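
To illustrate the intended pairing, here is a minimal userspace sketch (not
kernel code). All toy_* names, the reservation counter, and the boolean
failure knobs are hypothetical stand-ins for track_pfn_copy(),
untrack_pfn_copy(), reserve_pfn_range() and free_pfn_range(); the dst VMA is
assumed to start with the flag clear, as vm_area_dup() arranges in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins only; none of these are the kernel's types or flags. */
#define TOY_VM_PAT 0x1u

struct toy_vma {
	unsigned int flags;
};

/* Number of outstanding (not yet freed) reservations in this model. */
int toy_reservations;

/* Hypothetical model of reserve_pfn_range(): 0 on success, negative on error. */
int toy_reserve(bool fail)
{
	if (fail)
		return -1;
	toy_reservations++;
	return 0;
}

/* Hypothetical model of free_pfn_range(). */
void toy_free(void)
{
	assert(toy_reservations > 0);
	toy_reservations--;
}

/* Set the flag on dst only after the reservation succeeded. */
int toy_track_copy(struct toy_vma *dst, const struct toy_vma *src,
		   bool reserve_fails)
{
	int rc;

	if (!(src->flags & TOY_VM_PAT))
		return 0;
	rc = toy_reserve(reserve_fails);
	if (!rc)
		dst->flags |= TOY_VM_PAT;
	return rc;
}

/* Undo the reservation first, then clear the flag on dst. */
void toy_untrack_copy(struct toy_vma *dst)
{
	if (!(dst->flags & TOY_VM_PAT))
		return;
	toy_free();
	dst->flags &= ~TOY_VM_PAT;
}

/* Model of the copy_page_range() flow: track, copy, untrack on failure. */
int toy_copy_page_range(struct toy_vma *dst, const struct toy_vma *src,
			bool reserve_fails, bool copy_fails)
{
	int ret = toy_track_copy(dst, src, reserve_fails);

	if (ret)
		return ret;
	if (copy_fails)
		ret = -2;	/* stands in for -ENOMEM from the page table copy */
	if (ret)
		toy_untrack_copy(dst);
	return ret;
}
```

With src carrying the flag and dst starting clear, a failure in either step
should leave dst without the flag and the reservation count back at zero;
only a full success leaves one reservation and the flag set on dst.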

A reproducer [1] can trigger this usually at the first try:

[   45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
[   45.241082] Modules linked in: ...
[   45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
[   45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[   45.252181] RIP: 0010:get_pat_info+0xf6/0x110
...
[   45.268513] Call Trace:
[   45.269003]  <TASK>
[   45.269425]  ? __warn.cold+0xb7/0x14d
[   45.270131]  ? get_pat_info+0xf6/0x110
[   45.270846]  ? report_bug+0xff/0x140
[   45.271519]  ? handle_bug+0x58/0x90
[   45.272192]  ? exc_invalid_op+0x17/0x70
[   45.272935]  ? asm_exc_invalid_op+0x1a/0x20
[   45.273717]  ? get_pat_info+0xf6/0x110
[   45.274438]  ? get_pat_info+0x71/0x110
[   45.275165]  untrack_pfn+0x52/0x110
[   45.275835]  unmap_single_vma+0xa6/0xe0
[   45.276549]  unmap_vmas+0x105/0x1f0
[   45.277256]  exit_mmap+0xf6/0x460
[   45.277913]  __mmput+0x4b/0x120
[   45.278512]  copy_process+0x1bf6/0x2aa0
[   45.279264]  kernel_clone+0xab/0x440
[   45.279959]  __do_sys_clone+0x66/0x90
[   45.280650]  do_syscall_64+0x95/0x180

Likely this case was missed in commit d155df53f310
("x86/mm/pat: clear VM_PAT if copy_p4d_range failed"), and instead of
undoing the reservation we simply cleared the VM_PAT flag.

Keep the documentation of these functions in include/linux/pgtable.h,
one place is more than sufficient -- we should clean that up for the other
functions like track_pfn_remap/untrack_pfn separately.

[1] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c

Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ma Wupeng <mawupeng1@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 66 +++++++++++++++++++++++++--------------
 include/linux/pgtable.h   | 27 ++++++++++++----
 kernel/fork.c             |  4 +++
 mm/memory.c               |  9 ++----
 4 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6a12bf..3a9e6dd58e2f 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,27 +984,54 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
 	return -EINVAL;
 }
 
-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
+	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
 	resource_size_t paddr;
-	unsigned long vma_size = vma->vm_end - vma->vm_start;
 	pgprot_t pgprot;
+	int rc;
 
-	if (vma->vm_flags & VM_PAT) {
-		if (get_pat_info(vma, &paddr, &pgprot))
-			return -EINVAL;
-		/* reserve the whole chunk covered by vma. */
-		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!(src_vma->vm_flags & VM_PAT))
+		return 0;
+
+	/*
+	 * Duplicate the PAT information for the dst VMA based on the src
+	 * VMA.
+	 */
+	if (get_pat_info(src_vma, &paddr, &pgprot))
+		return -EINVAL;
+	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!rc)
+		/* Reservation for the destination VMA succeeded. */
+		vm_flags_set(dst_vma, VM_PAT);
+	return rc;
+}
+
+void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+	resource_size_t paddr;
+	unsigned long size;
+
+	if (!(dst_vma->vm_flags & VM_PAT))
+		return;
+
+	/*
+	 * As the page tables might not have been copied yet, the PAT
+	 * information is obtained from the src VMA, just like during
+	 * track_pfn_copy().
+	 */
+	if (!get_pat_info(src_vma, &paddr, NULL)) {
+		size = src_vma->vm_end - src_vma->vm_start;
+		free_pfn_range(paddr, size);
 	}
 
-	return 0;
+	/*
+	 * Reservation was freed, any copied page tables will get cleaned
+	 * up later, but without getting PAT involved again.
+	 */
+	vm_flags_clear(dst_vma, VM_PAT);
 }
 
 /*
@@ -1095,15 +1122,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	}
 }
 
-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region,  with the old vma after
- * its pfnmap page table has been removed.  The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */
 void untrack_pfn_clear(struct vm_area_struct *vma)
 {
 	vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8b2ac6bd2ae..616707b4ecb8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1518,14 +1518,24 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 }
 
 /*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range().
  */
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
 	return 0;
 }
 
+/*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+}
+
 /*
  * untrack_pfn is called while unmapping a pfnmap for a region.
  * untrack can be called for a specific region indicated by pfn and size or
@@ -1538,8 +1548,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
 }
 
 /*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VMA after the page tables were moved.
+ * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
  */
 static inline void untrack_pfn_clear(struct vm_area_struct *vma)
 {
@@ -1550,7 +1562,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long size);
 extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 			     pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 			unsigned long size, bool mm_wr_locked);
 extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..02a7a8b44107 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
+	/* track_pfn_copy() will later take care of copying internal state. */
+	if (unlikely(new->vm_flags & VM_PFNMAP))
+		untrack_pfn_clear(new);
+
 	return new;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 3ccee51adfbb..f7fbf099e8f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1372,11 +1372,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
 	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		/*
-		 * We do not free on error cases below as remove_vma
-		 * gets called on error from higher level routine
-		 */
-		ret = track_pfn_copy(src_vma);
+		ret = track_pfn_copy(dst_vma, src_vma);
 		if (ret)
 			return ret;
 	}
@@ -1413,7 +1409,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 			continue;
 		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
 					    addr, next))) {
-			untrack_pfn_clear(dst_vma);
 			ret = -ENOMEM;
 			break;
 		}
@@ -1423,6 +1418,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		raw_write_seqcount_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
 	}
+	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+		untrack_pfn_copy(dst_vma, src_vma);
 	return ret;
 }
 

base-commit: 0f4cb420b38489c9bab9d091c3815714be8cb69d
-- 
2.47.0
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Peter Xu 1 year, 3 months ago
On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
> If track_pfn_copy() fails, we already added the dst VMA to the maple
> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
> the dst VMA for which we neither performed any reservation nor copied
> any page tables.
> 
> Consequently untrack_pfn() will see VM_PAT and try obtaining the
> PAT information from the page table -- which fails because the page
> table was not copied.
> 
> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
> if track_pfn_copy() fails. However, the whole thing is about "simply"
> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
> and performed a reservation, but copying the page tables fails, we'll
> simply clear the VM_PAT flag, not properly undoing the reservation ...
> which is also wrong.

David,

Sorry for not having had a chance yet to reply to your other email.

The only concern I have with the current fix to fork() is that we have
started to have device drivers providing fault() on PFNMAPs, as vfio-pci
does. I think this means we could potentially hit the same issue even
without fork(), as long as the 1st pgtable entry of the PFNMAP range is
not mapped when the process with the VM_PAT VMA exit()s or munmap()s the
VMA.

So I do feel like at some point we still need to make get_pat_info() work
without walking the pgtable, so as to fix all possible such issues.

Thanks,

-- 
Peter Xu
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 1 year, 3 months ago
On 30.10.24 22:32, Peter Xu wrote:
> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
>> If track_pfn_copy() fails, we already added the dst VMA to the maple
>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
>> the dst VMA for which we neither performed any reservation nor copied
>> any page tables.
>>
>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
>> PAT information from the page table -- which fails because the page
>> table was not copied.
>>
>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
>> if track_pfn_copy() fails. However, the whole thing is about "simply"
>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
>> and performed a reservation, but copying the page tables fails, we'll
>> simply clear the VM_PAT flag, not properly undoing the reservation ...
>> which is also wrong.
> 
> David,
> 

Hi Peter,

> Sorry to not have chance yet reply to your other email..
> 
> The only concern I have with the current fix to fork() is.. we started to
> have device drivers providing fault() on PFNMAPs as vfio-pci does, then I
> think it means we could potentially start to hit the same issue even
> without fork(), but as long as the 1st pgtable entry of the PFNMAP range is
> not mapped when the process with VM_PAT vma exit()s, or munmap() the vma.

As these drivers are not using remap_pfn_range, there is no way they 
could currently get VM_PAT set.

So what you describe is independent of the current state we are fixing 
here, and this fix should sort out the issues with current VM_PAT handling.

It indeed is an interesting question how to handle reservations when 
*not* using remap_pfn_range() to cover the whole area.

remap_pfn_range() handles VM_PAT automatically because it can do it: it
knows that the whole range will map consecutive PFNs with the same
protection, and we do not expect parts of the range to suddenly get
unmapped (and any driver that does that is buggy).

This behavior is, however, not guaranteed to be the case when 
remap_pfn_range() is *not* called on the whole range.

For that case (e.g., vfio-pci) I still wonder if the driver shouldn't do
the reservation and leave VM_PAT alone.

In the driver, we'd do the reservation once and not worry about fork() 
etc ... and we'd undo the reservation once the last relevant VM_PFNMAP 
VMA is gone or the driver lets go of the device. I assume there are
already mechanisms in place to deal with that to some degree, because 
the driver cannot go away while any VMA still has the VM_PFNMAP mapping 
-- otherwise something would be seriously messed up.
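
As a rough userspace sketch of that idea (not kernel code): the reservation
is taken when the first VMA appears and dropped when the last one goes away,
so fork() duplicating a VMA only bumps a counter. struct toy_dev and the
toy_vma_open()/toy_vma_close() helpers are hypothetical and merely mimic what
a driver's VMA open/close callbacks might do.

```c
#include <assert.h>

/* Hypothetical per-device state; not an existing kernel structure. */
struct toy_dev {
	unsigned int vma_users;		/* live VM_PFNMAP VMAs for this device */
	unsigned int reservations;	/* 0 or 1 in this model */
};

/* Called for mmap() and for each VMA duplicated by fork(). */
void toy_vma_open(struct toy_dev *dev)
{
	/* Only the first user takes the reservation. */
	if (dev->vma_users++ == 0)
		dev->reservations++;
}

/* Called when a VMA goes away (munmap() or process exit). */
void toy_vma_close(struct toy_dev *dev)
{
	assert(dev->vma_users > 0);
	/* Only the last user drops the reservation. */
	if (--dev->vma_users == 0)
		dev->reservations--;
}
```

In this model a failing fork() needs no special handling: the duplicated VMA
either got opened and will be closed again, or it never touched the counter.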

Long story short: let's look into not using VM_PAT for that use case.

Looking at the VM_PAT issues we had over time, not making it more 
complicated sounds like a very reasonable thing to me :)

-- 
Cheers,

David / dhildenb
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by mawupeng 1 year, 3 months ago

On 2024/10/31 17:47, David Hildenbrand wrote:
> On 30.10.24 22:32, Peter Xu wrote:
>> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
>>> If track_pfn_copy() fails, we already added the dst VMA to the maple
>>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
>>> the dst VMA for which we neither performed any reservation nor copied
>>> any page tables.
>>>
>>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
>>> PAT information from the page table -- which fails because the page
>>> table was not copied.
>>>
>>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
>>> if track_pfn_copy() fails. However, the whole thing is about "simply"
>>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
>>> and performed a reservation, but copying the page tables fails, we'll
>>> simply clear the VM_PAT flag, not properly undoing the reservation ...
>>> which is also wrong.
>>
>> David,
>>
> 
> Hi Peter,
> 
>> Sorry to not have chance yet reply to your other email..
>>
>> The only concern I have with the current fix to fork() is.. we started to
>> have device drivers providing fault() on PFNMAPs as vfio-pci does, then I
>> think it means we could potentially start to hit the same issue even
>> without fork(), but as long as the 1st pgtable entry of the PFNMAP range is
>> not mapped when the process with VM_PAT vma exit()s, or munmap() the vma.
> 
> As these drivers are not using remap_pfn_range, there is no way they could currently get VM_PAT set.
> 
> So what you describe is independent of the current state we are fixing here, and this fix should sort out the issues with current VM_PAT handling.
> 
> It indeed is an interesting question how to handle reservations when *not* using remap_pfn_range() to cover the whole area.
> 
> remap_pfn_range() handles VM_PAT automatically because it can do it: it knows that the whole range will map consecutive PFNs with the same protection, and we expect not parts of the range suddenly getting unmapped (and any driver that does that is buggy).
> 
> This behavior is, however, not guaranteed to be the case when remap_pfn_range() is *not* called on the whole range.
> 
> For that case (i.e., vfio-pci) I still wonder if the driver shouldn't do the reservation and leave VM_PAT alone.
> 
> In the driver, we'd do the reservation once and not worry about fork() etc ... and we'd undo the reservation once the last relevant VM_PFNMAP VMA is gone or the driver let's go of the device. I assume there are already mechanisms in place to deal with that to some degree, because the driver cannot go away while any VMA still has the VM_PFNMAP mapping -- otherwise something would be seriously messed up.
> 
> Long story short: let's look into not using VM_PAT for that use case.
> 
> Looking at the VM_PAT issues we had over time, not making it more complicated sounds like a very reasonable thing to me :)

Hi David,

The VM_PAT reservation does seem complicated. It can trigger the same warning in get_pat_info() if remap_p4d_range() fails:

remap_pfn_range
  remap_pfn_range_notrack
    remap_pfn_range_internal
      remap_p4d_range	// page allocation can fail here
    zap_page_range_single
      unmap_single_vma
        untrack_pfn
          get_pat_info
            WARN_ON_ONCE(1);

Any idea on this problem?

>
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 1 year, 3 months ago
On 07.11.24 09:43, mawupeng wrote:
> 
> 
> On 2024/10/31 17:47, David Hildenbrand wrote:
>> On 30.10.24 22:32, Peter Xu wrote:
>>> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
>>>> If track_pfn_copy() fails, we already added the dst VMA to the maple
>>>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
>>>> the dst VMA for which we neither performed any reservation nor copied
>>>> any page tables.
>>>>
>>>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
>>>> PAT information from the page table -- which fails because the page
>>>> table was not copied.
>>>>
>>>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
>>>> if track_pfn_copy() fails. However, the whole thing is about "simply"
>>>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
>>>> and performed a reservation, but copying the page tables fails, we'll
>>>> simply clear the VM_PAT flag, not properly undoing the reservation ...
>>>> which is also wrong.
>>>
>>> David,
>>>
>>
>> Hi Peter,
>>
>>> Sorry to not have chance yet reply to your other email..
>>>
>>> The only concern I have with the current fix to fork() is.. we started to
>>> have device drivers providing fault() on PFNMAPs as vfio-pci does, then I
>>> think it means we could potentially start to hit the same issue even
>>> without fork(), but as long as the 1st pgtable entry of the PFNMAP range is
>>> not mapped when the process with VM_PAT vma exit()s, or munmap() the vma.
>>
>> As these drivers are not using remap_pfn_range, there is no way they could currently get VM_PAT set.
>>
>> So what you describe is independent of the current state we are fixing here, and this fix should sort out the issues with current VM_PAT handling.
>>
>> It indeed is an interesting question how to handle reservations when *not* using remap_pfn_range() to cover the whole area.
>>
>> remap_pfn_range() handles VM_PAT automatically because it can do it: it knows that the whole range will map consecutive PFNs with the same protection, and we expect not parts of the range suddenly getting unmapped (and any driver that does that is buggy).
>>
>> This behavior is, however, not guaranteed to be the case when remap_pfn_range() is *not* called on the whole range.
>>
>> For that case (i.e., vfio-pci) I still wonder if the driver shouldn't do the reservation and leave VM_PAT alone.
>>
>> In the driver, we'd do the reservation once and not worry about fork() etc ... and we'd undo the reservation once the last relevant VM_PFNMAP VMA is gone or the driver let's go of the device. I assume there are already mechanisms in place to deal with that to some degree, because the driver cannot go away while any VMA still has the VM_PFNMAP mapping -- otherwise something would be seriously messed up.
>>
>> Long story short: let's look into not using VM_PAT for that use case.
>>
>> Looking at the VM_PAT issues we had over time, not making it more complicated sounds like a very reasonable thing to me :)
> 
> Hi David,
> 
> The VM_PAT reservation do seems complicated. It can trigger the same warning in get_pat_info if remap_p4d_range fails:
> 
> remap_pfn_range
>    remap_pfn_range_notrack
>      remap_pfn_range_internal
>        remap_p4d_range	// page allocation can failed here
>      zap_page_range_single
>        unmap_single_vma
>          untrack_pfn
>            get_pat_info
>              WARN_ON_ONCE(1);
> 
> Any idea on this problem?

In remap_pfn_range(), if remap_pfn_range_notrack() fails, we call 
untrack_pfn(), to undo the tracking.

The problem is that zap_page_range_single() shouldn't do that 
untrack_pfn() call.

That should be fixed by Peter's patch:

https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u

-- 
Cheers,

David / dhildenb
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Fedor Pchelkin 10 months ago
Hi, David, Peter

Sorry for reviving an old thread. I've tried to keep the context as-is.
Here is the original link in the archives:
https://lore.kernel.org/lkml/20241029210331.1339581-1-david@redhat.com/T/#u

Please see below.

On 07.11.24 10:08, David Hildenbrand wrote:
> On 07.11.24 09:43, mawupeng wrote:
> > On 2024/10/31 17:47, David Hildenbrand wrote:
> >> On 30.10.24 22:32, Peter Xu wrote:
> >>> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
> >>>> If track_pfn_copy() fails, we already added the dst VMA to the maple
> >>>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
> >>>> the dst VMA for which we neither performed any reservation nor copied
> >>>> any page tables.
> >>>>
> >>>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
> >>>> PAT information from the page table -- which fails because the page
> >>>> table was not copied.
> >>>>
> >>>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
> >>>> if track_pfn_copy() fails. However, the whole thing is about "simply"
> >>>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
> >>>> and performed a reservation, but copying the page tables fails, we'll
> >>>> simply clear the VM_PAT flag, not properly undoing the reservation ...
> >>>> which is also wrong.
> >>>
> >>> David,
> >>>
> >>
> >> Hi Peter,
> >>
> >>> Sorry to not have chance yet reply to your other email..
> >>>
> >>> The only concern I have with the current fix to fork() is.. we started to
> >>> have device drivers providing fault() on PFNMAPs as vfio-pci does, then I
> >>> think it means we could potentially start to hit the same issue even
> >>> without fork(), but as long as the 1st pgtable entry of the PFNMAP range is
> >>> not mapped when the process with VM_PAT vma exit()s, or munmap() the vma.
> >>
> >> As these drivers are not using remap_pfn_range, there is no way they could currently get VM_PAT set.
> >>
> >> So what you describe is independent of the current state we are fixing here, and this fix should sort out the issues with current VM_PAT handling.
> >>
> >> It indeed is an interesting question how to handle reservations when *not* using remap_pfn_range() to cover the whole area.
> >>
> >> remap_pfn_range() handles VM_PAT automatically because it can do it: it knows that the whole range will map consecutive PFNs with the same protection, and we expect not parts of the range suddenly getting unmapped (and any driver that does that is buggy).
> >>
> >> This behavior is, however, not guaranteed to be the case when remap_pfn_range() is *not* called on the whole range.
> >>
> >> For that case (i.e., vfio-pci) I still wonder if the driver shouldn't do the reservation and leave VM_PAT alone.
> >>
> >> In the driver, we'd do the reservation once and not worry about fork() etc ... and we'd undo the reservation once the last relevant VM_PFNMAP VMA is gone or the driver let's go of the device. I assume there are already mechanisms in place to deal with that to some degree, because the driver cannot go away while any VMA still has the VM_PFNMAP mapping -- otherwise something would be seriously messed up.
> >>
> >> Long story short: let's look into not using VM_PAT for that use case.
> >>
> >> Looking at the VM_PAT issues we had over time, not making it more complicated sounds like a very reasonable thing to me :)
> > 
> > Hi David,
> > 
> > The VM_PAT reservation do seems complicated. It can trigger the same warning in get_pat_info if remap_p4d_range fails:
> > 
> > remap_pfn_range
> >    remap_pfn_range_notrack
> >      remap_pfn_range_internal
> >        remap_p4d_range	// page allocation can failed here
> >      zap_page_range_single
> >        unmap_single_vma
> >          untrack_pfn
> >            get_pat_info
> >              WARN_ON_ONCE(1);
> > 
> > Any idea on this problem?
> 
> In remap_pfn_range(), if remap_pfn_range_notrack() fails, we call 
> untrack_pfn(), to undo the tracking.
> 
> The problem is that zap_page_range_single() shouldn't do that 
> untrack_pfn() call.
> 
> That should be fixed by Peter's patch:
> 
> https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u


The fix seemingly has not been applied, so the issue in question still
persists. There is a long thread on that patch without an explicit
conclusion. Did the patch cause any problems, or has its status changed?


Thanks for your time!


> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 10 months ago
On 07.04.25 10:43, Fedor Pchelkin wrote:
> Hi, David, Peter
> 
> Sorry for reviving an old thread. I've tried to keep the context as-is.
> Here is an original link in the archives:
> https://lore.kernel.org/lkml/20241029210331.1339581-1-david@redhat.com/T/#u
> 
> Please see below.
> 
> On 07.11.24 10:08, David Hildenbrand wrote
>> On 07.11.24 09:43, mawupeng wrote:
>>> On 2024/10/31 17:47, David Hildenbrand wrote:
>>>> On 30.10.24 22:32, Peter Xu wrote:
>>>>> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
>>>>>> If track_pfn_copy() fails, we already added the dst VMA to the maple
>>>>>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
>>>>>> the dst VMA for which we neither performed any reservation nor copied
>>>>>> any page tables.
>>>>>>
>>>>>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
>>>>>> PAT information from the page table -- which fails because the page
>>>>>> table was not copied.
>>>>>>
>>>>>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
>>>>>> if track_pfn_copy() fails. However, the whole thing is about "simply"
>>>>>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
>>>>>> and performed a reservation, but copying the page tables fails, we'll
>>>>>> simply clear the VM_PAT flag, not properly undoing the reservation ...
>>>>>> which is also wrong.
>>>>>
>>>>> David,
>>>>>
>>>>
>>>> Hi Peter,
>>>>
>>>>> Sorry to not have chance yet reply to your other email..
>>>>>
>>>>> The only concern I have with the current fix to fork() is.. we started to
>>>>> have device drivers providing fault() on PFNMAPs as vfio-pci does, then I
>>>>> think it means we could potentially start to hit the same issue even
>>>>> without fork(), but as long as the 1st pgtable entry of the PFNMAP range is
>>>>> not mapped when the process with VM_PAT vma exit()s, or munmap() the vma.
>>>>
>>>> As these drivers are not using remap_pfn_range, there is no way they could currently get VM_PAT set.
>>>>
>>>> So what you describe is independent of the current state we are fixing here, and this fix should sort out the issues with current VM_PAT handling.
>>>>
>>>> It indeed is an interesting question how to handle reservations when *not* using remap_pfn_range() to cover the whole area.
>>>>
>>>> remap_pfn_range() handles VM_PAT automatically because it can do it: it knows that the whole range will map consecutive PFNs with the same protection, and we expect not parts of the range suddenly getting unmapped (and any driver that does that is buggy).
>>>>
>>>> This behavior is, however, not guaranteed to be the case when remap_pfn_range() is *not* called on the whole range.
>>>>
>>>> For that case (i.e., vfio-pci) I still wonder if the driver shouldn't do the reservation and leave VM_PAT alone.
>>>>
>>>> In the driver, we'd do the reservation once and not worry about fork() etc ... and we'd undo the reservation once the last relevant VM_PFNMAP VMA is gone or the driver let's go of the device. I assume there are already mechanisms in place to deal with that to some degree, because the driver cannot go away while any VMA still has the VM_PFNMAP mapping -- otherwise something would be seriously messed up.
>>>>
>>>> Long story short: let's look into not using VM_PAT for that use case.
>>>>
>>>> Looking at the VM_PAT issues we had over time, not making it more complicated sounds like a very reasonable thing to me :)
>>>
>>> Hi David,
>>>
>>> The VM_PAT reservation do seems complicated. It can trigger the same warning in get_pat_info if remap_p4d_range fails:
>>>
>>> remap_pfn_range
>>>     remap_pfn_range_notrack
>>>       remap_pfn_range_internal
>>>         remap_p4d_range	// page allocation can failed here
>>>       zap_page_range_single
>>>         unmap_single_vma
>>>           untrack_pfn
>>>             get_pat_info
>>>               WARN_ON_ONCE(1);
>>>
>>> Any idea on this problem?
>>
>> In remap_pfn_range(), if remap_pfn_range_notrack() fails, we call
>> untrack_pfn(), to undo the tracking.
>>
>> The problem is that zap_page_range_single() shouldn't do that
>> untrack_pfn() call.
>>
>> That should be fixed by Peter's patch:
>>
>> https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u
> 
> 
> The fix seemingly has not been applied, so the issue in question still
> persists. There is a long thread on that patch without an explicit
> conclusion. Did the patch cause any problems, or has its status changed?

That one still needs to be applied. Peter is currently out for a couple 
of weeks; I might be able to revive that in the meantime.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling when fork() fails in copy_page_range()
Posted by mawupeng 1 year, 3 months ago

On 2024/11/7 17:08, David Hildenbrand wrote:
> On 07.11.24 09:43, mawupeng wrote:
>>
>>
>> On 2024/10/31 17:47, David Hildenbrand wrote:
>>> On 30.10.24 22:32, Peter Xu wrote:
>>>> On Tue, Oct 29, 2024 at 10:03:31PM +0100, David Hildenbrand wrote:
>>>>> If track_pfn_copy() fails, we already added the dst VMA to the maple
>>>>> tree. As fork() fails, we'll cleanup the maple tree, and stumble over
>>>>> the dst VMA for which we neither performed any reservation nor copied
>>>>> any page tables.
>>>>>
>>>>> Consequently untrack_pfn() will see VM_PAT and try obtaining the
>>>>> PAT information from the page table -- which fails because the page
>>>>> table was not copied.
>>>>>
>>>>> The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
>>>>> if track_pfn_copy() fails. However, the whole idea of "simply"
>>>>> clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
>>>>> and performed a reservation, but copying the page tables fails, we'll
>>>>> simply clear the VM_PAT flag, not properly undoing the reservation ...
>>>>> which is also wrong.
>>>>
>>>> David,
>>>>
>>>
>>> Hi Peter,
>>>
>>>> Sorry, I have not had a chance yet to reply to your other email.
>>>>
>>>> The only concern I have with the current fix to fork() is that we started
>>>> to have device drivers providing fault() on PFNMAPs, as vfio-pci does. I
>>>> think this means we could potentially start to hit the same issue even
>>>> without fork(), as long as the 1st pgtable entry of the PFNMAP range is
>>>> not mapped when the process with the VM_PAT vma exit()s or munmap()s the vma.
>>>
>>> As these drivers are not using remap_pfn_range, there is no way they could currently get VM_PAT set.
>>>
>>> So what you describe is independent of the current state we are fixing here, and this fix should sort out the issues with current VM_PAT handling.
>>>
>>> It indeed is an interesting question how to handle reservations when *not* using remap_pfn_range() to cover the whole area.
>>>
>>> remap_pfn_range() handles VM_PAT automatically because it can: it knows that the whole range will map consecutive PFNs with the same protection, and we do not expect parts of the range to suddenly get unmapped (any driver that does that is buggy).
>>>
>>> This behavior is, however, not guaranteed to be the case when remap_pfn_range() is *not* called on the whole range.
>>>
>>> For that case (i.e., vfio-pci) I still wonder if the driver shouldn't do the reservation and leave VM_PAT alone.
>>>
>>> In the driver, we'd do the reservation once and not worry about fork() etc ... and we'd undo the reservation once the last relevant VM_PFNMAP VMA is gone or the driver lets go of the device. I assume there are already mechanisms in place to deal with that to some degree, because the driver cannot go away while any VMA still has the VM_PFNMAP mapping -- otherwise something would be seriously messed up.
>>>
>>> Long story short: let's look into not using VM_PAT for that use case.
>>>
>>> Looking at the VM_PAT issues we had over time, not making it more complicated sounds like a very reasonable thing to me :)
>>
>> Hi David,
>>
>> The VM_PAT reservation does seem complicated. The same warning in get_pat_info() can be triggered if remap_p4d_range() fails:
>>
>> remap_pfn_range
>>    remap_pfn_range_notrack
>>      remap_pfn_range_internal
>>        remap_p4d_range    // page allocation can fail here
>>      zap_page_range_single
>>        unmap_single_vma
>>          untrack_pfn
>>            get_pat_info
>>              WARN_ON_ONCE(1);
>>
>> Any idea on this problem?
> 
> In remap_pfn_range(), if remap_pfn_range_notrack() fails, we call untrack_pfn(), to undo the tracking.
> 
> The problem is that zap_page_range_single() shouldn't do that untrack_pfn() call.
> 
> That should be fixed by Peter's patch:
> 
> https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u

Thank you for your prompt reply.

This does fix the issue.

> 

[tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by tip-bot2 for David Hildenbrand 11 months, 1 week ago
The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     4e1c520c95849e16f8dfbcacbfd37be5330447b9
Gitweb:        https://git.kernel.org/tip/4e1c520c95849e16f8dfbcacbfd37be5330447b9
Author:        David Hildenbrand <david@redhat.com>
AuthorDate:    Tue, 29 Oct 2024 22:03:31 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 03 Mar 2025 13:39:14 +01:00

x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()

If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll cleanup the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.

Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.

The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole idea of "simply"
clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
and performed a reservation, but copying the page tables fails, we'll
simply clear the VM_PAT flag, not properly undoing the reservation ...
which is also wrong.

So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.

Note that any copied page table entries will get zapped when the VMA gets
removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.
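
The ordering described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: struct model_vma, the
model_* helpers and the model_reservations counter are hypothetical
stand-ins for the VMA, track_pfn_copy()/untrack_pfn_copy() and the PAT
reservation bookkeeping, and the two "fails" parameters merely inject the
two failure points.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_vma {
	bool pat;          /* stands in for the VM_PAT flag */
	bool has_pgtables; /* stands in for copied page tables */
};

static int model_reservations; /* outstanding PAT reservations */

/* Reserve for dst, and flag dst only if the reservation succeeded. */
static int model_track_copy(struct model_vma *dst, const struct model_vma *src,
			    bool reserve_fails)
{
	if (!src->pat)
		return 0;
	if (reserve_fails)
		return -1;
	model_reservations++;
	dst->pat = true;
	return 0;
}

/* Undo the reservation first, then clear the flag. */
static void model_untrack_copy(struct model_vma *dst)
{
	if (!dst->pat)
		return;
	assert(model_reservations > 0);
	model_reservations--;
	dst->pat = false;
}

/* Returns 0 on success, a negative value on failure. */
static int model_copy_range(struct model_vma *dst, const struct model_vma *src,
			    bool reserve_fails, bool copy_fails)
{
	int ret;

	dst->pat = false; /* the duplicated VMA starts without the flag */
	ret = model_track_copy(dst, src, reserve_fails);
	if (ret)
		return ret;

	ret = copy_fails ? -1 : 0;
	if (!ret)
		dst->has_pgtables = true;
	else
		model_untrack_copy(dst);
	return ret;
}
```

In this model, whichever of the two steps fails, dst ends up without the
flag and the reservation count is back where it started, so a later
teardown of dst has nothing PAT-related left to undo.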

A reproducer usually triggers this on the first try:

  https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c

  [   45.239440] WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
  [   45.241082] Modules linked in: ...
  [   45.249119] CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
  [   45.250598] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
  [   45.252181] RIP: 0010:get_pat_info+0xf6/0x110
  ...
  [   45.268513] Call Trace:
  [   45.269003]  <TASK>
  [   45.269425]  ? __warn.cold+0xb7/0x14d
  [   45.270131]  ? get_pat_info+0xf6/0x110
  [   45.270846]  ? report_bug+0xff/0x140
  [   45.271519]  ? handle_bug+0x58/0x90
  [   45.272192]  ? exc_invalid_op+0x17/0x70
  [   45.272935]  ? asm_exc_invalid_op+0x1a/0x20
  [   45.273717]  ? get_pat_info+0xf6/0x110
  [   45.274438]  ? get_pat_info+0x71/0x110
  [   45.275165]  untrack_pfn+0x52/0x110
  [   45.275835]  unmap_single_vma+0xa6/0xe0
  [   45.276549]  unmap_vmas+0x105/0x1f0
  [   45.277256]  exit_mmap+0xf6/0x460
  [   45.277913]  __mmput+0x4b/0x120
  [   45.278512]  copy_process+0x1bf6/0x2aa0
  [   45.279264]  kernel_clone+0xab/0x440
  [   45.279959]  __do_sys_clone+0x66/0x90
  [   45.280650]  do_syscall_64+0x95/0x180

Likely this case was missed in:

  d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")

... and instead of undoing the reservation we simply cleared the VM_PAT flag.

Keep the documentation of these functions in include/linux/pgtable.h,
one place is more than sufficient -- we should clean that up for the other
functions like track_pfn_remap/untrack_pfn separately.

Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241029210331.1339581-1-david@redhat.com
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 66 ++++++++++++++++++++++++--------------
 include/linux/pgtable.h   | 27 ++++++++++++----
 kernel/fork.c             |  4 ++-
 mm/memory.c               |  9 +----
 4 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6..3a9e6dd 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,27 +984,54 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
 	return -EINVAL;
 }
 
-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
+	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
 	resource_size_t paddr;
-	unsigned long vma_size = vma->vm_end - vma->vm_start;
 	pgprot_t pgprot;
+	int rc;
 
-	if (vma->vm_flags & VM_PAT) {
-		if (get_pat_info(vma, &paddr, &pgprot))
-			return -EINVAL;
-		/* reserve the whole chunk covered by vma. */
-		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!(src_vma->vm_flags & VM_PAT))
+		return 0;
+
+	/*
+	 * Duplicate the PAT information for the dst VMA based on the src
+	 * VMA.
+	 */
+	if (get_pat_info(src_vma, &paddr, &pgprot))
+		return -EINVAL;
+	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!rc)
+		/* Reservation for the destination VMA succeeded. */
+		vm_flags_set(dst_vma, VM_PAT);
+	return rc;
+}
+
+void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+	resource_size_t paddr;
+	unsigned long size;
+
+	if (!(dst_vma->vm_flags & VM_PAT))
+		return;
+
+	/*
+	 * As the page tables might not have been copied yet, the PAT
+	 * information is obtained from the src VMA, just like during
+	 * track_pfn_copy().
+	 */
+	if (get_pat_info(src_vma, &paddr, NULL)) {
+		size = src_vma->vm_end - src_vma->vm_start;
+		free_pfn_range(paddr, size);
 	}
 
-	return 0;
+	/*
+	 * Reservation was freed, any copied page tables will get cleaned
+	 * up later, but without getting PAT involved again.
+	 */
+	vm_flags_clear(dst_vma, VM_PAT);
 }
 
 /*
@@ -1095,15 +1122,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	}
 }
 
-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region,  with the old vma after
- * its pfnmap page table has been removed.  The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */
 void untrack_pfn_clear(struct vm_area_struct *vma)
 {
 	vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d..acf387d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1508,15 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 }
 
 /*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range().
  */
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
 	return 0;
 }
 
 /*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+}
+
+/*
  * untrack_pfn is called while unmapping a pfnmap for a region.
  * untrack can be called for a specific region indicated by pfn and size or
  * can be for the entire vma (in which case pfn, size are zero).
@@ -1528,8 +1538,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
 }
 
 /*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VMA after the page tables were moved.
+ * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
  */
 static inline void untrack_pfn_clear(struct vm_area_struct *vma)
 {
@@ -1540,7 +1552,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long size);
 extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 			     pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 			unsigned long size, bool mm_wr_locked);
 extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a..ca2ca38 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
+	/* track_pfn_copy() will later take care of copying internal state. */
+	if (unlikely(new->vm_flags & VM_PFNMAP))
+		untrack_pfn_clear(new);
+
 	return new;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7..890333c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,11 +1379,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
 	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		/*
-		 * We do not free on error cases below as remove_vma
-		 * gets called on error from higher level routine
-		 */
-		ret = track_pfn_copy(src_vma);
+		ret = track_pfn_copy(dst_vma, src_vma);
 		if (ret)
 			return ret;
 	}
@@ -1420,7 +1416,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 			continue;
 		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
 					    addr, next))) {
-			untrack_pfn_clear(dst_vma);
 			ret = -ENOMEM;
 			break;
 		}
@@ -1430,6 +1425,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		raw_write_seqcount_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
 	}
+	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+		untrack_pfn_copy(dst_vma, src_vma);
 	return ret;
 }
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 10 months, 3 weeks ago
> +void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> +		struct vm_area_struct *src_vma)
> +{
> +	resource_size_t paddr;
> +	unsigned long size;
> +
> +	if (!(dst_vma->vm_flags & VM_PAT))
> +		return;
> +
> +	/*
> +	 * As the page tables might not have been copied yet, the PAT
> +	 * information is obtained from the src VMA, just like during
> +	 * track_pfn_copy().
> +	 */
> +	if (get_pat_info(src_vma, &paddr, NULL)) {
> +		size = src_vma->vm_end - src_vma->vm_start;
> +		free_pfn_range(paddr, size);
>   	}
>   

@Ingo, can you drop this patch for now? It's supposed to be
"!get_pat_info" here, and I want to re-verify, now that a couple of
months have passed, whether it's all working as expected with that.

(we could actually complain if get_pat_info() fails at this point,
let me think about that)
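
To spell out the inverted check: judging by how track_pfn_copy() uses it,
get_pat_info() returns 0 on success and an error code otherwise, so the
posted "if (get_pat_info(...))" only enters the freeing branch when the
lookup failed. A minimal userspace sketch, where model_lookup() and the two
wrappers are hypothetical stand-ins and not kernel functions:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lookup that returns 0 on success. */
static int model_lookup(bool mapped, unsigned long *paddr)
{
	assert(paddr != NULL);
	if (!mapped)
		return -EINVAL;
	*paddr = 0x1000; /* arbitrary illustrative address */
	return 0;
}

/* As posted: the branch is taken only when the lookup FAILED. */
static bool frees_as_posted(bool mapped)
{
	unsigned long paddr = 0;
	bool freed = false;

	if (model_lookup(mapped, &paddr))
		freed = true; /* would free using a paddr that was never set */
	return freed;
}

/* With the negation: the branch is taken only when the lookup succeeded. */
static bool frees_with_negation(bool mapped)
{
	unsigned long paddr = 0;
	bool freed = false;

	if (!model_lookup(mapped, &paddr))
		freed = true;
	return freed;
}
```

With a successful lookup, only the negated variant reaches the freeing
branch; the posted variant skips it and leaves the reservation in place.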

I'll resend once I get to it. Thanks!

-- 
Cheers,

David / dhildenb
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Ingo Molnar 10 months, 3 weeks ago
* David Hildenbrand <david@redhat.com> wrote:

> > +void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > +		struct vm_area_struct *src_vma)
> > +{
> > +	resource_size_t paddr;
> > +	unsigned long size;
> > +
> > +	if (!(dst_vma->vm_flags & VM_PAT))
> > +		return;
> > +
> > +	/*
> > +	 * As the page tables might not have been copied yet, the PAT
> > +	 * information is obtained from the src VMA, just like during
> > +	 * track_pfn_copy().
> > +	 */
> > +	if (get_pat_info(src_vma, &paddr, NULL)) {
> > +		size = src_vma->vm_end - src_vma->vm_start;
> > +		free_pfn_range(paddr, size);
> >   	}
> 
> @Ingo, can you drop this patch for now?

Done.

> I'll resend once I get to it. Thanks!

Great, thanks!

	Ingo
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 10 months, 3 weeks ago
On 19.03.25 12:01, Ingo Molnar wrote:
> 
> * David Hildenbrand <david@redhat.com> wrote:
> 
>>> +void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>> +		struct vm_area_struct *src_vma)
>>> +{
>>> +	resource_size_t paddr;
>>> +	unsigned long size;
>>> +
>>> +	if (!(dst_vma->vm_flags & VM_PAT))
>>> +		return;
>>> +
>>> +	/*
>>> +	 * As the page tables might not have been copied yet, the PAT
>>> +	 * information is obtained from the src VMA, just like during
>>> +	 * track_pfn_copy().
>>> +	 */
>>> +	if (get_pat_info(src_vma, &paddr, NULL)) {
>>> +		size = src_vma->vm_end - src_vma->vm_start;
>>> +		free_pfn_range(paddr, size);
>>>    	}
>>
>> @Ingo, can you drop this patch for now?
> 
> Done.

I can resend the whole thing, or just the fixup suggested by Boris, just 
let me know.

-- 
Cheers,

David / dhildenb
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Ingo Molnar 10 months, 3 weeks ago
* David Hildenbrand <david@redhat.com> wrote:

> On 19.03.25 12:01, Ingo Molnar wrote:
> > 
> > * David Hildenbrand <david@redhat.com> wrote:
> > 
> > > > +void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > +		struct vm_area_struct *src_vma)
> > > > +{
> > > > +	resource_size_t paddr;
> > > > +	unsigned long size;
> > > > +
> > > > +	if (!(dst_vma->vm_flags & VM_PAT))
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * As the page tables might not have been copied yet, the PAT
> > > > +	 * information is obtained from the src VMA, just like during
> > > > +	 * track_pfn_copy().
> > > > +	 */
> > > > +	if (get_pat_info(src_vma, &paddr, NULL)) {
> > > > +		size = src_vma->vm_end - src_vma->vm_start;
> > > > +		free_pfn_range(paddr, size);
> > > >    	}
> > > 
> > > @Ingo, can you drop this patch for now?
> > 
> > Done.
> 
> I can resend the whole thing, or just the fixup suggested by Boris, just let
> me know.

Please do, thanks!

Thanks,

	Ingo
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Borislav Petkov 10 months, 3 weeks ago
On Wed, Mar 19, 2025 at 09:15:25AM +0100, David Hildenbrand wrote:
> @Ingo, can you drop this patch for now? It's supposed to be "!get_pat_info"
> here, and I want to re-verify, now that a couple of months have passed, whether
> it's all working as expected with that.
> 
> (we could actually complain if get_pat_info() fails at this point, let
> me think about that)
> 
> I'll resend once I get to it. Thanks!

That patch is deep into the x86/mm branch. We could

- rebase: not good, especially one week before the merge window

- send a revert: probably better along with an explanation why we're reverting

- do a small fix which disables it on top

- fix it properly: probably best! :-)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 10 months, 3 weeks ago
On 19.03.25 10:53, Borislav Petkov wrote:
> On Wed, Mar 19, 2025 at 09:15:25AM +0100, David Hildenbrand wrote:
>> @Ingo, can you drop this patch for now? It's supposed to be "!get_pat_info"
>> here, and I want to re-verify, now that a couple of months have passed, whether
>> it's all working as expected with that.
>>
>> (we could actually complain if get_pat_info() fails at this point, let
>> me think about that)
>>
>> I'll resend once I get to it. Thanks!
> 
> That patch is deep into the x86/mm branch. We could
> 
> - rebase: not good, especially one week before the merge window
> 
> - send a revert: probably better along with an explanation why we're reverting
> 
> - do a small fix which disables it on top
> 
> - fix it properly: probably best! :-)

Ahh, the commit id is already supposed to be stable, got it.

I'm currently testing the following as a fix; it avoids the second lookup completely.

 From 0f42e29d5ba413affa2494f9cbbdf7b5b6b4ae2e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 18 Oct 2024 12:44:59 +0200
Subject: [PATCH v1] x86/mm/pat: fix error handling in untrack_pfn_copy()

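We perform another get_pat_info() to lookup the PFN, but we
accidentally check its return value the wrong way around: the
reservation is only freed if the lookup fails, where the condition
should have been "!get_pat_info(...)".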

Let's fix it by just avoiding another get_pat_info() completely,
simplifying untrack_pfn_copy() to simply call untrack_pfn() with the pfn
obtained through track_pfn_copy().
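
The hand-off can be pictured with a small userspace sketch. Everything in
it is hypothetical (struct model_vma, model_resv, the model_* helpers, the
src_paddr parameter that replaces the page table lookup, and the 4 KiB page
size); it only shows the tracking side reporting the pfn through an
out-parameter so that the undo side needs no second lookup.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_vma {
	unsigned long start, end;
	bool pat; /* stands in for the VM_PAT flag */
};

struct model_reservation {
	unsigned long pfn, size;
	bool active;
};

static struct model_reservation model_resv;

/* Reserve based on src and report the pfn to the caller on success. */
static int model_track_copy(struct model_vma *dst, const struct model_vma *src,
			    unsigned long src_paddr, unsigned long *pfn)
{
	if (!src->pat)
		return 0;
	model_resv.pfn = src_paddr >> MODEL_PAGE_SHIFT;
	model_resv.size = src->end - src->start;
	model_resv.active = true;
	dst->pat = true;
	*pfn = model_resv.pfn;
	return 0;
}

/* Undo using the pfn handed back earlier; no lookup is repeated. */
static void model_untrack_copy(struct model_vma *dst, unsigned long pfn)
{
	if (!dst->pat)
		return;
	assert(model_resv.active && model_resv.pfn == pfn);
	model_resv.active = false;
	dst->pat = false;
}
```

The caller keeps the pfn in a local variable between the two calls, which
is what copy_page_range() does with the new "pfn" variable in the diff
below.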

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/1d5de3d6-999b-47ca-8d43-22703b8442bc@stanley.mountain
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ma Wupeng <mawupeng1@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
  arch/x86/mm/pat/memtype.c | 32 ++++----------------------------
  include/linux/pgtable.h   | 23 ++++++++++++++++++-----
  mm/memory.c               |  6 +++---
  3 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 3a9e6dd58e2f0..dc5c8e6e3001e 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -985,7 +985,7 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
  }
  
  int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma)
+		struct vm_area_struct *src_vma, unsigned long *pfn)
  {
  	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
  	resource_size_t paddr;
@@ -1002,36 +1002,12 @@ int track_pfn_copy(struct vm_area_struct *dst_vma,
  	if (get_pat_info(src_vma, &paddr, &pgprot))
  		return -EINVAL;
  	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
-	if (!rc)
+	if (!rc) {
  		/* Reservation for the destination VMA succeeded. */
  		vm_flags_set(dst_vma, VM_PAT);
-	return rc;
-}
-
-void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma)
-{
-	resource_size_t paddr;
-	unsigned long size;
-
-	if (!(dst_vma->vm_flags & VM_PAT))
-		return;
-
-	/*
-	 * As the page tables might not have been copied yet, the PAT
-	 * information is obtained from the src VMA, just like during
-	 * track_pfn_copy().
-	 */
-	if (get_pat_info(src_vma, &paddr, NULL)) {
-		size = src_vma->vm_end - src_vma->vm_start;
-		free_pfn_range(paddr, size);
+		*pfn = PHYS_PFN(paddr);
  	}
-
-	/*
-	 * Reservation was freed, any copied page tables will get cleaned
-	 * up later, but without getting PAT involved again.
-	 */
-	vm_flags_clear(dst_vma, VM_PAT);
+	return rc;
  }
  
  /*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index acf387d199d7b..97f8afccfec76 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1509,10 +1509,11 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
  
  /*
   * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
- * tables copied during copy_page_range().
+ * tables copied during copy_page_range(). Returns the pfn to be passed to
+ * untrack_pfn_copy() if anything goes wrong while copying page tables.
   */
  static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma)
+		struct vm_area_struct *src_vma, unsigned long *pfn)
  {
  	return 0;
  }
@@ -1553,14 +1554,26 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
  			     pfn_t pfn);
  extern int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma);
-extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma);
+		struct vm_area_struct *src_vma, unsigned long *pfn);
  extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
  			unsigned long size, bool mm_wr_locked);
  extern void untrack_pfn_clear(struct vm_area_struct *vma);
  #endif
  
+/*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		unsigned long pfn)
+{
+	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
+	/*
+	 * Reservation was freed, any copied page tables will get cleaned
+	 * up later, but without getting PAT involved again.
+	 */
+}
+
  #ifdef CONFIG_MMU
  #ifdef __HAVE_COLOR_ZERO_PAGE
  static inline int is_zero_pfn(unsigned long pfn)
diff --git a/mm/memory.c b/mm/memory.c
index e4b6e599c34d8..dc8efa1358e94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1362,12 +1362,12 @@ int
  copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  {
  	pgd_t *src_pgd, *dst_pgd;
-	unsigned long next;
  	unsigned long addr = src_vma->vm_start;
  	unsigned long end = src_vma->vm_end;
  	struct mm_struct *dst_mm = dst_vma->vm_mm;
  	struct mm_struct *src_mm = src_vma->vm_mm;
  	struct mmu_notifier_range range;
+	unsigned long next, pfn;
  	bool is_cow;
  	int ret;
  
@@ -1378,7 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
  
  	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		ret = track_pfn_copy(dst_vma, src_vma);
+		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
  		if (ret)
  			return ret;
  	}
@@ -1425,7 +1425,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  		mmu_notifier_invalidate_range_end(&range);
  	}
  	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_copy(dst_vma, src_vma);
+		untrack_pfn_copy(dst_vma, pfn);
  	return ret;
  }
  
-- 
2.48.1


-- 
Cheers,

David / dhildenb
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Borislav Petkov 10 months, 3 weeks ago
On Wed, Mar 19, 2025 at 11:16:36AM +0100, David Hildenbrand wrote:
> Ahh, the commit id is already supposed to be stable, got it.

Yap, we try to avoid rebasing when it becomes really hairy and the commits
have been stable and out there for a while...

> I'm currently testing the following as a fix; it avoids the second lookup
> completely.

Cool, please holler asap what happens so that we can act accordingly.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by David Hildenbrand 10 months, 3 weeks ago
On 19.03.25 11:24, Borislav Petkov wrote:
> On Wed, Mar 19, 2025 at 11:16:36AM +0100, David Hildenbrand wrote:
>> Ahh, the commit id is already supposed to be stable, got it.
> 
> Yap, we try to avoid rebasing when it becomes really hairy and the commits
> have been stable and out there for a while...
> 
>> I'm currently testing the following as a fix; it avoids the second lookup
>> completely.
> 
> Cool, please holler asap what happens so that we can act accordingly.

Yes, expect it later today -- have to refresh my brain how I managed to 
reproduce the original issue.

-- 
Cheers,

David / dhildenb
Re: [tip: x86/mm] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Posted by Borislav Petkov 10 months, 3 weeks ago
On Wed, Mar 19, 2025 at 11:27:19AM +0100, David Hildenbrand wrote:
> Yes, expect it later today

Thanks!

> -- have to refresh my brain how I managed to reproduce the original issue.

Tell me about it. :-\

I have a big fat mostly enlarging and seldom collapsing text file called
todo.txt.

:-P

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette