mm: Fix mmap_assert_locked() in follow_pte()

[PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by Pei Li 1 year, 7 months ago

This patch fixes this warning by acquiring read lock before entering
untrack_pfn() while write lock is not held.

syzbot has tested the proposed patch and the reproducer did not
trigger any issue.

Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
---
Syzbot reported the following warning in follow_pte():

WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980

This is because we are assuming that mm->mmap_lock should be held when
entering follow_pte(). This is added in commit c5541ba378e3 (mm:
follow_pte() improvements).

However, in the following call stack, we are not acquring the lock:
 follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
 get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
 untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
 unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
 zap_page_range_single+0x326/0x560 mm/memory.c:1920

In zap_page_range_single(), we passed mm_wr_locked as false, as we do
not expect write lock to be held.
In the special case where vma->vm_flags is set as VM_PFNMAP, we are
hitting untrack_pfn() which eventually calls into follow_phys.

This patch fixes this warning by acquiring read lock before entering
untrack_pfn() while write lock is not held. 

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Tested on:

commit:         9d9a2f29 Merge tag 'mm-hotfixes-stable-2024-07-10-13-1..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13be8021980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3456bae478301dc8
dashboard link: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
patch:          https://syzkaller.appspot.com/x/patch.diff?x=145e3441980000

Note: testing is done by a robot and is best-effort only.
---
 mm/memory.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index d10e616d7389..75d7959b835b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (vma->vm_file)
 		uprobe_munmap(vma, start, end);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
+	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
+		if (!mm_wr_locked)
+			mmap_read_lock(vma->vm_mm);
+
 		untrack_pfn(vma, 0, 0, mm_wr_locked);
 
+		if (!mm_wr_locked)
+			mmap_read_unlock(vma->vm_mm);
+	}
+
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
 			/*

---
base-commit: 734610514cb0234763cc97ddbd235b7981889445
change-id: 20240710-bug12-fecfe4bb0dcb

Best regards,
-- 
Pei Li <peili.dev@gmail.com>

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by David Hildenbrand 1 year, 7 months ago

On 11.07.24 07:13, Pei Li wrote:
> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not
> trigger any issue.
> 
> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> Signed-off-by: Pei Li <peili.dev@gmail.com>
> ---
> Syzbot reported the following warning in follow_pte():
> 
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
> 
> This is because we are assuming that mm->mmap_lock should be held when
> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
> follow_pte() improvements).
> 
> However, in the following call stack, we are not acquring the lock:
>   follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>   get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>   untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>   unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>   zap_page_range_single+0x326/0x560 mm/memory.c:1920

That implies that unmap_vmas() is called without the mmap lock in read 
mode, correct?

Do we know how this happens?

* exit_mmap() holds the mmap lock in read mode
* unmap_region is documented to hold the mmap lock in read mode

> 
> In zap_page_range_single(), we passed mm_wr_locked as false, as we do
> not expect write lock to be held.
> In the special case where vma->vm_flags is set as VM_PFNMAP, we are
> hitting untrack_pfn() which eventually calls into follow_phys.
> 
> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not trigger any issue:
> 
> Tested on:
> 
> commit:         9d9a2f29 Merge tag 'mm-hotfixes-stable-2024-07-10-13-1..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=13be8021980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3456bae478301dc8
> dashboard link: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
> compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> patch:          https://syzkaller.appspot.com/x/patch.diff?x=145e3441980000
> 
> Note: testing is done by a robot and is best-effort only.
> ---
>   mm/memory.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index d10e616d7389..75d7959b835b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   	if (vma->vm_file)
>   		uprobe_munmap(vma, start, end);
>   
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> +	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
> +		if (!mm_wr_locked)
> +			mmap_read_lock(vma->vm_mm);
> +
>   		untrack_pfn(vma, 0, 0, mm_wr_locked);
>   
> +		if (!mm_wr_locked)
> +			mmap_read_unlock(vma->vm_mm);
> +	}
> +
>   	if (start != end) {
>   		if (unlikely(is_vm_hugetlb_page(vma))) {

I'm not sure if this is the right fix. I like to understand how we end 
up without the mmap lock at least in read mode in that path?

-- 
Cheers,

David / dhildenb

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by Sergey Senozhatsky 1 year, 7 months ago

On (24/07/11 23:33), David Hildenbrand wrote:
[..]
> > @@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >   	if (vma->vm_file)
> >   		uprobe_munmap(vma, start, end);
> > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > +	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
> > +		if (!mm_wr_locked)
> > +			mmap_read_lock(vma->vm_mm);
> > +
> >   		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > +		if (!mm_wr_locked)
> > +			mmap_read_unlock(vma->vm_mm);
> > +	}
> > +
> >   	if (start != end) {
> >   		if (unlikely(is_vm_hugetlb_page(vma))) {
>
> I'm not sure if this is the right fix. I like to understand how we end up
> without the mmap lock at least in read mode in that path?

I suspect this is causing a deadlock:

[   10.263161] ============================================
[   10.263165] WARNING: possible recursive locking detected
[   10.263170] 6.10.0-rc7-next-20240712+ #645 Tainted: G                 N
[   10.263177] --------------------------------------------
[   10.263179] (direxec)/166 is trying to acquire lock:
[   10.263184] ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock+0x12/0x40
[   10.263217]
[   10.263217] but task is already holding lock:
[   10.263219] ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: exit_mmap+0x9c/0x830
[   10.263238]
[   10.263238] other info that might help us debug this:
[   10.263241]  Possible unsafe locking scenario:
[   10.263241]
[   10.263243]        CPU0
[   10.263245]        ----
[   10.263247]   lock(&mm->mmap_lock);
[   10.263252]   lock(&mm->mmap_lock);
[   10.263257]
[   10.263257]  *** DEADLOCK ***
[   10.263257]
[   10.263259]  May be due to missing lock nesting notation
[   10.263259]
[   10.263262] 3 locks held by (direxec)/166:
[   10.263267]  #0: ffff88810b4e8548 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x70/0x1110
[   10.263286]  #1: ffff88810b4e85e0 (&sig->exec_update_lock){+.+.}-{3:3}, at: exec_mmap+0x9f/0x510
[   10.263302]  #2: ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: exit_mmap+0x9c/0x830
[   10.263318]
[   10.263318] stack backtrace:
[   10.263329] CPU: 6 UID: 0 PID: 166 Comm: (direxec) Tainted: G                 N 6.10.0-rc7-next-20240712+ #645
[   10.263340] Tainted: [N]=TEST
[   10.263349] Call Trace:
[   10.263355]  <TASK>
[   10.263360]  dump_stack_lvl+0xa3/0xeb
[   10.263375]  print_deadlock_bug+0x4d5/0x680
[   10.263387]  __lock_acquire+0x65fb/0x7830
[   10.263408]  ? lock_is_held_type+0xdd/0x150
[   10.263425]  lock_acquire+0x14c/0x3e0
[   10.263433]  ? mmap_read_lock+0x12/0x40
[   10.263445]  ? lock_is_held_type+0xdd/0x150
[   10.263454]  down_read+0x58/0x9a0
[   10.263461]  ? mmap_read_lock+0x12/0x40
[   10.263476]  mmap_read_lock+0x12/0x40
[   10.263485]  unmap_single_vma+0x1bf/0x240
[   10.263497]  unmap_vmas+0x146/0x1c0
[   10.263511]  exit_mmap+0x13d/0x830
[   10.263533]  __mmput+0xc2/0x2c0
[   10.263556]  exec_mmap+0x4cb/0x510
[   10.263580]  begin_new_exec+0xfe6/0x1ba0
[   10.263612]  load_elf_binary+0x797/0x22a0
[   10.263637]  ? load_misc_binary+0x53a/0x930
[   10.263656]  ? lock_release+0x50f/0x830
[   10.263673]  ? bprm_execve+0x6d7/0x1110
[   10.263693]  bprm_execve+0x70d/0x1110
[   10.263730]  do_execveat_common+0x44b/0x600
[   10.263745]  __x64_sys_execve+0x8e/0xa0
[   10.263754]  do_syscall_64+0x71/0x110
[   10.263764]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by David Hildenbrand 1 year, 7 months ago

On 11.07.24 23:33, David Hildenbrand wrote:
> On 11.07.24 07:13, Pei Li wrote:
>> This patch fixes this warning by acquiring read lock before entering
>> untrack_pfn() while write lock is not held.
>>
>> syzbot has tested the proposed patch and the reproducer did not
>> trigger any issue.
>>
>> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
>> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>> Signed-off-by: Pei Li <peili.dev@gmail.com>
>> ---
>> Syzbot reported the following warning in follow_pte():
>>
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
>>
>> This is because we are assuming that mm->mmap_lock should be held when
>> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
>> follow_pte() improvements).
>>
>> However, in the following call stack, we are not acquring the lock:
>>    follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>>    get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>>    untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>>    unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>>    zap_page_range_single+0x326/0x560 mm/memory.c:1920
> 
> That implies that unmap_vmas() is called without the mmap lock in read
> mode, correct?
> 
> Do we know how this happens?
> 
> * exit_mmap() holds the mmap lock in read mode
> * unmap_region is documented to hold the mmap lock in read mode

I think this is it (missed the call from zap_page_range_single()):

  follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
  get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
  untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
  unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
  zap_page_range_single+0x326/0x560 mm/memory.c:1920
  unmap_mapping_range_vma mm/memory.c:3684 [inline]
  unmap_mapping_range_tree mm/memory.c:3701 [inline]
  unmap_mapping_pages mm/memory.c:3767 [inline]
  unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
  truncate_pagecache+0x53/0x90 mm/truncate.c:731
  simple_setattr+0xf2/0x120 fs/libfs.c:886
  notify_change+0xec6/0x11f0 fs/attr.c:499
  do_truncate+0x15c/0x220 fs/open.c:65
  handle_truncate fs/namei.c:3308 [inline]

I think Peter recently questioned whether untrack_pfn() should be even 
called from the place, but I might misremember things.

Fix should work (I suspect we are not violating some locking rules?), 
PFNMAP should not happen there too often that we really care.

If everything fails, we could drop the assert, but I'm hoping we can 
avoid that. We really want most users of follow_pte() to do the right thing.

-- 
Cheers,

David / dhildenb

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by David Hildenbrand 1 year, 7 months ago

On 11.07.24 23:45, David Hildenbrand wrote:
> On 11.07.24 23:33, David Hildenbrand wrote:
>> On 11.07.24 07:13, Pei Li wrote:
>>> This patch fixes this warning by acquiring read lock before entering
>>> untrack_pfn() while write lock is not held.
>>>
>>> syzbot has tested the proposed patch and the reproducer did not
>>> trigger any issue.
>>>
>>> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>>> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
>>> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>>> Signed-off-by: Pei Li <peili.dev@gmail.com>
>>> ---
>>> Syzbot reported the following warning in follow_pte():
>>>
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
>>>
>>> This is because we are assuming that mm->mmap_lock should be held when
>>> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
>>> follow_pte() improvements).
>>>
>>> However, in the following call stack, we are not acquring the lock:
>>>     follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>>>     get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>>>     untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>>>     unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>>>     zap_page_range_single+0x326/0x560 mm/memory.c:1920
>>
>> That implies that unmap_vmas() is called without the mmap lock in read
>> mode, correct?
>>
>> Do we know how this happens?
>>
>> * exit_mmap() holds the mmap lock in read mode
>> * unmap_region is documented to hold the mmap lock in read mode
> 
> I think this is it (missed the call from zap_page_range_single()):
> 
>    follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>    get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>    untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>    unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>    zap_page_range_single+0x326/0x560 mm/memory.c:1920
>    unmap_mapping_range_vma mm/memory.c:3684 [inline]
>    unmap_mapping_range_tree mm/memory.c:3701 [inline]
>    unmap_mapping_pages mm/memory.c:3767 [inline]
>    unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
>    truncate_pagecache+0x53/0x90 mm/truncate.c:731
>    simple_setattr+0xf2/0x120 fs/libfs.c:886
>    notify_change+0xec6/0x11f0 fs/attr.c:499
>    do_truncate+0x15c/0x220 fs/open.c:65
>    handle_truncate fs/namei.c:3308 [inline]
> 
> I think Peter recently questioned whether untrack_pfn() should be even
> called from the place, but I might misremember things.
> 
> Fix should work (I suspect we are not violating some locking rules?),
> PFNMAP should not happen there too often that we really care.

... thinking again, likely we reach this point with "!mm_wr_locked" and 
the mmap lock already held in read mode. So I suspect the fix won't work 
as is.

-- 
Cheers,

David / dhildenb

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by Peter Xu 1 year, 7 months ago

On Thu, Jul 11, 2024 at 11:51:41PM +0200, David Hildenbrand wrote:
> On 11.07.24 23:45, David Hildenbrand wrote:
> > On 11.07.24 23:33, David Hildenbrand wrote:
> > > On 11.07.24 07:13, Pei Li wrote:
> > > > This patch fixes this warning by acquiring read lock before entering
> > > > untrack_pfn() while write lock is not held.
> > > > 
> > > > syzbot has tested the proposed patch and the reproducer did not
> > > > trigger any issue.
> > > > 
> > > > Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> > > > Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
> > > > Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> > > > Signed-off-by: Pei Li <peili.dev@gmail.com>
> > > > ---
> > > > Syzbot reported the following warning in follow_pte():
> > > > 
> > > > WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
> > > > WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
> > > > WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
> > > > 
> > > > This is because we are assuming that mm->mmap_lock should be held when
> > > > entering follow_pte(). This is added in commit c5541ba378e3 (mm:
> > > > follow_pte() improvements).
> > > > 
> > > > However, in the following call stack, we are not acquring the lock:
> > > >     follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
> > > >     get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
> > > >     untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
> > > >     unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
> > > >     zap_page_range_single+0x326/0x560 mm/memory.c:1920
> > > 
> > > That implies that unmap_vmas() is called without the mmap lock in read
> > > mode, correct?
> > > 
> > > Do we know how this happens?
> > > 
> > > * exit_mmap() holds the mmap lock in read mode
> > > * unmap_region is documented to hold the mmap lock in read mode
> > 
> > I think this is it (missed the call from zap_page_range_single()):
> > 
> >    follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
> >    get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
> >    untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
> >    unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
> >    zap_page_range_single+0x326/0x560 mm/memory.c:1920
> >    unmap_mapping_range_vma mm/memory.c:3684 [inline]
> >    unmap_mapping_range_tree mm/memory.c:3701 [inline]
> >    unmap_mapping_pages mm/memory.c:3767 [inline]
> >    unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
> >    truncate_pagecache+0x53/0x90 mm/truncate.c:731
> >    simple_setattr+0xf2/0x120 fs/libfs.c:886
> >    notify_change+0xec6/0x11f0 fs/attr.c:499
> >    do_truncate+0x15c/0x220 fs/open.c:65
> >    handle_truncate fs/namei.c:3308 [inline]
> > 
> > I think Peter recently questioned whether untrack_pfn() should be even
> > called from the place, but I might misremember things.
> > 
> > Fix should work (I suspect we are not violating some locking rules?),
> > PFNMAP should not happen there too often that we really care.
> 
> ... thinking again, likely we reach this point with "!mm_wr_locked" and the
> mmap lock already held in read mode. So I suspect the fix won't work as is.

Ah yes, I had one rfc patch for that, I temporarily put that aside as it
seemed nobody cared except myself.. it's here:

https://lore.kernel.org/all/20240523223745.395337-2-peterx@redhat.com

I didn't know it can already cause real trouble.  It looks like that patch
should fix this.

Thanks,

-- 
Peter Xu

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by David Wang 1 year, 7 months ago

Hi,

> Ah yes, I had one rfc patch for that, I temporarily put that aside as it
> seemed nobody cared except myself.. it's here:
> 
> https://lore.kernel.org/all/20240523223745.395337-2-peterx@redhat.com
> 
> I didn't know it can already cause real trouble.  It looks like that patch
> should fix this.
> 
> Thanks,
> 
> -- 
> Peter Xu

Just add another user scenario concering this kernel warning.
Ever since 6.10-rc1, when I suspend my system via `systemctl suspend`, nvidia gpu driver would trigger a warning:

             	 Call Trace:
             	  <TASK>
             	  ? __warn+0x7c/0x120
             	  ? follow_pte+0x15b/0x170
             	  ? report_bug+0x18d/0x1c0
             	  ? handle_bug+0x3c/0x80
             	  ? exc_invalid_op+0x13/0x60
             	  ? asm_exc_invalid_op+0x16/0x20
             	  ? follow_pte+0x15b/0x170
             	  follow_phys+0x3a/0xf0
             	  untrack_pfn+0x53/0x120
             	  unmap_single_vma+0xa6/0xe0
             	  zap_page_range_single+0xe4/0x190
             	  ? _nv002569kms+0x17b/0x210 [nvidia_modeset]
             	  ? srso_return_thunk+0x5/0x5f
             	  ? kfree+0x257/0x290
             	  unmap_mapping_range+0x10d/0x130
             	  nv_revoke_gpu_mappings_locked+0x43/0x70 [nvidia]
             	  nv_set_system_power_state+0x1c9/0x470 [nvidia]
             	  nv_procfs_write_suspend+0xd3/0x140 [nvidia]
             	  proc_reg_write+0x58/0xa0
             	  ? srso_return_thunk+0x5/0x5f
             	  vfs_write+0xf6/0x440
             	  ? __count_memcg_events+0x73/0x110
             	  ? srso_return_thunk+0x5/0x5f
             	  ? count_memcg_events.constprop.0+0x1a/0x30
             	  ? srso_return_thunk+0x5/0x5f
             	  ? handle_mm_fault+0xa9/0x2d0
             	  ? srso_return_thunk+0x5/0x5f
             	  ? preempt_count_add+0x47/0xa0
             	  ksys_write+0x63/0xe0
             	  do_syscall_64+0x4b/0x110
             	  entry_SYSCALL_64_after_hwframe+0x76/0x7e
             	 RIP: 0033:0x7f34a3914240
             	 Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
             	 RSP: 002b:00007ffca2aa2688 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
             	 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f34a3914240
             	 RDX: 0000000000000008 RSI: 000055a02968ed80 RDI: 0000000000000001
             	 RBP: 000055a02968ed80 R08: 00007f34a39eecd0 R09: 00007f34a39eecd0
             	 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008
             	 R13: 00007f34a39ef760 R14: 0000000000000008 R15: 00007f34a39ea9e0
             	  </TASK>
             	 ---[ end trace 0000000000000000 ]---
             	 PM: suspend entry (deep)

Considering out-of-tree nature of nvidia gpu driver, and nobody reported this kernel warning before with in-trees,
 I had almost convinced myself that nvidia driver may need "big" rework to live with those "PTE" changes.
So glad to see this thread of discussion/issue/fix now, I have been patching my system manually ever since 6.10-rc1,
hope things got fixed soon...


FYI
David

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by Peter Xu 1 year, 7 months ago

On Fri, Jul 12, 2024 at 09:19:31PM +0800, David Wang wrote:
> Hi,
> 
> > Ah yes, I had one rfc patch for that, I temporarily put that aside as it
> > seemed nobody cared except myself.. it's here:
> > 
> > https://lore.kernel.org/all/20240523223745.395337-2-peterx@redhat.com
> > 
> > I didn't know it can already cause real trouble.  It looks like that patch
> > should fix this.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> 
> Just add another user scenario concering this kernel warning.
> Ever since 6.10-rc1, when I suspend my system via `systemctl suspend`, nvidia gpu driver would trigger a warning:
> 
>              	 Call Trace:
>              	  <TASK>
>              	  ? __warn+0x7c/0x120
>              	  ? follow_pte+0x15b/0x170
>              	  ? report_bug+0x18d/0x1c0
>              	  ? handle_bug+0x3c/0x80
>              	  ? exc_invalid_op+0x13/0x60
>              	  ? asm_exc_invalid_op+0x16/0x20
>              	  ? follow_pte+0x15b/0x170
>              	  follow_phys+0x3a/0xf0
>              	  untrack_pfn+0x53/0x120
>              	  unmap_single_vma+0xa6/0xe0
>              	  zap_page_range_single+0xe4/0x190
>              	  ? _nv002569kms+0x17b/0x210 [nvidia_modeset]
>              	  ? srso_return_thunk+0x5/0x5f
>              	  ? kfree+0x257/0x290
>              	  unmap_mapping_range+0x10d/0x130
>              	  nv_revoke_gpu_mappings_locked+0x43/0x70 [nvidia]
>              	  nv_set_system_power_state+0x1c9/0x470 [nvidia]
>              	  nv_procfs_write_suspend+0xd3/0x140 [nvidia]
>              	  proc_reg_write+0x58/0xa0
>              	  ? srso_return_thunk+0x5/0x5f
>              	  vfs_write+0xf6/0x440
>              	  ? __count_memcg_events+0x73/0x110
>              	  ? srso_return_thunk+0x5/0x5f
>              	  ? count_memcg_events.constprop.0+0x1a/0x30
>              	  ? srso_return_thunk+0x5/0x5f
>              	  ? handle_mm_fault+0xa9/0x2d0
>              	  ? srso_return_thunk+0x5/0x5f
>              	  ? preempt_count_add+0x47/0xa0
>              	  ksys_write+0x63/0xe0
>              	  do_syscall_64+0x4b/0x110
>              	  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>              	 RIP: 0033:0x7f34a3914240
>              	 Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
>              	 RSP: 002b:00007ffca2aa2688 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
>              	 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f34a3914240
>              	 RDX: 0000000000000008 RSI: 000055a02968ed80 RDI: 0000000000000001
>              	 RBP: 000055a02968ed80 R08: 00007f34a39eecd0 R09: 00007f34a39eecd0
>              	 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008
>              	 R13: 00007f34a39ef760 R14: 0000000000000008 R15: 00007f34a39ea9e0
>              	  </TASK>
>              	 ---[ end trace 0000000000000000 ]---
>              	 PM: suspend entry (deep)
> 
> Considering out-of-tree nature of nvidia gpu driver, and nobody reported this kernel warning before with in-trees,
>  I had almost convinced myself that nvidia driver may need "big" rework to live with those "PTE" changes.
> So glad to see this thread of discussion/issue/fix now, I have been patching my system manually ever since 6.10-rc1,
> hope things got fixed soon...

Yep this is a similar file truncation path.  I'll repost my previous rfc
patch separately soon.

Thanks,

-- 
Peter Xu

Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()

Posted by Andrew Morton 1 year, 7 months ago

On Wed, 10 Jul 2024 22:13:17 -0700 Pei Li <peili.dev@gmail.com> wrote:

> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not
> trigger any issue.
> 

Thanks.

> ---
> Syzbot reported the following warning in follow_pte():
> 
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
> 
> This is because we are assuming that mm->mmap_lock should be held when
> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
> follow_pte() improvements).
> 
> However, in the following call stack, we are not acquring the lock:
>  follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>  get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>  untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>  unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>  zap_page_range_single+0x326/0x560 mm/memory.c:1920
> 
> In zap_page_range_single(), we passed mm_wr_locked as false, as we do
> not expect write lock to be held.
> In the special case where vma->vm_flags is set as VM_PFNMAP, we are
> hitting untrack_pfn() which eventually calls into follow_phys.

I included the above (very relevant) info in the changelog.

And I added 

	Fixes: c5541ba378e3 ("mm: follow_pte() improvements")

and queued the patch for 6.10-rc7.  Hopefully David can review it for us.