Based on latest mm-unstable (9ed079378408).
This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, but mainly covering safe use of huge_pte_offset().
Compared to the previous rfcv2, the major change is that I dropped the new
pgtable lock and only use the vma lock for locking. The main reason is that
I overlooked that the pgtable lock is not protected by RCU: __pmd_free_tlb()
frees the pgtable lock before e.g. call_rcu() on RCU_TABLE_FREE archs. OTOH
many of the huge_pte_offset() call sites do need to take the pgtable lock,
which means the valid users of the new RCU lock would be very limited.
It's possible that in the future we can rework pgtable freeing to only free
the pgtable lock after the RCU grace period (by moving pgtable_pmd_page_dtor()
into tlb_remove_table_rcu()); then the RCU lock would make more sense. For
now, keep it simple by fixing the races first.
Since this version attaches a reproducer (below) and also drops the RCU
(especially, the fallback irqoff) solution, the RFC tag is removed.
Old versions:
rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com
rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com
Problem
=======
huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used almost everywhere, since the walk is needed even
before taking the pgtable lock.
huge_pte_offset() is always called with the mmap lock held for either read
or write. It was assumed to be safe, but it's actually not. One race
condition can easily be triggered by: (1) first trigger pmd sharing on a
memory range, (2) do huge_pte_offset() on the range, and meanwhile (3) let
another thread unshare the pmd range; the pgtable page is then prone to be
lost if the other sharing process wants to free it completely (by either
munmap or exiting the mm).
The recent work from Mike on the vma lock can resolve most of this already.
It's achieved by forbidding pmd unsharing while the lock is taken, so there
is no further risk of the pgtable page being freed. It means that if we can
take the vma lock around all huge_pte_offset() callers, the walk will be
safe.
A bunch of callers already do so as of the latest mm-unstable, but quite a
few others don't yet, for various reasons, especially around how
huge_pte_offset() is used.
One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
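For illustration, below is a minimal sketch of the calling pattern this
series aims for. It is not code from the series; it only assumes the
hugetlb vma lock helpers from Mike's series (hugetlb_vma_lock_read() and
hugetlb_vma_unlock_read()), and the helper name itself is made up:

/*
 * Minimal sketch (not from the series): hold the hugetlb vma read lock
 * across the walk so that the PUD-ranged pgtable page cannot be unshared
 * and freed from under us.
 */
static bool hugetlb_pte_present(struct vm_area_struct *vma, unsigned long addr)
{
    struct hstate *h = hstate_vma(vma);
    bool present = false;
    pte_t *pte;

    /* Forbid pmd unshare on this vma until unlocked */
    hugetlb_vma_lock_read(vma);
    pte = huge_pte_offset(vma->vm_mm, addr & huge_page_mask(h),
                          huge_page_size(h));
    /* The returned pte_t* may only be used while the lock is held */
    if (pte)
        present = !huge_pte_none(huge_ptep_get(pte));
    hugetlb_vma_unlock_read(vma);

    return present;
}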
A reproducer of such a problem, based on hugetlb GUP (NOTE: since the race
is very hard to trigger, one needs to apply an extra kernel delay patch too,
see below):
======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE  (1UL << 30)     /* 1GB */
#define PSIZE  (2UL << 20)     /* 2MB */

#define HOLD_SEC  (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
    unsigned char *tmpbuf, *p;
    int ret;

    ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
    if (ret) {
        perror("posix_memalign() failed");
        return NULL;
    }

    tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_FIXED, fd, 0);
    if (tmpbuf == MAP_FAILED) {
        perror("mmap() failed");
        return NULL;
    }
    printf("mmap() -> %p\n", tmpbuf);

    for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
        *p = 1;
    }

    return tmpbuf;
}

void do_unmap(void *buf)
{
    munmap(buf, MSIZE);
}

void proc2(int fd)
{
    unsigned char c;

    buf = do_map(fd);
    if (!buf)
        return;

    read(pipefd[0], &c, 1);
    /*
     * This frees the shared pgtable page, causing use-after-free in
     * proc1_thread1 when soft walking hugetlb pgtable.
     */
    do_unmap(buf);

    printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
    /*
     * Trigger follow-page on 1st 2m page.  Kernel hack patch needed to
     * withhold this procedure for easier reproduce.
     */
    madvise(buf, PSIZE, MADV_POPULATE_WRITE);
    printf("Proc1-thread1 quitting\n");
    return NULL;
}

void *proc1_thread2(void *data)
{
    unsigned char c;

    /* Wait a while until proc1_thread1() start to wait */
    sleep(0.5);
    /* Trigger pmd unshare */
    madvise(buf, PSIZE, MADV_DONTNEED);
    /* Kick off proc2 to release the pgtable */
    write(pipefd[1], &c, 1);

    printf("Proc1-thread2 quitting\n");
    return NULL;
}

void proc1(int fd)
{
    pthread_t tid1, tid2;
    int ret;

    buf = do_map(fd);
    if (!buf)
        return;

    ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
    assert(ret == 0);
    ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
    assert(ret == 0);

    /* Kick the child to share the PUD entry */
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    do_unmap(buf);
}

int main(void)
{
    int fd, ret;

    fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
    if (fd < 0) {
        perror("open failed");
        return -1;
    }

    ret = ftruncate(fd, MSIZE);
    if (ret) {
        perror("ftruncate() failed");
        return -1;
    }

    ret = pipe(pipefd);
    if (ret) {
        perror("pipe() failed");
        return -1;
    }

    if (fork()) {
        proc1(fd);
    } else {
        proc2(fd);
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(fd);

    return 0;
}
======8<=======
The kernel patch needed to expose such a race so it'll trigger 100%:
======8<=======
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..f8d99dad5004 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -38,6 +38,7 @@
#include <asm/page.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
+#include <asm/delay.h>
#include <linux/io.h>
#include <linux/hugetlb.h>
@@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool unshare = false;
int absent;
struct page *page;
+ unsigned long c = 0;
/*
* If we have a pending SIGKILL, don't keep faulting pages and
@@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
huge_page_size(h));
+
+ pr_info("%s: withhold 1 sec...\n", __func__);
+ for (c = 0; c < 100; c++) {
+ udelay(10000);
+ }
+ pr_info("%s: withhold 1 sec...done\n", __func__);
+
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
======8<=======
It'll trigger use-after-free of the pgtable spinlock:
======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======
Resolution
==========
This patchset makes sure that all the huge_pte_offset() callers also take
the vma lock properly.
Patch Layout
============
Patches 1-2:  cleanups and dependencies of the follow-up patches
Patch 3:      before the fixes, document the locking requirements of huge_pte_offset()
Patches 4-9:  each patch resolves one possible race condition
Patch 10:     introduce hugetlb_walk() to replace huge_pte_offset() (a rough sketch of the idea follows below)
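For reference, below is a rough sketch of what the hugetlb_walk() wrapper
could look like. This is not the exact patch; the __vma_shareable_lock()
check and the rw_sema field name are assumptions here, and lockdep_is_held()
is only available with CONFIG_LOCKDEP:

/*
 * Rough sketch only: wrap huge_pte_offset() so that, whenever pmd sharing
 * is possible, lockdep can assert the caller holds either the hugetlb vma
 * lock or i_mmap_rwsem.
 */
static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                  unsigned long addr, unsigned long sz)
{
#ifdef CONFIG_LOCKDEP
    if (__vma_shareable_lock(vma)) {
        struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

        WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
                     !lockdep_is_held(
                         &vma->vm_file->f_mapping->i_mmap_rwsem));
    }
#endif
    return huge_pte_offset(vma->vm_mm, addr, sz);
}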
Tests
=====
The series is verified with the above reproducer, and the race can no
longer trigger. It also passes all hugetlb kselftests.
Comments welcome, thanks.
Peter Xu (10):
mm/hugetlb: Let vma_offset_start() to return start
mm/hugetlb: Don't wait for migration entry during follow page
mm/hugetlb: Document huge_pte_offset usage
mm/hugetlb: Move swap entry handling into vma lock when faulted
mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare
mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare
mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
mm/hugetlb: Make page_vma_mapped_walk() safe to pmd unshare
mm/hugetlb: Introduce hugetlb_walk()
fs/hugetlbfs/inode.c | 28 ++++++-------
fs/userfaultfd.c | 24 +++++++----
include/linux/hugetlb.h | 69 ++++++++++++++++++++++++++++++++
include/linux/rmap.h | 4 ++
include/linux/swapops.h | 6 ++-
mm/hugetlb.c | 89 ++++++++++++++++++-----------------------
mm/migrate.c | 25 ++++++++++--
mm/page_vma_mapped.c | 7 +++-
mm/pagewalk.c | 6 +--
9 files changed, 175 insertions(+), 83 deletions(-)
--
2.37.3
On 29.11.22 20:35, Peter Xu wrote: > Based on latest mm-unstable (9ed079378408). > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock > series for pmd unsharing, but majorly covering safe use of huge_pte_offset. > > Comparing to previous rfcv2, the major change is I dropped the new pgtable > lock but only use vma lock for locking. The major reason is I overlooked > that the pgtable lock was not protected by RCU: __pmd_free_tlb() frees the > pgtable lock before e.g. call_rcu() for RCU_TABLE_FREE archs. OTOH many of > the huge_pte_offset() call sites do need to take pgtable lock. It means > the valid users for the new RCU lock will be very limited. Thanks. > > It's possible that in the future we can rework the pgtable free to only > free the pgtable lock after RCU grace period (move pgtable_pmd_page_dtor() > to be within tlb_remove_table_rcu()), then the RCU lock will make more > sense. For now, make it simple by fixing the races first. Good. > > Since this version attached a reproducer (below) and also removed the RCU > (especially, the fallback irqoff) solution, removing RFC tag. Very nice, thanks. > > Old versions: > > rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com > rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com > > Problem > ======= > > huge_pte_offset() is a major helper used by hugetlb code paths to walk a > hugetlb pgtable. It's used mostly everywhere since that's needed even > before taking the pgtable lock. > > huge_pte_offset() is always called with mmap lock held with either read or > write. It was assumed to be safe but it's actually not. One race > condition can easily trigger by: (1) firstly trigger pmd share on a memory > range, (2) do huge_pte_offset() on the range, then at the meantime, (3) > another thread unshare the pmd range, and the pgtable page is prone to lost > if the other shared process wants to free it completely (by either munmap > or exit mm). So just that I understand correctly: Two processes, #A and #B, share a page table. Process #A runs two threads, #A1 and #A2. #A1 walks that shared page table (using huge_pte_offset()), for example, to resolve a page fault. Concurrently, #A2 triggers unsharing of that page table (replacing it by a private page table), for example, using munmap(). So #A1 will eventually read/write the shared page table while we're placing a private page table. Which would be fine (assuming no unsharing would be required by #A1), however, if #B also concurrently drops the reference to the shared page table (), the shared page table could essentially get freed while #A1 is still walking it. I suspect, looking at the reproducer, that the page table deconstructor was called. Will the page table also actually get freed already? IOW, could #A1 be reading/writing a freed page? > > The recent work from Mike on vma lock can resolve most of this already. > It's achieved by forbidden pmd unsharing during the lock being taken, so no > further risk of the pgtable page being freed. It means if we can take the > vma lock around all huge_pte_offset() callers it'll be safe. Agreed. > > There're already a bunch of them that we did as per the latest mm-unstable, > but also quite a few others that we didn't for various reasons especially > on huge_pte_offset() usage. > > One more thing to mention is that besides the vma lock, i_mmap_rwsem can > also be used to protect the pgtable page (along with its pgtable lock) from > being freed from under us. 
IOW, huge_pte_offset() callers need to either > hold the vma lock or i_mmap_rwsem to safely walk the pgtables. > > A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is > very hard to trigger, one needs to apply another kernel delay patch too, > see below): Thanks, David / dhildenb
On Wed, Nov 30, 2022 at 10:46:24AM +0100, David Hildenbrand wrote: > > huge_pte_offset() is always called with mmap lock held with either read or > > write. It was assumed to be safe but it's actually not. One race > > condition can easily trigger by: (1) firstly trigger pmd share on a memory > > range, (2) do huge_pte_offset() on the range, then at the meantime, (3) > > another thread unshare the pmd range, and the pgtable page is prone to lost > > if the other shared process wants to free it completely (by either munmap > > or exit mm). > > So just that I understand correctly: > > Two processes, #A and #B, share a page table. Process #A runs two threads, > #A1 and #A2. > > #A1 walks that shared page table (using huge_pte_offset()), for example, to > resolve a page fault. Concurrently, #A2 triggers unsharing of that page > table (replacing it by a private page table), Not yet replacing it, just unsharing. If the replacement happened we shouldn't trigger a bug either because huge_pte_offset() will return the private pgtable page instead. > for example, using munmap(). munmap() may not work because it needs mmap lock, so it'll wait until #A1 completes huge_pte_offset() walks and release mmap lock read. Many of other things can trigger unshare, though. In the reproducer I used MADV_DONTNEED. > > So #A1 will eventually read/write the shared page table while we're placing > a private page table. Which would be fine (assuming no unsharing would be > required by #A1), however, if #B also concurrently drops the reference to > the shared page table (), the shared page table could essentially get freed > while #A1 is still walking it. > > I suspect, looking at the reproducer, that the page table deconstructor was > called. Will the page table also actually get freed already? IOW, could #A1 > be reading/writing a freed page? If with the existing code base, I think it could. If with RCU lock, it couldn't, but still since the pgtable lock is freed even if the page is not, we'll still hit weird issues when accessing the lock. And with vma lock it should be all safe. -- Peter Xu
On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:

> [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI

Do we know which kernel versions are affected here?
On Tue, Nov 29, 2022 at 12:51:17PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
>
> > [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
>
> Do we know which kernel versions are affected here?

Since the lockless walk of huge_pte_offset() has existed since the initial
git commit, the issue should date back to when we introduced pmd sharing
for hugetlb, which means any kernel after commit 39dde65c9940 ("[PATCH]
shared page table for hugetlb page", 2006-12-07) should be prone to the
issue, at least for x86 (as pmd sharing was initially proposed only for
x86).

--
Peter Xu
On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:

> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, but majorly covering safe use of huge_pte_offset.

We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
settle down and stabilize things...

>
> ...
>
> huge_pte_offset() is always called with mmap lock held with either read or
> write. It was assumed to be safe but it's actually not. One race
> condition can easily trigger by: (1) firstly trigger pmd share on a memory
> range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> another thread unshare the pmd range, and the pgtable page is prone to lost
> if the other shared process wants to free it completely (by either munmap
> or exit mm).

That sounds like a hard-to-hit memory leak, but what we have here is a
user-triggerable use-after-free and an oops. Ugh. Could people please
prioritize the review and testing of this patchset?
Hi, Andrew,

On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
>
> > Based on latest mm-unstable (9ed079378408).
> >
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
>
> We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> settle down and stabilize things...

I targeted this series for the next release, not the current one, because
there's no known report for it per my knowledge.

The reproducer needs explicit kernel delays to trigger, as mentioned in the
cover letter. So far I didn't try to reproduce with a generic kernel yet,
but only verified the existence of the problem.

>
> >
> > ...
> >
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write. It was assumed to be safe but it's actually not. One race
> > condition can easily trigger by: (1) firstly trigger pmd share on a memory
> > range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> > another thread unshare the pmd range, and the pgtable page is prone to lost
> > if the other shared process wants to free it completely (by either munmap
> > or exit mm).
>
> That sounds like a hard-to-hit memory leak, but what we have here is a
> user-triggerable use-after-free and an oops. Ugh.

IIUC it's not a leak, but it's just that huge_pte_offset() can walk the
(pmd-shared) pgtable page and also try to take the pgtable lock even though
the page can already be freed in parallel, hence accessing either the page
or the pgtable lock after the pgtable page got freed.

E.g., the 1st warning was triggered by:

static inline struct lock_class *hlock_class(struct held_lock *hlock)
{
    unsigned int class_idx = hlock->class_idx;

    /* Don't re-read hlock->class_idx, can't use READ_ONCE() on bitfield */
    barrier();

    if (!test_bit(class_idx, lock_classes_in_use)) {
        /*
         * Someone passed in garbage, we give up.
         */
        DEBUG_LOCKS_WARN_ON(1);   <----------------------------
        return NULL;
    }
    ...
}

I think it's because the spinlock got freed along with the pgtable page, so
when we want to take the pgtable lock we see a weird lock state in dep_map,
as the lock pointer is not valid at all.

--
Peter Xu
On Tue, 29 Nov 2022 16:19:52 -0500 Peter Xu <peterx@redhat.com> wrote:

> On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> > On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
> >
> > > Based on latest mm-unstable (9ed079378408).
> > >
> > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
> >
> > We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> > settle down and stabilize things...
>
> I targeted this series for the next release not current, because there's no
> known report for it per my knowledge.
>
> The reproducer needs explicit kernel delays to trigger as mentioned in the
> cover letter. So far I didn't try to reproduce with a generic kernel yet
> but just to verify the existance of the problem.

OK, thanks, I missed that. I'll give the series a run in -next for a couple
of days, then I'll pull it out until after the next Linus merge window, so
it can't invalidate testing heading into that merge window.
Even though vma_offset_start() is named like that, it doesn't return "the
start address of the range" but rather the offset that should be applied to
vma->vm_start.
Make it return the real start vaddr instead. That also helps all the
callers, because whenever the return value is used, it's ultimately added
to vma->vm_start anyway.
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
fs/hugetlbfs/inode.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 790d2727141a..fdb16246f46e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,12 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
*/
static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
{
+ unsigned long offset = 0;
+
if (vma->vm_pgoff < start)
- return (start - vma->vm_pgoff) << PAGE_SHIFT;
- else
- return 0;
+ offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
+
+ return vma->vm_start + offset;
}
static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
@@ -457,7 +459,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);
- if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+ if (!hugetlb_vma_maps_page(vma, v_start, page))
continue;
if (!hugetlb_vma_trylock_write(vma)) {
@@ -473,8 +475,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
break;
}
- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, ZAP_FLAG_DROP_MARKER);
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);
hugetlb_vma_unlock_write(vma);
}
@@ -507,10 +509,9 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
*/
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);
- if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
- unmap_hugepage_range(vma, vma->vm_start + v_start,
- v_end, NULL,
- ZAP_FLAG_DROP_MARKER);
+ if (hugetlb_vma_maps_page(vma, v_start, page))
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);
kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
hugetlb_vma_unlock_write(vma);
@@ -540,8 +541,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);
- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, zap_flags);
+ unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);
/*
* Note that vma lock only exists for shared/non-private
--
2.37.3
On 29.11.22 20:35, Peter Xu wrote: > Even though vma_offset_start() is named like that, it's not returning "the > start address of the range" but rather the offset we should use to offset > the vma->vm_start address. > > Make it return the real value of the start vaddr, and it also helps for all > the callers because whenever the retval is used, it'll be ultimately added > into the vma->vm_start anyway, so it's better. ... the function name is a bit unfortunate ("offset" what?, "start" what?). But it now matches the semantics of vma_offset_end() and the actual "v_start" variables. Reviewed-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb, so that a migration entry will also be treated as no
page.
This is probably also the last piece of follow_page code that may sleep;
the previous such piece was removed by cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..dfe677fadaf8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6234,7 +6234,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;
-retry:
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
return NULL;
@@ -6257,16 +6256,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
page = NULL;
goto out;
}
- } else {
- if (is_hugetlb_entry_migration(entry)) {
- spin_unlock(ptl);
- __migration_entry_wait_huge(pte, ptl);
- goto retry;
- }
- /*
- * hwpoisoned entry is treated as no_page_table in
- * follow_page_mask().
- */
}
out:
spin_unlock(ptl);
--
2.37.3
On 29.11.22 20:35, Peter Xu wrote: > That's what the code does with !hugetlb pages, so we should logically do > the same for hugetlb, so migration entry will also be treated as no page. > > This is probably also the last piece in follow_page code that may sleep, > the last one should be removed in cf994dd8af27 ("mm/gup: remove > FOLL_MIGRATION", 2022-11-16). We only have a handful of follow_page() calls left. Most relevant for hugetlb should only be the ones from mm/migrate.c. So this is now just consistent with ordinary pages/THP. Happy so see that gone. Reviewed-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
On 11/29/22 14:35, Peter Xu wrote: > That's what the code does with !hugetlb pages, so we should logically do > the same for hugetlb, so migration entry will also be treated as no page. Makes sense. I like the consistency. > This is probably also the last piece in follow_page code that may sleep, > the last one should be removed in cf994dd8af27 ("mm/gup: remove > FOLL_MIGRATION", 2022-11-16). > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/hugetlb.c | 11 ----------- > 1 file changed, 11 deletions(-) Thanks! Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> -- Mike Kravetz
huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.
Normally, it's always safe to walk a generic pgtable as long as we're with
the mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.
But it's not true for hugetlbfs, especially for shared mappings: hugetlbfs
can have its pgtable freed by pmd unsharing. It means that even with the
mmap lock held for the current mm, the PMD pgtable page can still go away
from under us if pmd unsharing is possible during the walk.
So we have two ways to make it safe even for a shared mapping:
(1) If we're with the hugetlb vma lock held for either read/write, it's
okay because pmd unshare cannot happen at all.
(2) If we're with the i_mmap_rwsem lock held for either read/write, it's
okay because even if pmd unshare can happen, the pgtable page cannot
be freed from under us.
Document it.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..81efd9b9baa2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible of its thread safety. One can follow this rule:
+ *
+ * (1) For private mappings: pmd unsharing is not possible, so it'll
+ * always be safe if we're with the mmap sem for either read or write.
+ * This is normally always the case, IOW we don't need to do anything
+ * special.
+ *
+ * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ * pgtable page can go away from under us! It can be done by a pmd
+ * unshare with a follow up munmap() on the other process), then we
+ * need either:
+ *
+ * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ * won't happen upon the range (it also makes sure the pte_t we
+ * read is the right and stable one), or,
+ *
+ * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ * sure even if unshare happened the racy unmap() will wait until
+ * i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock released. Option (2.2) doesn't protect
+ * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
+ * access.
+ */
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
unsigned long hugetlb_mask_last_page(struct hstate *h);
--
2.37.3
On 29.11.22 20:35, Peter Xu wrote: > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a > hugetlb address. > > Normally, it's always safe to walk a generic pgtable as long as we're with > the mmap lock held for either read or write, because that guarantees the > pgtable pages will always be valid during the process. > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its > pgtable freed by pmd unsharing, it means that even with mmap lock held for > current mm, the PMD pgtable page can still go away from under us if pmd > unsharing is possible during the walk. > > So we have two ways to make it safe even for a shared mapping: > > (1) If we're with the hugetlb vma lock held for either read/write, it's > okay because pmd unshare cannot happen at all. > > (2) If we're with the i_mmap_rwsem lock held for either read/write, it's > okay because even if pmd unshare can happen, the pgtable page cannot > be freed from under us. > > Document it. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ > 1 file changed, 32 insertions(+) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index 551834cd5299..81efd9b9baa2 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; > > pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long addr, unsigned long sz); > +/* > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. > + * Returns the pte_t* if found, or NULL if the address is not mapped. > + * > + * Since this function will walk all the pgtable pages (including not only > + * high-level pgtable page, but also PUD entry that can be unshared > + * concurrently for VM_SHARED), the caller of this function should be > + * responsible of its thread safety. One can follow this rule: > + * > + * (1) For private mappings: pmd unsharing is not possible, so it'll > + * always be safe if we're with the mmap sem for either read or write. > + * This is normally always the case, IOW we don't need to do anything > + * special. Maybe worth mentioning that hugetlb_vma_lock_read() and friends already optimize for private mappings, to not take the VMA lock if not required. Was happy to spot that optimization in there already :) -- Thanks, David / dhildenb
On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote: > On 29.11.22 20:35, Peter Xu wrote: > > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a > > hugetlb address. > > > > Normally, it's always safe to walk a generic pgtable as long as we're with > > the mmap lock held for either read or write, because that guarantees the > > pgtable pages will always be valid during the process. > > > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its > > pgtable freed by pmd unsharing, it means that even with mmap lock held for > > current mm, the PMD pgtable page can still go away from under us if pmd > > unsharing is possible during the walk. > > > > So we have two ways to make it safe even for a shared mapping: > > > > (1) If we're with the hugetlb vma lock held for either read/write, it's > > okay because pmd unshare cannot happen at all. > > > > (2) If we're with the i_mmap_rwsem lock held for either read/write, it's > > okay because even if pmd unshare can happen, the pgtable page cannot > > be freed from under us. > > > > Document it. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ > > 1 file changed, 32 insertions(+) > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index 551834cd5299..81efd9b9baa2 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; > > pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, > > unsigned long addr, unsigned long sz); > > +/* > > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. > > + * Returns the pte_t* if found, or NULL if the address is not mapped. > > + * > > + * Since this function will walk all the pgtable pages (including not only > > + * high-level pgtable page, but also PUD entry that can be unshared > > + * concurrently for VM_SHARED), the caller of this function should be > > + * responsible of its thread safety. One can follow this rule: > > + * > > + * (1) For private mappings: pmd unsharing is not possible, so it'll > > + * always be safe if we're with the mmap sem for either read or write. > > + * This is normally always the case, IOW we don't need to do anything > > + * special. > > Maybe worth mentioning that hugetlb_vma_lock_read() and friends already > optimize for private mappings, to not take the VMA lock if not required. Yes we can. I assume this is not super urgent so I'll hold a while to see whether there's anything else that needs amending for the documents. Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only code path it's still better to not take the lock at all, because that still contains a function jump which will be unnecesary. Thanks, -- Peter Xu
On 30.11.22 17:09, Peter Xu wrote: > On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote: >> On 29.11.22 20:35, Peter Xu wrote: >>> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a >>> hugetlb address. >>> >>> Normally, it's always safe to walk a generic pgtable as long as we're with >>> the mmap lock held for either read or write, because that guarantees the >>> pgtable pages will always be valid during the process. >>> >>> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its >>> pgtable freed by pmd unsharing, it means that even with mmap lock held for >>> current mm, the PMD pgtable page can still go away from under us if pmd >>> unsharing is possible during the walk. >>> >>> So we have two ways to make it safe even for a shared mapping: >>> >>> (1) If we're with the hugetlb vma lock held for either read/write, it's >>> okay because pmd unshare cannot happen at all. >>> >>> (2) If we're with the i_mmap_rwsem lock held for either read/write, it's >>> okay because even if pmd unshare can happen, the pgtable page cannot >>> be freed from under us. >>> >>> Document it. >>> >>> Signed-off-by: Peter Xu <peterx@redhat.com> >>> --- >>> include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ >>> 1 file changed, 32 insertions(+) >>> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h >>> index 551834cd5299..81efd9b9baa2 100644 >>> --- a/include/linux/hugetlb.h >>> +++ b/include/linux/hugetlb.h >>> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; >>> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, >>> unsigned long addr, unsigned long sz); >>> +/* >>> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. >>> + * Returns the pte_t* if found, or NULL if the address is not mapped. >>> + * >>> + * Since this function will walk all the pgtable pages (including not only >>> + * high-level pgtable page, but also PUD entry that can be unshared >>> + * concurrently for VM_SHARED), the caller of this function should be >>> + * responsible of its thread safety. One can follow this rule: >>> + * >>> + * (1) For private mappings: pmd unsharing is not possible, so it'll >>> + * always be safe if we're with the mmap sem for either read or write. >>> + * This is normally always the case, IOW we don't need to do anything >>> + * special. >> >> Maybe worth mentioning that hugetlb_vma_lock_read() and friends already >> optimize for private mappings, to not take the VMA lock if not required. > > Yes we can. I assume this is not super urgent so I'll hold a while to see > whether there's anything else that needs amending for the documents. > > Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only > code path it's still better to not take the lock at all, because that still > contains a function jump which will be unnecesary. IMHO it makes coding a lot more consistent and less error-prone when not care about whether to the the lock or not (as an optimization) and just having this handled "automatically". Optimizing a jump out would rather smell like a micro-optimization. -- Thanks, David / dhildenb
On Wed, Nov 30, 2022 at 05:11:36PM +0100, David Hildenbrand wrote: > On 30.11.22 17:09, Peter Xu wrote: > > On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote: > > > On 29.11.22 20:35, Peter Xu wrote: > > > > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a > > > > hugetlb address. > > > > > > > > Normally, it's always safe to walk a generic pgtable as long as we're with > > > > the mmap lock held for either read or write, because that guarantees the > > > > pgtable pages will always be valid during the process. > > > > > > > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its > > > > pgtable freed by pmd unsharing, it means that even with mmap lock held for > > > > current mm, the PMD pgtable page can still go away from under us if pmd > > > > unsharing is possible during the walk. > > > > > > > > So we have two ways to make it safe even for a shared mapping: > > > > > > > > (1) If we're with the hugetlb vma lock held for either read/write, it's > > > > okay because pmd unshare cannot happen at all. > > > > > > > > (2) If we're with the i_mmap_rwsem lock held for either read/write, it's > > > > okay because even if pmd unshare can happen, the pgtable page cannot > > > > be freed from under us. > > > > > > > > Document it. > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > --- > > > > include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ > > > > 1 file changed, 32 insertions(+) > > > > > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > > > index 551834cd5299..81efd9b9baa2 100644 > > > > --- a/include/linux/hugetlb.h > > > > +++ b/include/linux/hugetlb.h > > > > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; > > > > pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, > > > > unsigned long addr, unsigned long sz); > > > > +/* > > > > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. > > > > + * Returns the pte_t* if found, or NULL if the address is not mapped. > > > > + * > > > > + * Since this function will walk all the pgtable pages (including not only > > > > + * high-level pgtable page, but also PUD entry that can be unshared > > > > + * concurrently for VM_SHARED), the caller of this function should be > > > > + * responsible of its thread safety. One can follow this rule: > > > > + * > > > > + * (1) For private mappings: pmd unsharing is not possible, so it'll > > > > + * always be safe if we're with the mmap sem for either read or write. > > > > + * This is normally always the case, IOW we don't need to do anything > > > > + * special. > > > > > > Maybe worth mentioning that hugetlb_vma_lock_read() and friends already > > > optimize for private mappings, to not take the VMA lock if not required. > > > > Yes we can. I assume this is not super urgent so I'll hold a while to see > > whether there's anything else that needs amending for the documents. > > > > Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only > > code path it's still better to not take the lock at all, because that still > > contains a function jump which will be unnecesary. > > IMHO it makes coding a lot more consistent and less error-prone when not > care about whether to the the lock or not (as an optimization) and just > having this handled "automatically". > > Optimizing a jump out would rather smell like a micro-optimization. Or we can move the lock helpers into the headers, too. -- Peter Xu
On 30.11.22 17:25, Peter Xu wrote: > On Wed, Nov 30, 2022 at 05:11:36PM +0100, David Hildenbrand wrote: >> On 30.11.22 17:09, Peter Xu wrote: >>> On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote: >>>> On 29.11.22 20:35, Peter Xu wrote: >>>>> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a >>>>> hugetlb address. >>>>> >>>>> Normally, it's always safe to walk a generic pgtable as long as we're with >>>>> the mmap lock held for either read or write, because that guarantees the >>>>> pgtable pages will always be valid during the process. >>>>> >>>>> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its >>>>> pgtable freed by pmd unsharing, it means that even with mmap lock held for >>>>> current mm, the PMD pgtable page can still go away from under us if pmd >>>>> unsharing is possible during the walk. >>>>> >>>>> So we have two ways to make it safe even for a shared mapping: >>>>> >>>>> (1) If we're with the hugetlb vma lock held for either read/write, it's >>>>> okay because pmd unshare cannot happen at all. >>>>> >>>>> (2) If we're with the i_mmap_rwsem lock held for either read/write, it's >>>>> okay because even if pmd unshare can happen, the pgtable page cannot >>>>> be freed from under us. >>>>> >>>>> Document it. >>>>> >>>>> Signed-off-by: Peter Xu <peterx@redhat.com> >>>>> --- >>>>> include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 32 insertions(+) >>>>> >>>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h >>>>> index 551834cd5299..81efd9b9baa2 100644 >>>>> --- a/include/linux/hugetlb.h >>>>> +++ b/include/linux/hugetlb.h >>>>> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; >>>>> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, >>>>> unsigned long addr, unsigned long sz); >>>>> +/* >>>>> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. >>>>> + * Returns the pte_t* if found, or NULL if the address is not mapped. >>>>> + * >>>>> + * Since this function will walk all the pgtable pages (including not only >>>>> + * high-level pgtable page, but also PUD entry that can be unshared >>>>> + * concurrently for VM_SHARED), the caller of this function should be >>>>> + * responsible of its thread safety. One can follow this rule: >>>>> + * >>>>> + * (1) For private mappings: pmd unsharing is not possible, so it'll >>>>> + * always be safe if we're with the mmap sem for either read or write. >>>>> + * This is normally always the case, IOW we don't need to do anything >>>>> + * special. >>>> >>>> Maybe worth mentioning that hugetlb_vma_lock_read() and friends already >>>> optimize for private mappings, to not take the VMA lock if not required. >>> >>> Yes we can. I assume this is not super urgent so I'll hold a while to see >>> whether there's anything else that needs amending for the documents. >>> >>> Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only >>> code path it's still better to not take the lock at all, because that still >>> contains a function jump which will be unnecesary. >> >> IMHO it makes coding a lot more consistent and less error-prone when not >> care about whether to the the lock or not (as an optimization) and just >> having this handled "automatically". >> >> Optimizing a jump out would rather smell like a micro-optimization. > > Or we can move the lock helpers into the headers, too. Ah, yes. -- Thanks, David / dhildenb
On 29.11.22 20:35, Peter Xu wrote: > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a > hugetlb address. > > Normally, it's always safe to walk a generic pgtable as long as we're with > the mmap lock held for either read or write, because that guarantees the > pgtable pages will always be valid during the process. With the addition, that it's only safe to walk within VMA ranges while holding the mmap lock in read mode. It's not safe to walk outside VMA ranges. But the point is that we're walking within a known hugetlbfs VMA, I assume, just adding it for completeness :) > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its > pgtable freed by pmd unsharing, it means that even with mmap lock held for > current mm, the PMD pgtable page can still go away from under us if pmd > unsharing is possible during the walk. > > So we have two ways to make it safe even for a shared mapping: > > (1) If we're with the hugetlb vma lock held for either read/write, it's > okay because pmd unshare cannot happen at all. > > (2) If we're with the i_mmap_rwsem lock held for either read/write, it's > okay because even if pmd unshare can happen, the pgtable page cannot > be freed from under us. > > Document it. > > Signed-off-by: Peter Xu <peterx@redhat.com> In general, I like that documentation. Let's see if we can figure out what to do with the i_mmap_rwsem. -- Thanks, David / dhildenb
On 11/29/22 14:35, Peter Xu wrote: > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a > hugetlb address. > > Normally, it's always safe to walk a generic pgtable as long as we're with > the mmap lock held for either read or write, because that guarantees the > pgtable pages will always be valid during the process. > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its > pgtable freed by pmd unsharing, it means that even with mmap lock held for > current mm, the PMD pgtable page can still go away from under us if pmd > unsharing is possible during the walk. > > So we have two ways to make it safe even for a shared mapping: > > (1) If we're with the hugetlb vma lock held for either read/write, it's > okay because pmd unshare cannot happen at all. > > (2) If we're with the i_mmap_rwsem lock held for either read/write, it's > okay because even if pmd unshare can happen, the pgtable page cannot > be freed from under us. > > Document it. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++ > 1 file changed, 32 insertions(+) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index 551834cd5299..81efd9b9baa2 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages; > > pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long addr, unsigned long sz); > +/* > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. > + * Returns the pte_t* if found, or NULL if the address is not mapped. > + * > + * Since this function will walk all the pgtable pages (including not only > + * high-level pgtable page, but also PUD entry that can be unshared > + * concurrently for VM_SHARED), the caller of this function should be > + * responsible of its thread safety. One can follow this rule: > + * > + * (1) For private mappings: pmd unsharing is not possible, so it'll > + * always be safe if we're with the mmap sem for either read or write. > + * This is normally always the case, IOW we don't need to do anything > + * special. > + * > + * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged > + * pgtable page can go away from under us! It can be done by a pmd > + * unshare with a follow up munmap() on the other process), then we > + * need either: > + * > + * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare > + * won't happen upon the range (it also makes sure the pte_t we > + * read is the right and stable one), or, > + * > + * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make > + * sure even if unshare happened the racy unmap() will wait until > + * i_mmap_rwsem is released. Is that 100% correct? IIUC, the page tables will be released via the call to tlb_finish_mmu(). In most cases, the tlb_finish_mmu() call is performed when holding i_mmap_rwsem. However, in the final teardown of a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call is done outside the i_mmap_rwsem lock. In this case, I think we are still safe because nobody else should be walking the page table. I really like the documentation. However, if i_mmap_rwsem is not 100% safe I would prefer not to document it here. I don't think anyone relies on this do they? -- Mike Kravetz > + * > + * Option (2.1) is the safest, which guarantees pte stability from pmd > + * sharing pov, until the vma lock released. 
Option (2.2) doesn't protect > + * a concurrent pmd unshare, but it makes sure the pgtable page is safe to > + * access. > + */ > pte_t *huge_pte_offset(struct mm_struct *mm, > unsigned long addr, unsigned long sz); > unsigned long hugetlb_mask_last_page(struct hstate *h); > -- > 2.37.3 >
Hi, Mike, On Tue, Nov 29, 2022 at 08:55:21PM -0800, Mike Kravetz wrote: > > + * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged > > + * pgtable page can go away from under us! It can be done by a pmd > > + * unshare with a follow up munmap() on the other process), then we > > + * need either: > > + * > > + * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare > > + * won't happen upon the range (it also makes sure the pte_t we > > + * read is the right and stable one), or, > > + * > > + * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make > > + * sure even if unshare happened the racy unmap() will wait until > > + * i_mmap_rwsem is released. > > Is that 100% correct? IIUC, the page tables will be released via the > call to tlb_finish_mmu(). In most cases, the tlb_finish_mmu() call is > performed when holding i_mmap_rwsem. However, in the final teardown of > a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call > is done outside the i_mmap_rwsem lock. In this case, I think we are > still safe because nobody else should be walking the page table. > > I really like the documentation. However, if i_mmap_rwsem is not 100% > safe I would prefer not to document it here. I don't think anyone > relies on this do they? I think i_mmap_rwsem is 100% safe. It's not in tlb_finish_mmu(), but when freeing the pgtables we need to unlink current vma from the vma list first: free_pgtables unlink_file_vma i_mmap_lock_write tlb_finish_mmu So it's not the same logic as how the RCU lock worked, but it's actually better (even though with higher overhead) because vma unlink happens before free_pgd_range(), so the pgtable locks are not freed yet (unlike RCU). Thanks, -- Peter Xu
On 11/30/22 10:58, Peter Xu wrote: > Hi, Mike, > > On Tue, Nov 29, 2022 at 08:55:21PM -0800, Mike Kravetz wrote: > > > + * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged > > > + * pgtable page can go away from under us! It can be done by a pmd > > > + * unshare with a follow up munmap() on the other process), then we > > > + * need either: > > > + * > > > + * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare > > > + * won't happen upon the range (it also makes sure the pte_t we > > > + * read is the right and stable one), or, > > > + * > > > + * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make > > > + * sure even if unshare happened the racy unmap() will wait until > > > + * i_mmap_rwsem is released. > > > > Is that 100% correct? IIUC, the page tables will be released via the > > call to tlb_finish_mmu(). In most cases, the tlb_finish_mmu() call is > > performed when holding i_mmap_rwsem. However, in the final teardown of > > a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call > > is done outside the i_mmap_rwsem lock. In this case, I think we are > > still safe because nobody else should be walking the page table. > > > > I really like the documentation. However, if i_mmap_rwsem is not 100% > > safe I would prefer not to document it here. I don't think anyone > > relies on this do they? > > I think i_mmap_rwsem is 100% safe. > > It's not in tlb_finish_mmu(), but when freeing the pgtables we need to > unlink current vma from the vma list first: > > free_pgtables > unlink_file_vma > i_mmap_lock_write > tlb_finish_mmu Thanks! Sorry, I was thinking about page freeing not page table freeing. Agree that is 100% safe. -- Mike Kravetz
In hugetlb_fault() there used to be a special path at the entrance that
handled swap entries using huge_pte_offset().  That is unsafe, because
huge_pte_offset() on a pmd-sharable range can access freed pgtables when
no lock protects the pgtable from being freed after a pmd unshare.
The simplest way to make it safe is to move the swap handling to after
the vma lock is held.  We may now need to take the fault mutex for
migration or hwpoison entries too (and the vma lock, but that one is
genuinely required); neither of them is a hot path.
Note that the vma lock cannot be released in hugetlb_fault() when a
migration entry is detected, because migration_entry_wait_huge() will
use the pgtable page again (by taking the pgtable lock), so that access
also needs to be protected by the vma lock.  Modify
migration_entry_wait_huge() so that it must be called with the vma read
lock held, and release the lock in __migration_entry_wait_huge().
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/swapops.h | 6 ++++--
mm/hugetlb.c | 32 +++++++++++++++-----------------
mm/migrate.c | 25 +++++++++++++++++++++----
3 files changed, 40 insertions(+), 23 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 27ade4f22abb..09b22b169a71 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
+extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
@@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
+static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
#endif /* CONFIG_HUGETLB_PAGE */
static inline int is_writable_migration_entry(swp_entry_t entry)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfe677fadaf8..776e34ccf029 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int need_wait_lock = 0;
unsigned long haddr = address & huge_page_mask(h);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
- if (ptep) {
- /*
- * Since we hold no locks, ptep could be stale. That is
- * OK as we are only making decisions based on content and
- * not actually modifying content here.
- */
- entry = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(vma, ptep);
- return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
- return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- }
-
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
@@ -5888,8 +5872,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
* properly handle it.
*/
- if (!pte_present(entry))
+ if (!pte_present(entry)) {
+ if (unlikely(is_hugetlb_entry_migration(entry))) {
+ /*
+ * Release fault lock first because the vma lock is
+ * needed to guard the huge_pte_lockptr() later in
+ * migration_entry_wait_huge(). The vma lock will
+ * be released there.
+ */
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ migration_entry_wait_huge(vma, ptep);
+ return 0;
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
+ }
/*
* If we are going to COW/unshare the mapping later, we examine the
diff --git a/mm/migrate.c b/mm/migrate.c
index 267ad0d073ae..c13c828d34f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -326,24 +326,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
}
#ifdef CONFIG_HUGETLB_PAGE
-void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
+void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl)
{
pte_t pte;
+ /*
+ * The caller must hold the vma read lock, which will be released
+ * before this function returns.  It makes sure the pgtable page
+ * (along with its spinlock) will not be freed in parallel.
+ */
+ hugetlb_vma_assert_locked(vma);
+
spin_lock(ptl);
pte = huge_ptep_get(ptep);
- if (unlikely(!is_hugetlb_entry_migration(pte)))
+ if (unlikely(!is_hugetlb_entry_migration(pte))) {
spin_unlock(ptl);
- else
+ hugetlb_vma_unlock_read(vma);
+ } else {
+ /*
+ * If a migration entry exists, it is safe to release the vma
+ * lock here because the pgtable page won't be freed until the
+ * pgtable lock is released.  See the comment right above the
+ * pgtable lock release in migration_entry_wait_on_locked().
+ */
+ hugetlb_vma_unlock_read(vma);
migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+ }
}
void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
{
spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
- __migration_entry_wait_huge(pte, ptl);
+ __migration_entry_wait_huge(vma, pte, ptl);
}
#endif
--
2.37.3
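For reference, a caller-side sketch of the locking convention the patch above establishes. This is not part of the patch: the fragment mirrors the flow inside hugetlb_fault(), and it reuses mm/hugetlb.c-internal helpers (is_hugetlb_entry_migration()) and local variables (h, haddr, hash) from that context purely for readability.

	pte_t *ptep, entry;

	/* Fault mutex and hugetlb vma read lock already held here. */
	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep) {
		entry = huge_ptep_get(ptep);
		if (unlikely(is_hugetlb_entry_migration(entry))) {
			/* Drop the fault mutex, keep the vma read lock. */
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			/*
			 * Must be called with the vma read lock held; the
			 * lock is released in __migration_entry_wait_huge(),
			 * so neither the lock nor ptep may be touched after
			 * this call.
			 */
			migration_entry_wait_huge(vma, ptep);
			return 0;
		}
	}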
On 11/29/22 14:35, Peter Xu wrote: > In hugetlb_fault(), there used to have a special path to handle swap entry > at the entrance using huge_pte_offset(). That's unsafe because > huge_pte_offset() for a pmd sharable range can access freed pgtables if > without any lock to protect the pgtable from being freed after pmd unshare. Thanks. That special path has been there for a long time. I should have given it more thought when adding lock. However, I was only considering the 'stale' pte case as opposed to freed. > Here the simplest solution to make it safe is to move the swap handling to > be after the vma lock being held. We may need to take the fault mutex on > either migration or hwpoison entries now (also the vma lock, but that's > really needed), however neither of them is hot path. > > Note that the vma lock cannot be released in hugetlb_fault() when the > migration entry is detected, because in migration_entry_wait_huge() the > pgtable page will be used again (by taking the pgtable lock), so that also > need to be protected by the vma lock. Modify migration_entry_wait_huge() > so that it must be called with vma read lock held, and properly release the > lock in __migration_entry_wait_huge(). > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/linux/swapops.h | 6 ++++-- > mm/hugetlb.c | 32 +++++++++++++++----------------- > mm/migrate.c | 25 +++++++++++++++++++++---- > 3 files changed, 40 insertions(+), 23 deletions(-) Looks good with one small change noted below, Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> > > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index 27ade4f22abb..09b22b169a71 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, > extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, > unsigned long address); > #ifdef CONFIG_HUGETLB_PAGE > -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl); > +extern void __migration_entry_wait_huge(struct vm_area_struct *vma, > + pte_t *ptep, spinlock_t *ptl); > extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte); > #endif /* CONFIG_HUGETLB_PAGE */ > #else /* CONFIG_MIGRATION */ > @@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, > static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, > unsigned long address) { } > #ifdef CONFIG_HUGETLB_PAGE > -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { } > +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma, > + pte_t *ptep, spinlock_t *ptl) { } > static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { } > #endif /* CONFIG_HUGETLB_PAGE */ > static inline int is_writable_migration_entry(swp_entry_t entry) > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index dfe677fadaf8..776e34ccf029 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > int need_wait_lock = 0; > unsigned long haddr = address & huge_page_mask(h); > > - ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); > - if (ptep) { > - /* > - * Since we hold no locks, ptep could be stale. That is > - * OK as we are only making decisions based on content and > - * not actually modifying content here. 
> - */ > - entry = huge_ptep_get(ptep); > - if (unlikely(is_hugetlb_entry_migration(entry))) { > - migration_entry_wait_huge(vma, ptep); > - return 0; > - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) > - return VM_FAULT_HWPOISON_LARGE | > - VM_FAULT_SET_HINDEX(hstate_index(h)); > - } > - Before acquiring the vma_lock, there is this comment: /* * Acquire vma lock before calling huge_pte_alloc and hold * until finished with ptep. This prevents huge_pmd_unshare from * being called elsewhere and making the ptep no longer valid. * * ptep could have already be assigned via hugetlb_walk(). That * is OK, as huge_pte_alloc will return the same value unless * something has changed. */ The second sentence in that comment about ptep being already assigned no longer applies and can be deleted. -- Mike Kravetz
On Mon, Dec 05, 2022 at 02:14:38PM -0800, Mike Kravetz wrote: > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index dfe677fadaf8..776e34ccf029 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > > int need_wait_lock = 0; > > unsigned long haddr = address & huge_page_mask(h); > > > > - ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); > > - if (ptep) { > > - /* > > - * Since we hold no locks, ptep could be stale. That is > > - * OK as we are only making decisions based on content and > > - * not actually modifying content here. > > - */ > > - entry = huge_ptep_get(ptep); > > - if (unlikely(is_hugetlb_entry_migration(entry))) { > > - migration_entry_wait_huge(vma, ptep); > > - return 0; > > - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) > > - return VM_FAULT_HWPOISON_LARGE | > > - VM_FAULT_SET_HINDEX(hstate_index(h)); > > - } > > - > > Before acquiring the vma_lock, there is this comment: > > /* > * Acquire vma lock before calling huge_pte_alloc and hold > * until finished with ptep. This prevents huge_pmd_unshare from > * being called elsewhere and making the ptep no longer valid. > * > * ptep could have already be assigned via hugetlb_walk(). That > * is OK, as huge_pte_alloc will return the same value unless > * something has changed. > */ > > The second sentence in that comment about ptep being already assigned no > longer applies and can be deleted. Correct, this can be removed. One thing to mention is there's an inline touch-up above in the last patch of the series when introducing hugetlb_walk() to s/pte_offset/walk/, but I saw that Andrew has already done the right thing on the fixup one in his local tree, so I think we're all good. Thanks! -- Peter Xu
userfaultfd_huge_must_wait() walks the hugetlb pgtable, so it needs the
hugetlb walker lock; here we take the hugetlb vma lock directly.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
fs/userfaultfd.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..a602f008dde5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
*/
vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
{
- struct mm_struct *mm = vmf->vma->vm_mm;
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
*/
mmap_assert_locked(mm);
- ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
+ ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;
@@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
blocking_state = userfaultfd_get_blocking_state(vmf->flags);
+ /*
+ * This stabilizes the pgtable for hugetlb on e.g. pmd unsharing.
+ * It needs to be taken before setting the current state.
+ */
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_lock_read(vma);
+
spin_lock_irq(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
@@ -507,13 +515,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
set_current_state(blocking_state);
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
- if (!is_vm_hugetlb_page(vmf->vma))
+ if (!is_vm_hugetlb_page(vma))
must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
reason);
else
- must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+ must_wait = userfaultfd_huge_must_wait(ctx, vma,
vmf->address,
vmf->flags, reason);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_unlock_read(vma);
mmap_read_unlock(mm);
if (likely(must_wait && !READ_ONCE(ctx->released))) {
--
2.37.3
On 11/29/22 14:35, Peter Xu wrote: > We can take the hugetlb walker lock, here taking vma lock directly. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > fs/userfaultfd.c | 18 ++++++++++++++---- > 1 file changed, 14 insertions(+), 4 deletions(-) Looks good. Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> -- Mike Kravetz
On 29.11.22 20:35, Peter Xu wrote: > We can take the hugetlb walker lock, here taking vma lock directly. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Reviewed-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 776e34ccf029..d6bb1d22f1c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6232,9 +6232,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;
+ hugetlb_vma_lock_read(vma);
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
- return NULL;
+ goto out_unlock;
ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
@@ -6257,6 +6258,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
}
out:
spin_unlock(ptl);
+out_unlock:
+ hugetlb_vma_unlock_read(vma);
return page;
}
--
2.37.3
On 11/29/22 14:35, Peter Xu wrote: > Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/hugetlb.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) Thanks! Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> -- Mike Kravetz
On 29.11.22 20:35, Peter Xu wrote: > Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6bb1d22f1c4..df645a5824e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6290,6 +6290,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
break;
}
+ hugetlb_vma_lock_read(vma);
/*
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
@@ -6314,6 +6315,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
break;
}
@@ -6335,6 +6337,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
+
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
else if (unshare)
@@ -6394,6 +6398,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
remainder -= pages_per_huge_page(h);
i += pages_per_huge_page(h);
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
continue;
}
@@ -6421,6 +6426,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
flags))) {
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
err = -ENOMEM;
break;
@@ -6432,6 +6438,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
i += refs;
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
}
*nr_pages = remainder;
/*
--
2.37.3
On 11/29/22 14:35, Peter Xu wrote: > Since follow_hugetlb_page() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/hugetlb.c | 7 +++++++ > 1 file changed, 7 insertions(+) Thanks! The flow of that routine is difficult to follow. Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> -- Mike Kravetz
On 29.11.22 20:35, Peter Xu wrote: > Since follow_hugetlb_page() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/pagewalk.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..d98564a7be57 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
+ hugetlb_vma_lock_read(vma);
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
if (err)
break;
} while (addr = next, addr != end);
+ hugetlb_vma_unlock_read(vma);
return err;
}
--
2.37.3
On 11/29/22 14:35, Peter Xu wrote: > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/pagewalk.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > index 7f1c9b274906..d98564a7be57 100644 > --- a/mm/pagewalk.c > +++ b/mm/pagewalk.c > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > const struct mm_walk_ops *ops = walk->ops; > int err = 0; > > + hugetlb_vma_lock_read(vma); > do { > next = hugetlb_entry_end(h, addr, end); > pte = huge_pte_offset(walk->mm, addr & hmask, sz); For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with the vma_lock held. I looked into the various hugetlb_entry routines, and I am not sure about hmm_vma_walk_hugetlb_entry. It seems like it could possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault. If this can happen, then we may have an issue as hugetlb_fault will also need to acquire the vma_lock in read mode. I do not know the hmm code well enough to know if this may be an actual issue? -- Mike Kravetz
On 12/5/22 15:33, Mike Kravetz wrote: > On 11/29/22 14:35, Peter Xu wrote: >> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock >> to make sure the pgtable page will not be freed concurrently. >> >> Signed-off-by: Peter Xu <peterx@redhat.com> >> --- >> mm/pagewalk.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/mm/pagewalk.c b/mm/pagewalk.c >> index 7f1c9b274906..d98564a7be57 100644 >> --- a/mm/pagewalk.c >> +++ b/mm/pagewalk.c >> @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, >> const struct mm_walk_ops *ops = walk->ops; >> int err = 0; >> >> + hugetlb_vma_lock_read(vma); >> do { >> next = hugetlb_entry_end(h, addr, end); >> pte = huge_pte_offset(walk->mm, addr & hmask, sz); > > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with > the vma_lock held. I looked into the various hugetlb_entry routines, and > I am not sure about hmm_vma_walk_hugetlb_entry. It seems like it could > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault. If this > can happen, then we may have an issue as hugetlb_fault will also need to > acquire the vma_lock in read mode. > > I do not know the hmm code well enough to know if this may be an actual > issue? Oh, this sounds like a serious concern. If we add a new lock, and hold it during callbacks that also need to take it, that's not going to work out, right? And yes, hmm_range_fault() and related things do a good job of revealing this kind of deadlock. :) thanks, -- John Hubbard NVIDIA
On Mon, Dec 05, 2022 at 03:52:51PM -0800, John Hubbard wrote: > On 12/5/22 15:33, Mike Kravetz wrote: > > On 11/29/22 14:35, Peter Xu wrote: > > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock > > > to make sure the pgtable page will not be freed concurrently. > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > --- > > > mm/pagewalk.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > > index 7f1c9b274906..d98564a7be57 100644 > > > --- a/mm/pagewalk.c > > > +++ b/mm/pagewalk.c > > > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > > const struct mm_walk_ops *ops = walk->ops; > > > int err = 0; > > > + hugetlb_vma_lock_read(vma); > > > do { > > > next = hugetlb_entry_end(h, addr, end); > > > pte = huge_pte_offset(walk->mm, addr & hmask, sz); > > > > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with > > the vma_lock held. I looked into the various hugetlb_entry routines, and > > I am not sure about hmm_vma_walk_hugetlb_entry. It seems like it could > > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault. If this > > can happen, then we may have an issue as hugetlb_fault will also need to > > acquire the vma_lock in read mode. Thanks for spotting that, Mike. I used to notice that path special but that's when I was still using RCU locks who doesn't have the issue. Then I overlooked this one when switchover. > > > > I do not know the hmm code well enough to know if this may be an actual > > issue? > > Oh, this sounds like a serious concern. If we add a new lock, and hold it > during callbacks that also need to take it, that's not going to work out, > right? > > And yes, hmm_range_fault() and related things do a good job of revealing > this kind of deadlock. :) I've got a fixup attached. John, since this got your attention please also have a look too in case there's further issues. Thanks, -- Peter Xu From 9ad1e65a31f51a0dc687cd9d6083b9e920d2da61 Mon Sep 17 00:00:00 2001 From: Peter Xu <peterx@redhat.com> Date: Tue, 6 Dec 2022 11:38:47 -0500 Subject: [PATCH] fixup! mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare Content-type: text/plain Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/s390/mm/gmap.c | 2 ++ fs/proc/task_mmu.c | 2 ++ include/linux/pagewalk.h | 8 +++++++- mm/hmm.c | 8 +++++++- 4 files changed, 18 insertions(+), 2 deletions(-) diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 8947451ae021..292a54c490d4 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr, end = start + HPAGE_SIZE - 1; __storage_key_init_range(start, end); set_bit(PG_arch_1, &page->flags); + hugetlb_vma_unlock_read(walk->vma); cond_resched(); + hugetlb_vma_lock_read(walk->vma); return 0; } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 89338950afd3..d7155f3bb678 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1612,7 +1612,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, frame++; } + hugetlb_vma_unlock_read(walk->vma); cond_resched(); + hugetlb_vma_lock_read(walk->vma); return err; } diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 959f52e5867d..1f7c2011f6cb 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -21,7 +21,13 @@ struct mm_walk; * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. 
* Any folded depths (where PTRS_PER_P?D is equal to 1) * are skipped. - * @hugetlb_entry: if set, called for each hugetlb entry + * @hugetlb_entry: if set, called for each hugetlb entry. Note that + * currently the hook function is protected by hugetlb + * vma lock to make sure pte_t* and the spinlock is valid + * to access. If the hook function needs to yield the + * thread or retake the vma lock for some reason, it + * needs to properly release the vma lock manually, + * and retake it before the function returns. * @test_walk: caller specific callback function to determine whether * we walk over the current vma or not. Returning 0 means * "do page table walk over the current vma", returning diff --git a/mm/hmm.c b/mm/hmm.c index 3850fb625dda..dcd624f28bcf 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -493,8 +493,14 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, required_fault = hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags); if (required_fault) { + int ret; + spin_unlock(ptl); - return hmm_vma_fault(addr, end, required_fault, walk); + hugetlb_vma_unlock_read(vma); + /* hmm_vma_fault() can retake the vma lock */ + ret = hmm_vma_fault(addr, end, required_fault, walk); + hugetlb_vma_lock_read(vma); + return ret; } pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT); -- 2.37.3
On 12/6/22 08:45, Peter Xu wrote: > I've got a fixup attached. John, since this got your attention please also > have a look too in case there's further issues. > Well, one question: Normally, the pattern of "release_lock(A); call f(); acquire_lock(A);" is tricky, because one must revalidate that the state protected by A has not changed while the lock was released. However, in this case, it's letting page fault handling proceed, which already assumes that pages might be gone, so generally that seems OK. However, I'm lagging behind on understanding what the vma lock actually protects. It seems to be a hugetlb-specific protection for concurrent freeing of the page tables? If so, then running a page fault handler seems safe. If there's something else it protects, then we might need to revalidate that after re-acquiring the vma lock. Also, scattering hugetlb-specific locks throughout mm seems like an unfortuate thing, I wonder if there is a longer term plan to Not Do That? thanks, -- John Hubbard NVIDIA
On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: > On 12/6/22 08:45, Peter Xu wrote: > > I've got a fixup attached. John, since this got your attention please also > > have a look too in case there's further issues. > > > > Well, one question: Normally, the pattern of "release_lock(A); call f(); > acquire_lock(A);" is tricky, because one must revalidate that the state > protected by A has not changed while the lock was released. However, in > this case, it's letting page fault handling proceed, which already > assumes that pages might be gone, so generally that seems OK. Yes it's tricky, but not as tricky in this case. I hope my documentation supplemented that (in the fixup patch): + * @hugetlb_entry: if set, called for each hugetlb entry. Note that + * currently the hook function is protected by hugetlb + * vma lock to make sure pte_t* and the spinlock is valid + * to access. If the hook function needs to yield the + * thread or retake the vma lock for some reason, it + * needs to properly release the vma lock manually, + * and retake it before the function returns. The vma lock here makes sure the pte_t and the pgtable spinlock being stable. Without the lock, they're prone to be freed in parallel. > > However, I'm lagging behind on understanding what the vma lock actually > protects. It seems to be a hugetlb-specific protection for concurrent > freeing of the page tables? Not exactly freeing, but unsharing. Mike probably has more to say. The series is here: https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > If so, then running a page fault handler seems safe. If there's something > else it protects, then we might need to revalidate that after > re-acquiring the vma lock. Nothing to validate here. The only reason to take the vma lock is to match with the caller who assumes the lock taken, so either it'll be released very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > > Also, scattering hugetlb-specific locks throughout mm seems like an > unfortuate thing, I wonder if there is a longer term plan to Not Do > That? So far HMM is really the only one - normally hugetlb_entry() hook is pretty light, so not really throughout the whole mm yet. It's even not urgently needed for the other two places calling cond_sched(), I added it mostly just for completeness, and with the slight hope that maybe we can yield earlier for some pmd unsharers. But yes it's unfortunate, I just didn't come up with a good solution. Suggestion is always welcomed. Thanks, -- Peter Xu
On 12/6/22 13:51, Peter Xu wrote: > On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: >> On 12/6/22 08:45, Peter Xu wrote: >>> I've got a fixup attached. John, since this got your attention please also >>> have a look too in case there's further issues. >>> >> >> Well, one question: Normally, the pattern of "release_lock(A); call f(); >> acquire_lock(A);" is tricky, because one must revalidate that the state >> protected by A has not changed while the lock was released. However, in >> this case, it's letting page fault handling proceed, which already >> assumes that pages might be gone, so generally that seems OK. > > Yes it's tricky, but not as tricky in this case. > > I hope my documentation supplemented that (in the fixup patch): > > + * @hugetlb_entry: if set, called for each hugetlb entry. Note that > + * currently the hook function is protected by hugetlb > + * vma lock to make sure pte_t* and the spinlock is valid > + * to access. If the hook function needs to yield the So far so good... > + * thread or retake the vma lock for some reason, it > + * needs to properly release the vma lock manually, > + * and retake it before the function returns. ...but you can actually delete this second sentence. It does not add any real information--clearly, if you must drop the lock, then you must "manually" drop the lock. And it still ignores my original question, which I don't think I've fully communicated. Basically, what can happen to the protected data during the time when the lock is not held? > > The vma lock here makes sure the pte_t and the pgtable spinlock being > stable. Without the lock, they're prone to be freed in parallel. > Yes, but think about this: if the vma lock protects against the pte going away, then: lock() get a pte unlock() ...let hmm_vma_fault() cond_resched() run... lock() ...whoops, something else release the pte that I'd previously retrieved. >> >> However, I'm lagging behind on understanding what the vma lock actually >> protects. It seems to be a hugetlb-specific protection for concurrent >> freeing of the page tables? > > Not exactly freeing, but unsharing. Mike probably has more to say. The > series is here: > > https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > >> If so, then running a page fault handler seems safe. If there's something >> else it protects, then we might need to revalidate that after >> re-acquiring the vma lock. > > Nothing to validate here. The only reason to take the vma lock is to match > with the caller who assumes the lock taken, so either it'll be released > very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > ummm, see above. :) >> >> Also, scattering hugetlb-specific locks throughout mm seems like an >> unfortuate thing, I wonder if there is a longer term plan to Not Do >> That? > > So far HMM is really the only one - normally hugetlb_entry() hook is pretty > light, so not really throughout the whole mm yet. It's even not urgently > needed for the other two places calling cond_sched(), I added it mostly > just for completeness, and with the slight hope that maybe we can yield > earlier for some pmd unsharers. > > But yes it's unfortunate, I just didn't come up with a good solution. > Suggestion is always welcomed. > I guess it's on me to think of something cleaner, so if I do I'll pipe up. :) thanks, -- John Hubbard NVIDIA
On Tue, Dec 06, 2022 at 02:31:30PM -0800, John Hubbard wrote: > On 12/6/22 13:51, Peter Xu wrote: > > On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: > > > On 12/6/22 08:45, Peter Xu wrote: > > > > I've got a fixup attached. John, since this got your attention please also > > > > have a look too in case there's further issues. > > > > > > > > > > Well, one question: Normally, the pattern of "release_lock(A); call f(); > > > acquire_lock(A);" is tricky, because one must revalidate that the state > > > protected by A has not changed while the lock was released. However, in > > > this case, it's letting page fault handling proceed, which already > > > assumes that pages might be gone, so generally that seems OK. > > > > Yes it's tricky, but not as tricky in this case. > > > > I hope my documentation supplemented that (in the fixup patch): > > > > + * @hugetlb_entry: if set, called for each hugetlb entry. Note that > > + * currently the hook function is protected by hugetlb > > + * vma lock to make sure pte_t* and the spinlock is valid > > + * to access. If the hook function needs to yield the [1] > > So far so good... > > > + * thread or retake the vma lock for some reason, it > > + * needs to properly release the vma lock manually, > > + * and retake it before the function returns. > > ...but you can actually delete this second sentence. It does not add > any real information--clearly, if you must drop the lock, then you must > "manually" drop the lock. > > And it still ignores my original question, which I don't think I've > fully communicated. Basically, what can happen to the protected data > during the time when the lock is not held? I thought I answered this one at [1] above. If not, I can extend the answer. What can happen is some thread can firstly unshare the pmd pgtable page (e.g. by clearing the PUD entry in current mm), then release the pmd pgtable page (e.g. by unmapping it) even if current thread is still accessing it. It will cause use-after-free on the pmd pgtable page on this thread in various ways. One way to trigger this is when the current thread tries to take the pgtable lock and it'll trigger warning like the call stack referenced in the cover letter of this series: https://lore.kernel.org/r/20221129193526.3588187-1-peterx@redhat.com Please also feel free to read the reproducer attached in the cover letter, it has details on how this can trigger (even though it's so hard to trigger so I added a delay in the kernel to make it trigger). The idea should be the same. > > > > > The vma lock here makes sure the pte_t and the pgtable spinlock being > > stable. Without the lock, they're prone to be freed in parallel. > > > > Yes, but think about this: if the vma lock protects against the pte > going away, then: > > lock() > get a pte > unlock() > > ...let hmm_vma_fault() cond_resched() run... > > lock() > ...whoops, something else release the pte that I'd previously > retrieved. Here the pte_t* is never referenced again after hugetlb_entry() returned. The loop looks like: do { next = hugetlb_entry_end(h, addr, end); pte = hugetlb_walk(vma, addr & hmask, sz); if (pte) err = ops->hugetlb_entry(pte, hmask, addr, next, walk); else if (ops->pte_hole) err = ops->pte_hole(addr, next, -1, walk); if (err) break; } while (addr = next, addr != end); After hugetlb_entry() returned, we'll _never_ touch that pte again we got from either huge_pte_offset() or hugetlb_walk() after this patchset applied. If we touch it, it's a potential bug as you mentioned. But we didn't. 
Hope it explains. > > > > > > > However, I'm lagging behind on understanding what the vma lock actually > > > protects. It seems to be a hugetlb-specific protection for concurrent > > > freeing of the page tables? > > > > Not exactly freeing, but unsharing. Mike probably has more to say. The > > series is here: > > > > https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > > > > > If so, then running a page fault handler seems safe. If there's something > > > else it protects, then we might need to revalidate that after > > > re-acquiring the vma lock. > > > > Nothing to validate here. The only reason to take the vma lock is to match > > with the caller who assumes the lock taken, so either it'll be released > > very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > > > > ummm, see above. :) > > > > > > > Also, scattering hugetlb-specific locks throughout mm seems like an > > > unfortuate thing, I wonder if there is a longer term plan to Not Do > > > That? > > > > So far HMM is really the only one - normally hugetlb_entry() hook is pretty > > light, so not really throughout the whole mm yet. It's even not urgently > > needed for the other two places calling cond_sched(), I added it mostly > > just for completeness, and with the slight hope that maybe we can yield > > earlier for some pmd unsharers. > > > > But yes it's unfortunate, I just didn't come up with a good solution. > > Suggestion is always welcomed. > > > > I guess it's on me to think of something cleaner, so if I do I'll pipe > up. :) That'll be very much appricated. It's really that I don't know how to make this better, or I can rework the series as long as it hasn't land upstream. Thanks, -- Peter Xu
On 12/6/22 16:07, Peter Xu wrote: > I thought I answered this one at [1] above. If not, I can extend the > answer. [1] explains it, but it doesn't mention why it's safe to drop and reacquire. ... > > If we touch it, it's a potential bug as you mentioned. But we didn't. > > Hope it explains. I think it's OK after all, because hmm_vma_fault() does revalidate after it takes the vma lock, so that closes the loop that I was fretting over. I was just also worried that I'd missed some other place, but it looks like that's not the case. So, good. How about this incremental diff on top, as an attempt to clarify what's going on? Or is this too much wordage? Sometimes I write too many words: diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 1f7c2011f6cb..27a6df448ee5 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -21,13 +21,16 @@ struct mm_walk; * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. * Any folded depths (where PTRS_PER_P?D is equal to 1) * are skipped. - * @hugetlb_entry: if set, called for each hugetlb entry. Note that - * currently the hook function is protected by hugetlb - * vma lock to make sure pte_t* and the spinlock is valid - * to access. If the hook function needs to yield the - * thread or retake the vma lock for some reason, it - * needs to properly release the vma lock manually, - * and retake it before the function returns. + * @hugetlb_entry: if set, called for each hugetlb entry. This hook + * function is called with the vma lock held, in order to + * protect against a concurrent freeing of the pte_t* or + * the ptl. In some cases, the hook function needs to drop + * and retake the vma lock in order to avoid deadlocks + * while calling other functions. In such cases the hook + * function must either refrain from accessing the pte or + * ptl after dropping the vma lock, or else revalidate + * those items after re-acquiring the vma lock and before + * accessing them. * @test_walk: caller specific callback function to determine whether * we walk over the current vma or not. Returning 0 means * "do page table walk over the current vma", returning diff --git a/mm/hmm.c b/mm/hmm.c index dcd624f28bcf..b428f2011cfd 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -497,7 +497,13 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, spin_unlock(ptl); hugetlb_vma_unlock_read(vma); - /* hmm_vma_fault() can retake the vma lock */ + /* + * Avoid deadlock: drop the vma lock before calling + * hmm_vma_fault(), which will itself potentially take and drop + * the vma lock. This is also correct from a protection point of + * view, because there is no further use here of either pte or + * ptl after dropping the vma lock. + */ ret = hmm_vma_fault(addr, end, required_fault, walk); hugetlb_vma_lock_read(vma); return ret; >> I guess it's on me to think of something cleaner, so if I do I'll pipe >> up. :) > > That'll be very much appricated. > > It's really that I don't know how to make this better, or I can rework the > series as long as it hasn't land upstream. > It's always 10x easier to notice an imperfection, than it is to improve on it. :) thanks, -- John Hubbard NVIDIA
On Tue, Dec 06, 2022 at 06:38:54PM -0800, John Hubbard wrote: > On 12/6/22 16:07, Peter Xu wrote: > > I thought I answered this one at [1] above. If not, I can extend the > > answer. > > [1] explains it, but it doesn't mention why it's safe to drop and reacquire. > > ... > > > > If we touch it, it's a potential bug as you mentioned. But we didn't. > > > > Hope it explains. > > I think it's OK after all, because hmm_vma_fault() does revalidate after > it takes the vma lock, so that closes the loop that I was fretting over. > > I was just also worried that I'd missed some other place, but it looks > like that's not the case. > > So, good. > > How about this incremental diff on top, as an attempt to clarify what's > going on? Or is this too much wordage? Sometimes I write too many words: Nop, that all looks good, thanks. I'll apply them in my new post. > > > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h > index 1f7c2011f6cb..27a6df448ee5 100644 > --- a/include/linux/pagewalk.h > +++ b/include/linux/pagewalk.h > @@ -21,13 +21,16 @@ struct mm_walk; > * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. > * Any folded depths (where PTRS_PER_P?D is equal to 1) > * are skipped. > - * @hugetlb_entry: if set, called for each hugetlb entry. Note that > - * currently the hook function is protected by hugetlb > - * vma lock to make sure pte_t* and the spinlock is valid > - * to access. If the hook function needs to yield the > - * thread or retake the vma lock for some reason, it > - * needs to properly release the vma lock manually, > - * and retake it before the function returns. > + * @hugetlb_entry: if set, called for each hugetlb entry. This hook > + * function is called with the vma lock held, in order to > + * protect against a concurrent freeing of the pte_t* or > + * the ptl. In some cases, the hook function needs to drop > + * and retake the vma lock in order to avoid deadlocks > + * while calling other functions. In such cases the hook > + * function must either refrain from accessing the pte or > + * ptl after dropping the vma lock, or else revalidate > + * those items after re-acquiring the vma lock and before > + * accessing them. > * @test_walk: caller specific callback function to determine whether > * we walk over the current vma or not. Returning 0 means > * "do page table walk over the current vma", returning > diff --git a/mm/hmm.c b/mm/hmm.c > index dcd624f28bcf..b428f2011cfd 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -497,7 +497,13 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, > spin_unlock(ptl); > hugetlb_vma_unlock_read(vma); > - /* hmm_vma_fault() can retake the vma lock */ > + /* > + * Avoid deadlock: drop the vma lock before calling > + * hmm_vma_fault(), which will itself potentially take and drop > + * the vma lock. This is also correct from a protection point of > + * view, because there is no further use here of either pte or > + * ptl after dropping the vma lock. > + */ > ret = hmm_vma_fault(addr, end, required_fault, walk); > hugetlb_vma_lock_read(vma); > return ret; > > > > I guess it's on me to think of something cleaner, so if I do I'll pipe > > > up. :) > > > > That'll be very much appricated. > > > > It's really that I don't know how to make this better, or I can rework the > > series as long as it hasn't land upstream. > > > > It's always 10x easier to notice an imperfection, than it is to improve on > it. :) -- Peter Xu
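As a hypothetical example of the rule being documented here (the hook name and do_slow_work() are made up; the lock helpers are the real ones), a hugetlb_entry callback that needs to sleep would look roughly like this:

static int my_hugetlb_entry(pte_t *pte, unsigned long hmask,
			    unsigned long addr, unsigned long end,
			    struct mm_walk *walk)
{
	/* Safe: walk_hugetlb_range() holds the hugetlb vma read lock. */
	pte_t entry = huge_ptep_get(pte);

	if (!pte_present(entry))
		return 0;

	/* do_slow_work() stands in for anything that may sleep or fault. */
	hugetlb_vma_unlock_read(walk->vma);
	do_slow_work(addr);
	hugetlb_vma_lock_read(walk->vma);

	/*
	 * After retaking the lock, 'pte' may point into a pgtable page
	 * that was unshared and freed meanwhile; it must not be
	 * dereferenced again without revalidation.
	 */
	return 0;
}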
On 12/06/22 11:45, Peter Xu wrote: > On Mon, Dec 05, 2022 at 03:52:51PM -0800, John Hubbard wrote: > > On 12/5/22 15:33, Mike Kravetz wrote: > > > On 11/29/22 14:35, Peter Xu wrote: > > > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock > > > > to make sure the pgtable page will not be freed concurrently. > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > --- > > > > mm/pagewalk.c | 2 ++ > > > > 1 file changed, 2 insertions(+) > > > > > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > > > index 7f1c9b274906..d98564a7be57 100644 > > > > --- a/mm/pagewalk.c > > > > +++ b/mm/pagewalk.c > > > > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > > > const struct mm_walk_ops *ops = walk->ops; > > > > int err = 0; > > > > + hugetlb_vma_lock_read(vma); > > > > do { > > > > next = hugetlb_entry_end(h, addr, end); > > > > pte = huge_pte_offset(walk->mm, addr & hmask, sz); > > > > > > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with > > > the vma_lock held. I looked into the various hugetlb_entry routines, and > > > I am not sure about hmm_vma_walk_hugetlb_entry. It seems like it could > > > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault. If this > > > can happen, then we may have an issue as hugetlb_fault will also need to > > > acquire the vma_lock in read mode. > > Thanks for spotting that, Mike. > > I used to notice that path special but that's when I was still using RCU > locks who doesn't have the issue. Then I overlooked this one when > switchover. > > > > > > > I do not know the hmm code well enough to know if this may be an actual > > > issue? > > > > Oh, this sounds like a serious concern. If we add a new lock, and hold it > > during callbacks that also need to take it, that's not going to work out, > > right? > > > > And yes, hmm_range_fault() and related things do a good job of revealing > > this kind of deadlock. :) > > I've got a fixup attached. John, since this got your attention please also > have a look too in case there's further issues. > > Thanks, > > -- > Peter Xu Thanks Peter. I am good with the fixup. When combined with original, Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> -- Mike Kravetz > From 9ad1e65a31f51a0dc687cd9d6083b9e920d2da61 Mon Sep 17 00:00:00 2001 > From: Peter Xu <peterx@redhat.com> > Date: Tue, 6 Dec 2022 11:38:47 -0500 > Subject: [PATCH] fixup! 
mm/hugetlb: Make walk_hugetlb_range() safe to pmd > unshare > Content-type: text/plain > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > arch/s390/mm/gmap.c | 2 ++ > fs/proc/task_mmu.c | 2 ++ > include/linux/pagewalk.h | 8 +++++++- > mm/hmm.c | 8 +++++++- > 4 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c > index 8947451ae021..292a54c490d4 100644 > --- a/arch/s390/mm/gmap.c > +++ b/arch/s390/mm/gmap.c > @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr, > end = start + HPAGE_SIZE - 1; > __storage_key_init_range(start, end); > set_bit(PG_arch_1, &page->flags); > + hugetlb_vma_unlock_read(walk->vma); > cond_resched(); > + hugetlb_vma_lock_read(walk->vma); > return 0; > } > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 89338950afd3..d7155f3bb678 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1612,7 +1612,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, > frame++; > } > > + hugetlb_vma_unlock_read(walk->vma); > cond_resched(); > + hugetlb_vma_lock_read(walk->vma); > > return err; > } > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h > index 959f52e5867d..1f7c2011f6cb 100644 > --- a/include/linux/pagewalk.h > +++ b/include/linux/pagewalk.h > @@ -21,7 +21,13 @@ struct mm_walk; > * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. > * Any folded depths (where PTRS_PER_P?D is equal to 1) > * are skipped. > - * @hugetlb_entry: if set, called for each hugetlb entry > + * @hugetlb_entry: if set, called for each hugetlb entry. Note that > + * currently the hook function is protected by hugetlb > + * vma lock to make sure pte_t* and the spinlock is valid > + * to access. If the hook function needs to yield the > + * thread or retake the vma lock for some reason, it > + * needs to properly release the vma lock manually, > + * and retake it before the function returns. > * @test_walk: caller specific callback function to determine whether > * we walk over the current vma or not. Returning 0 means > * "do page table walk over the current vma", returning > diff --git a/mm/hmm.c b/mm/hmm.c > index 3850fb625dda..dcd624f28bcf 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -493,8 +493,14 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, > required_fault = > hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags); > if (required_fault) { > + int ret; > + > spin_unlock(ptl); > - return hmm_vma_fault(addr, end, required_fault, walk); > + hugetlb_vma_unlock_read(vma); > + /* hmm_vma_fault() can retake the vma lock */ > + ret = hmm_vma_fault(addr, end, required_fault, walk); > + hugetlb_vma_lock_read(vma); > + return ret; > } > > pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT); > -- > 2.37.3 >
On 29.11.22 20:35, Peter Xu wrote: > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Thanks, David / dhildenb
Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/rmap.h | 4 ++++
mm/page_vma_mapped.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..a50d18bb86aa 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -13,6 +13,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/memremap.h>
+#include <linux/hugetlb.h>
/*
* The anon_vma heads a list of private "related" vmas, to scan if
@@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
pte_unmap(pvmw->pte);
if (pvmw->ptl)
spin_unlock(pvmw->ptl);
+ /* This needs to be after unlock of the spinlock */
+ if (is_vm_hugetlb_page(pvmw->vma))
+ hugetlb_vma_unlock_read(pvmw->vma);
}
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..f94ec78b54ff 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
return not_found(pvmw);
+ hugetlb_vma_lock_read(vma);
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
- if (!pvmw->pte)
+ if (!pvmw->pte) {
+ hugetlb_vma_unlock_read(vma);
return false;
+ }
pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))
--
2.37.3
On 11/29/22 14:35, Peter Xu wrote: > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock > to make sure the pgtable page will not be freed concurrently. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/linux/rmap.h | 4 ++++ > mm/page_vma_mapped.c | 5 ++++- > 2 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index bd3504d11b15..a50d18bb86aa 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -13,6 +13,7 @@ > #include <linux/highmem.h> > #include <linux/pagemap.h> > #include <linux/memremap.h> > +#include <linux/hugetlb.h> > > /* > * The anon_vma heads a list of private "related" vmas, to scan if > @@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) > pte_unmap(pvmw->pte); > if (pvmw->ptl) > spin_unlock(pvmw->ptl); > + /* This needs to be after unlock of the spinlock */ > + if (is_vm_hugetlb_page(pvmw->vma)) > + hugetlb_vma_unlock_read(pvmw->vma); > } > > bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c > index 93e13fc17d3c..f94ec78b54ff 100644 > --- a/mm/page_vma_mapped.c > +++ b/mm/page_vma_mapped.c > @@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) > if (pvmw->pte) > return not_found(pvmw); > > + hugetlb_vma_lock_read(vma); > /* when pud is not present, pte will be NULL */ > pvmw->pte = huge_pte_offset(mm, pvmw->address, size); > - if (!pvmw->pte) > + if (!pvmw->pte) { > + hugetlb_vma_unlock_read(vma); > return false; > + } > > pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte); > if (!check_pte(pvmw)) I think this is going to cause try_to_unmap() to always fail for hugetlb shared pages. See try_to_unmap_one: while (page_vma_mapped_walk(&pvmw)) { ... if (folio_test_hugetlb(folio)) { ... /* * To call huge_pmd_unshare, i_mmap_rwsem must be * held in write mode. Caller needs to explicitly * do this outside rmap routines. * * We also must hold hugetlb vma_lock in write mode. * Lock order dictates acquiring vma_lock BEFORE * i_mmap_rwsem. We can only try lock here and fail * if unsuccessful. */ if (!anon) { VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); if (!hugetlb_vma_trylock_write(vma)) { page_vma_mapped_walk_done(&pvmw); ret = false; } Can not think of a great solution right now. -- Mike Kravetz
On 12/05/22 15:52, Mike Kravetz wrote: > On 11/29/22 14:35, Peter Xu wrote: > > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock > > to make sure the pgtable page will not be freed concurrently. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > include/linux/rmap.h | 4 ++++ > > mm/page_vma_mapped.c | 5 ++++- > > 2 files changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > > index bd3504d11b15..a50d18bb86aa 100644 > > --- a/include/linux/rmap.h > > +++ b/include/linux/rmap.h > > @@ -13,6 +13,7 @@ > > #include <linux/highmem.h> > > #include <linux/pagemap.h> > > #include <linux/memremap.h> > > +#include <linux/hugetlb.h> > > > > /* > > * The anon_vma heads a list of private "related" vmas, to scan if > > @@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) > > pte_unmap(pvmw->pte); > > if (pvmw->ptl) > > spin_unlock(pvmw->ptl); > > + /* This needs to be after unlock of the spinlock */ > > + if (is_vm_hugetlb_page(pvmw->vma)) > > + hugetlb_vma_unlock_read(pvmw->vma); > > } > > > > bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c > > index 93e13fc17d3c..f94ec78b54ff 100644 > > --- a/mm/page_vma_mapped.c > > +++ b/mm/page_vma_mapped.c > > @@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) > > if (pvmw->pte) > > return not_found(pvmw); > > > > + hugetlb_vma_lock_read(vma); > > /* when pud is not present, pte will be NULL */ > > pvmw->pte = huge_pte_offset(mm, pvmw->address, size); > > - if (!pvmw->pte) > > + if (!pvmw->pte) { > > + hugetlb_vma_unlock_read(vma); > > return false; > > + } > > > > pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte); > > if (!check_pte(pvmw)) > > I think this is going to cause try_to_unmap() to always fail for hugetlb > shared pages. See try_to_unmap_one: > > while (page_vma_mapped_walk(&pvmw)) { > ... > if (folio_test_hugetlb(folio)) { > ... > /* > * To call huge_pmd_unshare, i_mmap_rwsem must be > * held in write mode. Caller needs to explicitly > * do this outside rmap routines. > * > * We also must hold hugetlb vma_lock in write mode. > * Lock order dictates acquiring vma_lock BEFORE > * i_mmap_rwsem. We can only try lock here and fail > * if unsuccessful. > */ > if (!anon) { > VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); > if (!hugetlb_vma_trylock_write(vma)) { > page_vma_mapped_walk_done(&pvmw); > ret = false; > } > > > Can not think of a great solution right now. Thought of this last night ... Perhaps we do not need vma_lock in this code path (not sure about all page_vma_mapped_walk calls). Why? We already hold i_mmap_rwsem. -- Mike Kravetz
On Tue, Dec 06, 2022 at 09:10:00AM -0800, Mike Kravetz wrote:
> On 12/05/22 15:52, Mike Kravetz wrote:
> > On 11/29/22 14:35, Peter Xu wrote:
> > > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> > > to make sure the pgtable page will not be freed concurrently.
[...]
> > I think this is going to cause try_to_unmap() to always fail for hugetlb
> > shared pages.  See try_to_unmap_one:
[...]
> > Can not think of a great solution right now.
> 
> Thought of this last night ...
> 
> Perhaps we do not need vma_lock in this code path (not sure about all
> page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.

Exactly.  The only concern is when it's not in a rmap.

I'm actually preparing something that adds a new flag to PVMW, like:

  #define PVMW_HUGETLB_NEEDS_LOCK	(1 << 2)

But maybe we don't need that at all, since I had a closer look and the only
outliers not using a rmap are:

	__replace_page
	write_protect_page

I'm pretty sure ksm doesn't have hugetlb involved, then the other one is
uprobe (uprobe_write_opcode).  I think it's the same.  If it's true, we can
simply drop this patch.  Then we also have hugetlb_walk and the lock checks
there guarantee that we're safe anyways.

Potentially we can document this fact, which I also attached a comment
patch just for it to be appended to the end of the patchset.

Mike, let me know what you think.

Andrew, if this patch is to be dropped then the last patch may not cleanly
apply.  Let me know if you want a full repost of the things.

Thanks,

-- 
Peter Xu
On Tue, Dec 06, 2022 at 12:39:53PM -0500, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 09:10:00AM -0800, Mike Kravetz wrote:
> > Thought of this last night ...
> > 
> > Perhaps we do not need vma_lock in this code path (not sure about all
> > page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.
> 
> Exactly.  The only concern is when it's not in a rmap.
> 
> Potentially we can document this fact, which I also attached a comment
> patch just for it to be appended to the end of the patchset.
> 
> Mike, let me know what you think.
> 
> Andrew, if this patch is to be dropped then the last patch may not cleanly
> apply.  Let me know if you want a full repost of the things.

The document patch that can be appended to the end of this series is
attached.  I referenced hugetlb_walk() so it needs to be the last patch.

-- 
Peter Xu

From 754c2180804e9e86accf131573cbd956b8c62829 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 6 Dec 2022 12:36:04 -0500
Subject: [PATCH] mm/hugetlb: Document why page_vma_mapped_walk() is safe to
 walk
Content-type: text/plain

Taking the vma lock here is not needed for now because all potential
hugetlb walkers here should have i_mmap_rwsem held.  Document the fact.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/page_vma_mapped.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		/* The only possible mapping was handled on last iteration */
 		if (pvmw->pte)
 			return not_found(pvmw);
-
-		/* when pud is not present, pte will be NULL */
+		/*
+		 * NOTE: we don't need explicit lock here to walk the
+		 * hugetlb pgtable because either (1) potential callers of
+		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+		 * When one day this rule breaks, one will get a warning
+		 * in hugetlb_walk(), and then we'll figure out what to do.
+		 */
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;
-- 
2.37.3
On 12/06/22 12:43, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 12:39:53PM -0500, Peter Xu wrote:
> > Potentially we can document this fact, which I also attached a comment
> > patch just for it to be appended to the end of the patchset.
> 
> The document patch that can be appended to the end of this series is
> attached.  I referenced hugetlb_walk() so it needs to be the last patch.

Agree with dropping this patch and adding the document patch below.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Also, happy we have the warnings in place to catch incorrect locking.
-- 
Mike Kravetz
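
For context, the "we already hold i_mmap_rwsem" argument above can be seen
from the file rmap walk.  Below is a heavily simplified sketch (written
from memory, not the verbatim mm/rmap.c code; rmap_walk_file_sketch() is a
made-up name) showing that every rmap_one() callback, and therefore
page_vma_mapped_walk() on a file-backed hugetlb vma, runs with
i_mmap_rwsem held for read:

	static void rmap_walk_file_sketch(struct folio *folio,
					  struct rmap_walk_control *rwc)
	{
		struct address_space *mapping = folio_mapping(folio);
		pgoff_t first = folio_pgoff(folio);
		pgoff_t last = first + folio_nr_pages(folio) - 1;
		struct vm_area_struct *vma;

		i_mmap_lock_read(mapping);	/* takes i_mmap_rwsem for read */
		vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
			unsigned long addr = vma_address(&folio->page, vma);

			/*
			 * rmap_one() ends up in page_vma_mapped_walk() with
			 * i_mmap_rwsem still held, which keeps any shared
			 * pmd page from being freed underneath the walk.
			 */
			if (!rwc->rmap_one(folio, vma, addr, rwc->arg))
				break;
		}
		i_mmap_unlock_read(mapping);
	}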
On 29.11.22 20:35, Peter Xu wrote:
> Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
[...]
> @@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  		pte_unmap(pvmw->pte);
>  	if (pvmw->ptl)
>  		spin_unlock(pvmw->ptl);
> +	/* This needs to be after unlock of the spinlock */
> +	if (is_vm_hugetlb_page(pvmw->vma))
> +		hugetlb_vma_unlock_read(pvmw->vma);
>  }
[...]
> @@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  		if (pvmw->pte)
>  			return not_found(pvmw);
> 
> +		hugetlb_vma_lock_read(vma);
>  		/* when pud is not present, pte will be NULL */
>  		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
> -		if (!pvmw->pte)
> +		if (!pvmw->pte) {
> +			hugetlb_vma_unlock_read(vma);
>  			return false;
> +		}
> 
>  		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
>  		if (!check_pte(pvmw))

Looking at code like mm/damon/paddr.c:__damon_pa_mkold() and reading the
doc of page_vma_mapped_walk(), this might be broken.

Can't we get page_vma_mapped_walk() called multiple times?  Wouldn't we
have to remember that we already took the lock to not lock twice, and to
see if we really have to unlock in page_vma_mapped_walk_done()?

-- 
Thanks,

David / dhildenb
On Wed, Nov 30, 2022 at 05:18:45PM +0100, David Hildenbrand wrote:
> On 29.11.22 20:35, Peter Xu wrote:
> > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
[...]
> Looking at code like mm/damon/paddr.c:__damon_pa_mkold() and reading the
> doc of page_vma_mapped_walk(), this might be broken.
> 
> Can't we get page_vma_mapped_walk() called multiple times?

Yes it normally can, but not for hugetlbfs?  Feel free to check:

	if (unlikely(is_vm_hugetlb_page(vma))) {
		...
		/* The only possible mapping was handled on last iteration */
		if (pvmw->pte)
			return not_found(pvmw);
	}

> Wouldn't we have to remember that we already took the lock to not lock
> twice, and to see if we really have to unlock in
> page_vma_mapped_walk_done()?

-- 
Peter Xu
On 30.11.22 17:32, Peter Xu wrote:
> On Wed, Nov 30, 2022 at 05:18:45PM +0100, David Hildenbrand wrote:
>> Looking at code like mm/damon/paddr.c:__damon_pa_mkold() and reading the
>> doc of page_vma_mapped_walk(), this might be broken.
>>
>> Can't we get page_vma_mapped_walk() called multiple times?
> 
> Yes it normally can, but not for hugetlbfs?  Feel free to check:
> 
> 	if (unlikely(is_vm_hugetlb_page(vma))) {
> 		...
> 		/* The only possible mapping was handled on last iteration */
> 		if (pvmw->pte)
> 			return not_found(pvmw);
> 	}

Ah, I see, thanks.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb
huge_pte_offset() is the main walker function for hugetlb pgtables.  The
name does not really reflect what it does, though.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which uses huge_pte_offset() internally, and assert that the proper locks
are held when walking the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.
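
As a usage illustration (a sketch only, not part of the change below;
walk_one() is a made-up caller name), a conforming user of the new helper
is expected to look roughly like this, holding either the hugetlb vma lock
or i_mmap_rwsem across the walk:

	static void walk_one(struct vm_area_struct *vma, unsigned long addr)
	{
		struct hstate *h = hstate_vma(vma);
		spinlock_t *ptl;
		pte_t *ptep;

		/*
		 * Hold the vma lock (or i_mmap_rwsem) so that a concurrent
		 * huge_pmd_unshare() cannot free a shared PUD page while we
		 * dereference the returned ptep.
		 */
		hugetlb_vma_lock_read(vma);
		ptep = hugetlb_walk(vma, addr & huge_page_mask(h),
				    huge_page_size(h));
		if (ptep) {
			ptl = huge_pte_lock(h, vma->vm_mm, ptep);
			/* ... inspect or modify the entry here ... */
			spin_unlock(ptl);
		}
		hugetlb_vma_unlock_read(vma);
	}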
Signed-off-by: Peter Xu <peterx@redhat.com>
---
fs/hugetlbfs/inode.c | 4 +---
fs/userfaultfd.c | 6 ++----
include/linux/hugetlb.h | 37 +++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 34 ++++++++++++++--------------------
mm/page_vma_mapped.c | 2 +-
mm/pagewalk.c | 4 +---
6 files changed, 56 insertions(+), 31 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fdb16246f46e..48f1a8ad2243 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
{
pte_t *ptep, pte;
- ptep = huge_pte_offset(vma->vm_mm, addr,
- huge_page_size(hstate_vma(vma)));
-
+ ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
if (!ptep)
return false;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a602f008dde5..f31fe1a9f4c5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
unsigned long flags,
unsigned long reason)
{
- struct mm_struct *mm = ctx->mm;
pte_t *ptep, pte;
bool ret = true;
- mmap_assert_locked(mm);
-
- ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+ mmap_assert_locked(ctx->mm);
+ ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 81efd9b9baa2..1a51c45fdf2e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -196,6 +196,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
* huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
* Returns the pte_t* if found, or NULL if the address is not mapped.
*
+ * IMPORTANT: we should normally not directly call this function, instead
+ * this is only a common interface to implement arch-specific walker.
+ * Please consider using the hugetlb_walk() helper to make sure the
+ * correct locking is satisfied.
+ *
* Since this function will walk all the pgtable pages (including not only
* high-level pgtable page, but also PUD entry that can be unshared
* concurrently for VM_SHARED), the caller of this function should be
@@ -1229,4 +1234,36 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#endif
+static inline bool
+__vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+ vma->vm_private_data;
+}
+
+/*
+ * Safe version of huge_pte_offset() to check the locks. See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
+ struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+ /*
+ * If pmd sharing possible, locking needed to safely walk the
+ * hugetlb pgtables. More information can be found at the comment
+ * above huge_pte_offset() in the same file.
+ *
+ * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
+ */
+ if (__vma_shareable_flags_pmd(vma))
+ WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
+ !lockdep_is_held(
+ &vma->vm_file->f_mapping->i_mmap_rwsem));
+#endif
+ return huge_pte_offset(vma->vm_mm, addr, sz);
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df645a5824e3..05867e82b467 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4816,7 +4816,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
/*
* For shared mappings the vma lock must be held before
- * calling huge_pte_offset in the src vma. Otherwise, the
+ * calling hugetlb_walk() in the src vma. Otherwise, the
* returned ptep could go away if part of a shared pmd and
* another thread calls huge_pmd_unshare.
*/
@@ -4826,7 +4826,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
last_addr_mask = hugetlb_mask_last_page(h);
for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
- src_pte = huge_pte_offset(src, addr, sz);
+ src_pte = hugetlb_walk(src_vma, addr, sz);
if (!src_pte) {
addr |= last_addr_mask;
continue;
@@ -5030,7 +5030,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
- src_pte = huge_pte_offset(mm, old_addr, sz);
+ src_pte = hugetlb_walk(vma, old_addr, sz);
if (!src_pte) {
old_addr |= last_addr_mask;
new_addr |= last_addr_mask;
@@ -5093,7 +5093,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -5406,7 +5406,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(vma);
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
@@ -5444,7 +5444,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* before the page tables are altered
*/
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -5841,7 +5841,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
*
- * ptep could have already be assigned via huge_pte_offset. That
+ * ptep could have already be assigned via hugetlb_walk(). That
* is OK, as huge_pte_alloc will return the same value unless
* something has changed.
*/
@@ -6233,7 +6233,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return NULL;
hugetlb_vma_lock_read(vma);
- pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+ pte = hugetlb_walk(vma, haddr, huge_page_size(h));
if (!pte)
goto out_unlock;
@@ -6298,8 +6298,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*
* Note that page table lock is not held when pte is null.
*/
- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
- huge_page_size(h));
+ pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
+ huge_page_size(h));
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -6485,7 +6485,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, psize);
+ ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -6863,12 +6863,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
*end = ALIGN(*end, PUD_SIZE);
}
-static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
-{
- return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
- vma->vm_private_data;
-}
-
void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
@@ -7034,8 +7028,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
- spte = huge_pte_offset(svma->vm_mm, saddr,
- vma_mmu_pagesize(svma));
+ spte = hugetlb_walk(svma, saddr,
+ vma_mmu_pagesize(svma));
if (spte) {
get_page(virt_to_page(spte));
break;
@@ -7394,7 +7388,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f94ec78b54ff..bb782dea4b42 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -171,7 +171,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
hugetlb_vma_lock_read(vma);
/* when pud is not present, pte will be NULL */
- pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
+ pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte) {
hugetlb_vma_unlock_read(vma);
return false;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d98564a7be57..cb23f8a15c13 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
hugetlb_vma_lock_read(vma);
do {
next = hugetlb_entry_end(h, addr, end);
- pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
+ pte = hugetlb_walk(vma, addr & hmask, sz);
if (pte)
err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
else if (ops->pte_hole)
err = ops->pte_hole(addr, next, -1, walk);
-
if (err)
break;
} while (addr = next, addr != end);
--
2.37.3
On Tue, Nov 29, 2022 at 02:35:26PM -0500, Peter Xu wrote:
> +static inline pte_t *
> +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> +{
> +#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> +	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> +
> +	/*
> +	 * If pmd sharing possible, locking needed to safely walk the
> +	 * hugetlb pgtables.  More information can be found at the comment
> +	 * above huge_pte_offset() in the same file.
> +	 *
> +	 * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> +	 */
> +	if (__vma_shareable_flags_pmd(vma))
> +		WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> +			     !lockdep_is_held(
> +				 &vma->vm_file->f_mapping->i_mmap_rwsem));
> +#endif

FYI, in next-20221130 there is a compile error here due to this commit:

In file included from security/commoncap.c:19:
./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
                WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
                              ~~~~~~~~^
./include/linux/lockdep.h:286:47: note: expanded from macro 'lockdep_is_held'
#define lockdep_is_held(lock)           lock_is_held(&(lock)->dep_map)
                                                      ^~~~

- Eric
On Tue, Nov 29, 2022 at 09:18:08PM -0800, Eric Biggers wrote:
> On Tue, Nov 29, 2022 at 02:35:26PM -0500, Peter Xu wrote:
> > +static inline pte_t *
> > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> > +{
> > +#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
[...]
> 
> FYI, in next-20221130 there is a compile error here due to this commit:
> 
> In file included from security/commoncap.c:19:
> ./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
>                 WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
>                               ~~~~~~~~^
> ./include/linux/lockdep.h:286:47: note: expanded from macro 'lockdep_is_held'
> #define lockdep_is_held(lock)           lock_is_held(&(lock)->dep_map)
>                                                       ^~~~

This probably means the config has:

  CONFIG_HUGETLB_PAGE=n
  CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y

And I'm surprised we didn't have a dependency that ARCH_WANT_HUGE_PMD_SHARE
should depend on HUGETLB_PAGE already.  Mike, what do you think?

I've also attached a quick fix for this patch to be squashed in.  Hope it
works.

Thanks,

-- 
Peter Xu

From 9787a7f5492ca251fcce5c09bd7e4a80ac157726 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 30 Nov 2022 10:33:44 -0500
Subject: [PATCH] fixup! mm/hugetlb: Introduce hugetlb_walk()
Content-type: text/plain

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1a51c45fdf2e..ec2a1f93b12d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1248,7 +1248,8 @@ __vma_shareable_flags_pmd(struct vm_area_struct *vma)
 static inline pte_t *
 hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
 {
-#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
+#if defined(CONFIG_HUGETLB_PAGE) && \
+	defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
 	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 	/*
-- 
2.37.3
On 11/30/22 10:37, Peter Xu wrote:
> On Tue, Nov 29, 2022 at 09:18:08PM -0800, Eric Biggers wrote:
> > FYI, in next-20221130 there is a compile error here due to this commit:
> > 
> > In file included from security/commoncap.c:19:
> > ./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
> >                 WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> >                               ~~~~~~~~^
> 
> This probably means the config has:
> 
>   CONFIG_HUGETLB_PAGE=n
>   CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
> 
> And I'm surprised we didn't have a dependency that ARCH_WANT_HUGE_PMD_SHARE
> should depend on HUGETLB_PAGE already.  Mike, what do you think?

Yes, ARCH_WANT_HUGE_PMD_SHARE should probably depend on HUGETLB_PAGE as it
makes no sense without it.  There is also ARCH_WANT_GENERAL_HUGETLB that
seems to fall into the same category.

> I've also attached a quick fix for this patch to be squashed in.  Hope it
> works.

Let's go with this quick fix for now since the only issue is in the new
code, and spend more time on config dependencies later.  For the original
patch and the quick fix,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz
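
For illustration, a sketch of the Kconfig dependency suggested above; the
symbol is actually defined per architecture (e.g. in the arch Kconfig
files), so the surrounding lines here are an assumption rather than a hunk
actually posted in this thread:

	config ARCH_WANT_HUGE_PMD_SHARE
		def_bool y
		depends on HUGETLB_PAGE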