[PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by Alistair Popple 1 year, 2 months ago
Main updates since v2:

 - Rename the DAX-specific dax_insert_XXX functions to vmf_insert_XXX
   and have them take the vmf struct.

 - Separate out the device DAX changes.

 - Restore the page share mapping counting and associated warnings.

 - Rework truncate to require filesystems to have previously called
   dax_break_layout() to remove the address space mapping for a page
   (a rough usage sketch follows this list). This found several bugs
   which are fixed by the first half of the series. The motivation for
   this was initially to allow the FS DAX page-cache mappings to hold
   a reference on the page.

   However that turned out to be a dead-end (see the comments on patch
   21), but it found several bugs and I think overall it is an
   improvement, so I have left it here.
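
   A rough sketch of the resulting contract (dax_break_layout() is the
   helper named above, but the signature and the fs_* names here are
   made up for illustration; patch 5 defines the real interface):

        /* Illustrative only -- not the literal interface. */
        static int fs_break_dax_layouts(struct inode *inode)
        {
                /*
                 * Wait for every DAX page in the range to become idle,
                 * then remove its address space (page-cache) entry so
                 * a later truncate finds no DAX entries left behind.
                 */
                return dax_break_layout(inode, 0, LLONG_MAX,
                                        fs_wait_dax_page);
        }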

Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular, pages are considered free when the refcount
hits one rather than zero, and refcounts are not taken when mapping
the page.
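
Concretely, the old convention amounts to something like this (a
minimal sketch, not the literal kernel code):

        /* Historical ZONE_DEVICE rule: a refcount of one means idle. */
        static bool dax_page_idle(struct page *page)
        {
                return page_ref_count(page) == 1;
        }

After this series a page is idle at a refcount of zero and the mapping
and GUP paths take and drop real references, so the generic helpers
apply unchanged.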

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap()). However, there doesn't seem to be any reason why
FS DAX pages need their own reference counting scheme.
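
For example, the GUP fast path currently has to special-case devmap
PTEs along these lines (abbreviated from the mm/gup.c code this series
removes; pgmap, nr, nr_start and pages come from the surrounding
loop):

        if (pte_devmap(pte)) {
                if (unlikely(flags & FOLL_LONGTERM))
                        goto pte_unmap;

                /* Pin the pagemap so the device memory backing this
                 * PTE cannot be removed while we hold the page. */
                pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
                if (unlikely(!pgmap)) {
                        undo_dev_pagemap(nr, nr_start, flags, pages);
                        goto pte_unmap;
                }
        }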

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular, pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable software-defined PTE bit on
architectures that have devmap PTE bits defined.
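
The shape of the simplification at call sites is:

-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (pmd_trans_huge(*pmd))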

It also almost certainly allows further clean-up of the devmap-managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages, which is one of my
primary motivations for doing this work.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Cc: lina@asahilina.net
Cc: zhang.lyra@gmail.com
Cc: gerald.schaefer@linux.ibm.com
Cc: dan.j.williams@intel.com
Cc: vishal.l.verma@intel.com
Cc: dave.jiang@intel.com
Cc: logang@deltatee.com
Cc: bhelgaas@google.com
Cc: jack@suse.cz
Cc: jgg@ziepe.ca
Cc: catalin.marinas@arm.com
Cc: will@kernel.org
Cc: mpe@ellerman.id.au
Cc: npiggin@gmail.com
Cc: dave.hansen@linux.intel.com
Cc: ira.weiny@intel.com
Cc: willy@infradead.org
Cc: djwong@kernel.org
Cc: tytso@mit.edu
Cc: linmiaohe@huawei.com
Cc: david@redhat.com
Cc: peterx@redhat.com
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: jhubbard@nvidia.com
Cc: hch@lst.de
Cc: david@fromorbit.com

Alistair Popple (25):
  fuse: Fix dax truncate/punch_hole fault path
  fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
  fs/dax: Don't skip locked entries when scanning entries
  fs/dax: Refactor wait for dax idle page
  fs/dax: Create a common implementation to break DAX layouts
  fs/dax: Always remove DAX page-cache entries when breaking layouts
  fs/dax: Ensure all pages are idle prior to filesystem unmount
  fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
  mm/gup.c: Remove redundant check for PCI P2PDMA page
  pci/p2pdma: Don't initialise page refcount to one
  mm: Allow compound zone device pages
  mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
  mm/memory: Add vmf_insert_page_mkwrite()
  huge_memory: Allow mappings of PUD sized pages
  huge_memory: Allow mappings of PMD sized pages
  memremap: Add is_device_dax_page() and is_fsdax_page() helpers
  gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  proc/task_mmu: Ignore ZONE_DEVICE pages
  memcontrol-v1: Ignore ZONE_DEVICE pages
  mm/mlock: Skip ZONE_DEVICE PMDs during mlock
  fs/dax: Properly refcount fs dax pages
  device/dax: Properly refcount device dax pages when mapping
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits
  Revert "riscv: mm: Add support for ZONE_DEVICE"

 Documentation/mm/arch_pgtable_helpers.rst     |   6 +-
 arch/arm64/Kconfig                            |   1 +-
 arch/arm64/include/asm/pgtable-prot.h         |   1 +-
 arch/arm64/include/asm/pgtable.h              |  24 +-
 arch/powerpc/Kconfig                          |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  52 +---
 arch/powerpc/include/asm/book3s/64/radix.h    |  14 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c            |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   5 +-
 arch/powerpc/mm/pgtable.c                     |   2 +-
 arch/riscv/Kconfig                            |   1 +-
 arch/riscv/include/asm/pgtable-64.h           |  20 +-
 arch/riscv/include/asm/pgtable-bits.h         |   1 +-
 arch/riscv/include/asm/pgtable.h              |  17 +-
 arch/x86/Kconfig                              |   1 +-
 arch/x86/include/asm/pgtable.h                |  51 +---
 arch/x86/include/asm/pgtable_types.h          |   5 +-
 drivers/dax/device.c                          |  15 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |   3 +-
 drivers/nvdimm/pmem.c                         |   4 +-
 drivers/pci/p2pdma.c                          |  19 +-
 fs/dax.c                                      | 354 ++++++++++++++-----
 fs/ext4/inode.c                               |  43 +--
 fs/fuse/dax.c                                 |  35 +--
 fs/fuse/virtio_fs.c                           |   3 +-
 fs/proc/task_mmu.c                            |  18 +-
 fs/userfaultfd.c                              |   2 +-
 fs/xfs/xfs_inode.c                            |  40 +-
 fs/xfs/xfs_inode.h                            |   3 +-
 fs/xfs/xfs_super.c                            |  18 +-
 include/linux/dax.h                           |  23 +-
 include/linux/huge_mm.h                       |  22 +-
 include/linux/memremap.h                      |  28 +-
 include/linux/migrate.h                       |   4 +-
 include/linux/mm.h                            |  40 +--
 include/linux/mm_types.h                      |  14 +-
 include/linux/mmzone.h                        |   8 +-
 include/linux/page-flags.h                    |   6 +-
 include/linux/pfn_t.h                         |  20 +-
 include/linux/pgtable.h                       |  21 +-
 include/linux/rmap.h                          |  15 +-
 lib/test_hmm.c                                |   3 +-
 mm/Kconfig                                    |   4 +-
 mm/debug_vm_pgtable.c                         |  59 +---
 mm/gup.c                                      | 176 +---------
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              | 233 ++++++++-----
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   2 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memcontrol-v1.c                            |   2 +-
 mm/memory-failure.c                           |   6 +-
 mm/memory.c                                   | 126 ++++---
 mm/memremap.c                                 |  59 +--
 mm/migrate_device.c                           |   9 +-
 mm/mlock.c                                    |   2 +-
 mm/mm_init.c                                  |  23 +-
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   5 +-
 mm/page_vma_mapped.c                          |   5 +-
 mm/pagewalk.c                                 |   8 +-
 mm/pgtable-generic.c                          |   7 +-
 mm/rmap.c                                     |  49 +++-
 mm/swap.c                                     |   2 +-
 mm/truncate.c                                 |  12 +-
 mm/userfaultfd.c                              |   5 +-
 mm/vmscan.c                                   |   5 +-
 70 files changed, 886 insertions(+), 920 deletions(-)

base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
-- 
git-series 0.9.1
Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by Dan Williams 1 year, 1 month ago
[ add akpm and sfr for next steps ]

Alistair Popple wrote:
> [...]

So this is feeling ready for -next exposure, and ideally merged for v6.14. I
see the comments from John and Bjorn and that you were going to respin for
that, but if it's just those details they can probably be handled
incrementally.

Alistair, are you ready for this to hit -next?

As for which tree...

Andrew, we could take this through -mm, but my first instinct would be to try
to take it through nvdimm.git mainly to offload any conflict wrangling work and
small fixups which are likely to be an ongoing trickle.

However, I am not going to put up much of a fight if others prefer this go
through -mm.

Thoughts?

Here is the current conflict diff vs -next:


commit 2f678b756ddf2c4a0fad7819442c09d3c8b52e4e
Merge: 4176cf5c5651 0f26a45ef326
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Fri Dec 13 17:37:32 2024 -0800

    Merge branch 'for-6.14/dax' into for-6.14/dax-next
    
    Fixup conflicts with pte_devmap removal and other misc conflicts.

diff --cc arch/arm64/Kconfig
index 39310a484d2d,1d90ab98af3c..81855d1c822c
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@@ -40,8 -38,6 +40,7 @@@ config ARM6
  	select ARCH_HAS_MEM_ENCRYPT
  	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 +	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
- 	select ARCH_HAS_PTE_DEVMAP
  	select ARCH_HAS_PTE_SPECIAL
  	select ARCH_HAS_HW_PTE_YOUNG
  	select ARCH_HAS_SETUP_DMA_OPS
diff --cc arch/riscv/Kconfig
index 7d5718667e39,2475c5b63c34..c285250a8ea8
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@@ -41,9 -39,7 +41,8 @@@ config RISC
  	select ARCH_HAS_MMIOWB
  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
  	select ARCH_HAS_PMEM_API
 +	select ARCH_HAS_PREEMPT_LAZY
  	select ARCH_HAS_PREPARE_SYNC_CORE_CMD
- 	select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
  	select ARCH_HAS_PTE_SPECIAL
  	select ARCH_HAS_SET_DIRECT_MAP if MMU
  	select ARCH_HAS_SET_MEMORY if MMU
diff --cc arch/x86/Kconfig
index 77f001c6a567,6a1a06f6a98a..acac373f5097
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@@ -96,8 -93,6 +96,7 @@@ config X8
  	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
  	select ARCH_HAS_PMEM_API		if X86_64
 +	select ARCH_HAS_PREEMPT_LAZY
- 	select ARCH_HAS_PTE_DEVMAP		if X86_64
  	select ARCH_HAS_PTE_SPECIAL
  	select ARCH_HAS_HW_PTE_YOUNG
  	select ARCH_HAS_NONLEAF_PMD_YOUNG	if PGTABLE_LEVELS > 2
diff --cc arch/x86/include/asm/pgtable_types.h
index 4b804531b03c,4a13b76b9b97..e4c7b519d962
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@@ -33,15 -33,12 +33,14 @@@
  #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
  #define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
- #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
  
  #ifdef CONFIG_X86_64
 -#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
 +#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
 +#define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW5 /* No PTI shadow (root PGD) */
  #else
  /* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
 -#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW2 /* Saved Dirty bit */
 +#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW2 /* Saved Dirty bit (leaf) */
 +#define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW2 /* No PTI shadow (root PGD) */
  #endif
  
  /* If _PAGE_BIT_PRESENT is clear, we use these: */
diff --cc include/linux/rmap.h
index 683a04088f3f,4a811c5e9294..07a99abcaf2f
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@@ -192,10 -192,11 +192,11 @@@ typedef int __bitwise rmap_t
  enum rmap_level {
  	RMAP_LEVEL_PTE = 0,
  	RMAP_LEVEL_PMD,
+ 	RMAP_LEVEL_PUD,
  };
  
 -static inline void __folio_rmap_sanity_checks(struct folio *folio,
 -		struct page *page, int nr_pages, enum rmap_level level)
 +static inline void __folio_rmap_sanity_checks(const struct folio *folio,
 +		const struct page *page, int nr_pages, enum rmap_level level)
  {
  	/* hugetlb folios are handled separately. */
  	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
diff --cc mm/internal.h
index 39227887e47b,50ff42f7b783..c4df0ad37423
--- a/mm/internal.h
+++ b/mm/internal.h
@@@ -733,9 -687,8 +733,7 @@@ static inline void prep_compound_tail(s
  	set_page_private(p, 0);
  }
  
- extern void prep_compound_page(struct page *page, unsigned int order);
- 
 -extern void post_alloc_hook(struct page *page, unsigned int order,
 -					gfp_t gfp_flags);
 +void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
  extern bool free_pages_prepare(struct page *page, unsigned int order);
  
  extern int user_min_free_kbytes;
diff --cc mm/madvise.c
index 49f3a75046f6,ff139e57cca2..1f4c99ee5c82
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@@ -1030,214 -1017,6 +1030,214 @@@ static long madvise_remove(struct vm_ar
  	return error;
  }
  
 +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
 +{
 +	vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB;
 +
 +	/*
 +	 * A user could lock after setting a guard range but that's fine, as
 +	 * they'd not be able to fault in. The issue arises when we try to zap
 +	 * existing locked VMAs. We don't want to do that.
 +	 */
 +	if (!allow_locked)
 +		disallowed |= VM_LOCKED;
 +
 +	if (!vma_is_anonymous(vma))
 +		return false;
 +
 +	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
 +		return false;
 +
 +	return true;
 +}
 +
 +static bool is_guard_pte_marker(pte_t ptent)
 +{
 +	return is_pte_marker(ptent) &&
 +		is_guard_swp_entry(pte_to_swp_entry(ptent));
 +}
 +
 +static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
 +				   unsigned long next, struct mm_walk *walk)
 +{
 +	pud_t pudval = pudp_get(pud);
 +
 +	/* If huge return >0 so we abort the operation + zap. */
- 	return pud_trans_huge(pudval) || pud_devmap(pudval);
++	return pud_trans_huge(pudval);
 +}
 +
 +static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
 +				   unsigned long next, struct mm_walk *walk)
 +{
 +	pmd_t pmdval = pmdp_get(pmd);
 +
 +	/* If huge return >0 so we abort the operation + zap. */
- 	return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
++	return pmd_trans_huge(pmdval);
 +}
 +
 +static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
 +				   unsigned long next, struct mm_walk *walk)
 +{
 +	pte_t pteval = ptep_get(pte);
 +	unsigned long *nr_pages = (unsigned long *)walk->private;
 +
 +	/* If there is already a guard page marker, we have nothing to do. */
 +	if (is_guard_pte_marker(pteval)) {
 +		(*nr_pages)++;
 +
 +		return 0;
 +	}
 +
 +	/* If populated return >0 so we abort the operation + zap. */
 +	return 1;
 +}
 +
 +static int guard_install_set_pte(unsigned long addr, unsigned long next,
 +				 pte_t *ptep, struct mm_walk *walk)
 +{
 +	unsigned long *nr_pages = (unsigned long *)walk->private;
 +
 +	/* Simply install a PTE marker, this causes segfault on access. */
 +	*ptep = make_pte_marker(PTE_MARKER_GUARD);
 +	(*nr_pages)++;
 +
 +	return 0;
 +}
 +
 +static const struct mm_walk_ops guard_install_walk_ops = {
 +	.pud_entry		= guard_install_pud_entry,
 +	.pmd_entry		= guard_install_pmd_entry,
 +	.pte_entry		= guard_install_pte_entry,
 +	.install_pte		= guard_install_set_pte,
 +	.walk_lock		= PGWALK_RDLOCK,
 +};
 +
 +static long madvise_guard_install(struct vm_area_struct *vma,
 +				 struct vm_area_struct **prev,
 +				 unsigned long start, unsigned long end)
 +{
 +	long err;
 +	int i;
 +
 +	*prev = vma;
 +	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
 +		return -EINVAL;
 +
 +	/*
 +	 * If we install guard markers, then the range is no longer
 +	 * empty from a page table perspective and therefore it's
 +	 * appropriate to have an anon_vma.
 +	 *
 +	 * This ensures that on fork, we copy page tables correctly.
 +	 */
 +	err = anon_vma_prepare(vma);
 +	if (err)
 +		return err;
 +
 +	/*
 +	 * Optimistically try to install the guard marker pages first. If any
 +	 * non-guard pages are encountered, give up and zap the range before
 +	 * trying again.
 +	 *
 +	 * We try a few times before giving up and releasing back to userland to
 +	 * loop around, releasing locks in the process to avoid contention. This
 +	 * would only happen if there was a great many racing page faults.
 +	 *
 +	 * In most cases we should simply install the guard markers immediately
 +	 * with no zap or looping.
 +	 */
 +	for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) {
 +		unsigned long nr_pages = 0;
 +
 +		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
 +		err = walk_page_range_mm(vma->vm_mm, start, end,
 +					 &guard_install_walk_ops, &nr_pages);
 +		if (err < 0)
 +			return err;
 +
 +		if (err == 0) {
 +			unsigned long nr_expected_pages = PHYS_PFN(end - start);
 +
 +			VM_WARN_ON(nr_pages != nr_expected_pages);
 +			return 0;
 +		}
 +
 +		/*
 +		 * OK some of the range have non-guard pages mapped, zap
 +		 * them. This leaves existing guard pages in place.
 +		 */
 +		zap_page_range_single(vma, start, end - start, NULL);
 +	}
 +
 +	/*
 +	 * We were unable to install the guard pages due to being raced by page
 +	 * faults. This should not happen ordinarily. We return to userspace and
 +	 * immediately retry, relieving lock contention.
 +	 */
 +	return restart_syscall();
 +}
 +
 +static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
 +				  unsigned long next, struct mm_walk *walk)
 +{
 +	pud_t pudval = pudp_get(pud);
 +
 +	/* If huge, cannot have guard pages present, so no-op - skip. */
- 	if (pud_trans_huge(pudval) || pud_devmap(pudval))
++	if (pud_trans_huge(pudval))
 +		walk->action = ACTION_CONTINUE;
 +
 +	return 0;
 +}
 +
 +static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
 +				  unsigned long next, struct mm_walk *walk)
 +{
 +	pmd_t pmdval = pmdp_get(pmd);
 +
 +	/* If huge, cannot have guard pages present, so no-op - skip. */
- 	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
++	if (pmd_trans_huge(pmdval))
 +		walk->action = ACTION_CONTINUE;
 +
 +	return 0;
 +}
 +
 +static int guard_remove_pte_entry(pte_t *pte, unsigned long addr,
 +				  unsigned long next, struct mm_walk *walk)
 +{
 +	pte_t ptent = ptep_get(pte);
 +
 +	if (is_guard_pte_marker(ptent)) {
 +		/* Simply clear the PTE marker. */
 +		pte_clear_not_present_full(walk->mm, addr, pte, false);
 +		update_mmu_cache(walk->vma, addr, pte);
 +	}
 +
 +	return 0;
 +}
 +
 +static const struct mm_walk_ops guard_remove_walk_ops = {
 +	.pud_entry		= guard_remove_pud_entry,
 +	.pmd_entry		= guard_remove_pmd_entry,
 +	.pte_entry		= guard_remove_pte_entry,
 +	.walk_lock		= PGWALK_RDLOCK,
 +};
 +
 +static long madvise_guard_remove(struct vm_area_struct *vma,
 +				 struct vm_area_struct **prev,
 +				 unsigned long start, unsigned long end)
 +{
 +	*prev = vma;
 +	/*
 +	 * We're ok with removing guards in mlock()'d ranges, as this is a
 +	 * non-destructive action.
 +	 */
 +	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
 +		return -EINVAL;
 +
 +	return walk_page_range(vma->vm_mm, start, end,
 +			       &guard_remove_walk_ops, NULL);
 +}
 +
  /*
   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
   * will handle splitting a vm area into separate areas, each area with its own
diff --cc mm/pagewalk.c
index e478777c86e1,576ff14b118f..a85c3319c6d5
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@@ -133,25 -109,18 +133,24 @@@ again
  
  		if (walk->action == ACTION_AGAIN)
  			goto again;
 -
 -		/*
 -		 * Check this here so we only break down trans_huge
 -		 * pages when we _need_ to
 -		 */
 -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
 -		    walk->action == ACTION_CONTINUE ||
 -		    !(ops->pte_entry))
 +		if (walk->action == ACTION_CONTINUE)
  			continue;
  
 +		if (!has_handler) { /* No handlers for lower page tables. */
 +			if (!has_install)
 +				continue; /* Nothing to do. */
 +			/*
 +			 * We are ONLY installing, so avoid unnecessarily
 +			 * splitting a present huge page.
 +			 */
- 			if (pmd_present(*pmd) &&
- 			    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
++			if (pmd_present(*pmd) && pmd_trans_huge(*pmd))
 +				continue;
 +		}
 +
  		if (walk->vma)
  			split_huge_pmd(walk->vma, pmd, addr);
 +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
 +			continue; /* Nothing to do. */
  
  		err = walk_pte_range(pmd, addr, next, walk);
  		if (err)
@@@ -200,26 -164,14 +199,25 @@@ static int walk_pud_range(p4d_t *p4d, u
  
  		if (walk->action == ACTION_AGAIN)
  			goto again;
 -
 -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
 -		    walk->action == ACTION_CONTINUE ||
 -		    !(ops->pmd_entry || ops->pte_entry))
 +		if (walk->action == ACTION_CONTINUE)
  			continue;
  
 +		if (!has_handler) { /* No handlers for lower page tables. */
 +			if (!has_install)
 +				continue; /* Nothing to do. */
 +			/*
 +			 * We are ONLY installing, so avoid unnecessarily
 +			 * splitting a present huge page.
 +			 */
- 			if (pud_present(*pud) &&
- 			    (pud_trans_huge(*pud) || pud_devmap(*pud)))
++			if (pud_present(*pud) && pud_trans_huge(*pud))
 +				continue;
 +		}
 +
  		if (walk->vma)
  			split_huge_pud(walk->vma, pud, addr);
 +		else if (pud_leaf(*pud) || !pud_present(*pud))
 +			continue; /* Nothing to do. */
 +
  		if (pud_none(*pud))
  			goto again;
  
@@@ -868,11 -753,7 +866,11 @@@ struct folio *folio_walk_start(struct f
  		fw->pudp = pudp;
  		fw->pud = pud;
  
 +		/*
 +		 * TODO: FW_MIGRATION support for PUD migration entries
 +		 * once there are relevant users.
 +		 */
- 		if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
+ 		if (!pud_present(pud) || pud_special(pud)) {
  			spin_unlock(ptl);
  			goto not_found;
  		} else if (!pud_leaf(pud)) {
diff --cc mm/truncate.c
index 7c304d2f0052,ee2f890dedde..cb29feac4624
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@@ -73,34 -83,51 +73,47 @@@ static void truncate_folio_batch_except
  		if (xa_is_value(fbatch->folios[j]))
  			break;
  
 -	if (j == folio_batch_count(fbatch))
 +	if (j == nr)
  		return;
  
 -	dax = dax_mapping(mapping);
 -	if (!dax) {
 -		spin_lock(&mapping->host->i_lock);
 -		xa_lock_irq(&mapping->i_pages);
 -	}
 -
 -	for (i = j; i < folio_batch_count(fbatch); i++) {
 -		struct folio *folio = fbatch->folios[i];
 -		pgoff_t index = indices[i];
 +	if (dax_mapping(mapping)) {
 +		for (i = j; i < nr; i++) {
- 			if (xa_is_value(fbatch->folios[i]))
++			if (xa_is_value(fbatch->folios[i])) {
++				/*
++			 	* File systems should already have called
++			 	* dax_break_mapping_entry() to remove all DAX entries
++			 	* while holding a lock to prevent establishing new
++			 	* entries. Therefore we shouldn't find any here.
++			 	*/
++				WARN_ON_ONCE(1);
+ 
 -		if (!xa_is_value(folio)) {
 -			fbatch->folios[j++] = folio;
 -			continue;
++				/*
++			 	* Delete the mapping so truncate_pagecache() doesn't
++			 	* loop forever.
++			 	*/
 +				dax_delete_mapping_entry(mapping, indices[i]);
++			}
  		}
 +		goto out;
 +	}
  
 -		if (unlikely(dax)) {
 -			/*
 -			 * File systems should already have called
 -			 * dax_break_mapping_entry() to remove all DAX entries
 -			 * while holding a lock to prevent establishing new
 -			 * entries. Therefore we shouldn't find any here.
 -			 */
 -			WARN_ON_ONCE(1);
 +	xas_set(&xas, indices[j]);
 +	xas_set_update(&xas, workingset_update_node);
  
 -			/*
 -			 * Delete the mapping so truncate_pagecache() doesn't
 -			 * loop forever.
 -			 */
 -			dax_delete_mapping_entry(mapping, index);
 -			continue;
 -		}
 +	spin_lock(&mapping->host->i_lock);
 +	xas_lock_irq(&xas);
  
 -		__clear_shadow_entry(mapping, index, folio);
 +	xas_for_each(&xas, folio, indices[nr-1]) {
 +		if (xa_is_value(folio))
 +			xas_store(&xas, NULL);
  	}
  
 -	if (!dax) {
 -		xa_unlock_irq(&mapping->i_pages);
 -		if (mapping_shrinkable(mapping))
 -			inode_add_lru(mapping->host);
 -		spin_unlock(&mapping->host->i_lock);
 -	}
 -	fbatch->nr = j;
 +	xas_unlock_irq(&xas);
 +	if (mapping_shrinkable(mapping))
 +		inode_add_lru(mapping->host);
 +	spin_unlock(&mapping->host->i_lock);
 +out:
 +	folio_batch_remove_exceptionals(fbatch);
  }
  
  /**
diff --cc mm/vmscan.c
index 78e9360a20aa,34ff5a6635b4..f0c8eac63972
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@@ -3362,12 -3303,9 +3362,12 @@@ static unsigned long get_pte_pfn(pte_t 
  	if (!pte_present(pte) || is_zero_pfn(pfn))
  		return -1;
  
- 	if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+ 	if (WARN_ON_ONCE(pte_special(pte)))
  		return -1;
  
 +	if (!pte_young(pte) && !mm_has_notifiers(vma->vm_mm))
 +		return -1;
 +
  	if (WARN_ON_ONCE(!pfn_valid(pfn)))
  		return -1;
  
@@@ -3387,12 -3321,6 +3387,9 @@@ static unsigned long get_pmd_pfn(pmd_t 
  	if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
  		return -1;
  
- 	if (WARN_ON_ONCE(pmd_devmap(pmd)))
- 		return -1;
- 
 +	if (!pmd_young(pmd) && !mm_has_notifiers(vma->vm_mm))
 +		return -1;
 +
  	if (WARN_ON_ONCE(!pfn_valid(pfn)))
  		return -1;
Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by David Hildenbrand 1 year, 1 month ago
On 14.12.24 02:39, Dan Williams wrote:
> [ add akpm and sfr for next steps ]
> 
> Alistair Popple wrote:
>> [...]
> 
> So this is feeling ready for -next exposure, and ideally merged for v6.14. I
> see the comments from John and Bjorn and that you were going to respin for
> that, but if it's just those details they can probably be handled
> incrementally.
> 
> Alistair, are you ready for this to hit -next?
> 
> As for which tree...
> 
> Andrew, we could take this through -mm, but my first instinct would be to try
> to take it through nvdimm.git mainly to offload any conflict wrangling work and
> small fixups which are likely to be an ongoing trickle.
> 
> However, I am not going to put up much of a fight if others prefer this go
> through -mm.
> 
> Thoughts?

I'm in the process of preparing v2 of [1] that will result in conflicts 
with this series in the rmap code (in particular [PATCH v3 14/25] 
huge_memory: Allow mappings of PUD sized pages).

I'll be away for 2 weeks over Christmas, but I assume I'll manage to 
post v2 shortly.

Which reminds me that I still have to take a closer look at some things 
in this series :) Especially also #14 regarding accounting.

I wonder if we could split out the rmap changes in #14, and have that 
patch simply in two trees? No idea.

[1] 
https://lore.kernel.org/all/20240829165627.2256514-1-david@redhat.com/T/#u

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by Alistair Popple 1 year, 1 month ago
On Sat, Dec 14, 2024 at 04:22:58PM +0100, David Hildenbrand wrote:
> On 14.12.24 02:39, Dan Williams wrote:
> > [ add akpm and sfr for next steps ]
> > 
> > Alistair Popple wrote:
> > > [...]
> > 
> > So this is feeling ready for -next exposure, and ideally merged for v6.14. I
> > see the comments from John and Bjorn and that you were going to respin for
> > that, but if it's just those details they can probably be handled
> > incrementally.
> > 
> > Alistair, are you ready for this to hit -next?

Yeah, I'm pretty happy with the series now. It "feels" right.

There are a couple of dumb build-bot errors, so I was going to respin to fix
those as well. I got caught up with a few other things, so I was just letting
this sit awaiting feedback, but I should be able to post a respin early this
week.

> > As for which tree...
> > 
> > Andrew, we could take this through -mm, but my first instinct would be to try
> > to take it through nvdimm.git mainly to offload any conflict wrangling work and
> > small fixups which are likely to be an ongoing trickle.
> > 
> > However, I am not going to put up much of a fight if others prefer this go
> > through -mm.
> > 
> > Thoughts?
> 
> I'm in the process of preparing v2 of [1] that will result in conflicts with
> this series in the rmap code (in particular [PATCH v3 14/25] huge_memory:
> Allow mappings of PUD sized pages).
> 
> I'll be away for 2 weeks over Christmas, but I assume I'll manage to post v2
> shortly.
> 
> Which reminds me that I still have to take a closer look at some things in
> this series :) Especially also #14 regarding accounting.
> 
> I wonder if we could split out the rmap changes in #14, and have that patch
> simply in two trees? No idea.

I could split out the first half (patches 1-8) into a series to go via
nvdimm.git, because they are actually standalone clean-ups that I think are
worthwhile anyway.

The remainder are more -mm focused. However, they do depend on the fs/dax
cleanups in the first half, so the trick would be making sure Andrew only takes
them if the nvdimm.git changes have made it into -next. I'm happy with either
approach, so let me know if I should split the series or not.

 - Alistair

> [1]
> https://lore.kernel.org/all/20240829165627.2256514-1-david@redhat.com/T/#u
Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by Andrew Morton 1 year, 1 month ago
On Mon, 16 Dec 2024 11:55:30 +1100 Alistair Popple <apopple@nvidia.com> wrote:

> The remainder are more -mm focused. However, they do depend on the fs/dax
> cleanups in the first half, so the trick would be making sure Andrew only takes
> them if the nvdimm.git changes have made it into -next. I'm happy with either
> approach, so let me know if I should split the series or not.

My inclination is to put it all into the nvdimm tree, with appropriate
MM developer acks.

But I'm having difficulty determining how practical that is because the
v3 series is almost a month old so my test merging was quite ugly.

Perhaps you could prepare a new-doesn't-need-to-be-final version for
people to look at and to aid with this head-scratching?
Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Posted by Alistair Popple 1 year, 1 month ago
On Sun, Dec 15, 2024 at 10:26:55PM -0800, Andrew Morton wrote:
> On Mon, 16 Dec 2024 11:55:30 +1100 Alistair Popple <apopple@nvidia.com> wrote:
> 
> > The remainder are more -mm focused. However, they do depend on the fs/dax
> > cleanups in the first half, so the trick would be making sure Andrew only takes
> > them if the nvdimm.git changes have made it into -next. I'm happy with either
> > approach, so let me know if I should split the series or not.
> 
> My inclination is to put it all into the nvdimm tree, with appropriate
> MM developer acks.
> 
> But I'm having difficulty determining how practical that is because the
> v3 series is almost a month old so my test merging was quite ugly.
> 
> Perhaps you could prepare a new-doesn't-need-to-be-final version for
> people to look at and to aid with this head-scratching?

I have just sent a new-maybe-almost-final v4 rebased on top of next-20241216 to
help with the head-scratching. I haven't yet done extensive build tests or a full
xfstests run on it because it sounded better to get it out sooner. So no doubt
the kernel build bot will find some fat finger of mine somewhere :-)

That said, the rebase wasn't awful, so let me know if it should be rebased on a
different tree.

 - Alistair