From nobody Mon Apr 6 09:19:39 2026
Date: Fri, 20 Mar 2026 18:23:26 +0000
In-Reply-To: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
X-Mailer: b4 0.14.3
Message-ID: <20260320-page_alloc-unmapped-v2-2-28bf1bd54f41@google.com>
Subject: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
From: Brendan Jackman
To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
 David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner, Zi Yan,
 Lorenzo Stoakes
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
 rppt@kernel.org, Sumit Garg, derkling@google.com, reijiw@google.com,
 Will Deacon, rientjes@google.com, "Kalyazin, Nikita",
 patrick.roy@linux.dev, "Itazuri, Takahiro", Andy Lutomirski,
 David Kaplan, Thomas Gleixner, Brendan Jackman, Yosry Ahmed
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Various security features benefit from having process-local address
mappings. Examples include no-direct-map guest_memfd [2] and significant
optimizations for ASI [1].

As pointed out by Andy in [0], x86 already has a PGD entry that is local
to the mm, which is used for the LDT. So, simply redefine that entry's
region as "the mm-local region" and then redefine the LDT region as a
sub-region of that.

With the currently-envisaged usecases, there will be many situations
where almost no processes have any need for the mm-local region.
Therefore, avoid its overhead (memory cost of pagetables, alloc/free
overhead during fork/exit) for processes that don't use it by requiring
its users to explicitly initialize it via the new mm_local_* API.

This means that the LDT remap code can be simplified:

1. map_ldt_struct_to_user() and free_ldt_pgtables() are no longer
   required, as the mm_local core code handles that automatically.

2. The sanity-check logic is unified: in both cases, just walk the
   pagetables via a generic mechanism.
This slightly relaxes the sanity-checking, since lookup_address_in_pgd()
is more flexible than pgd_to_pmd_walk(), but this seems worth it for the
simplified code.

On 64-bit, the mm-local region gets a whole PGD. On 32-bit, it gets just
one PMD, i.e. it is completely consumed by the LDT remap - no
investigation has been done into whether it's feasible to expand the
region on 32-bit. Most likely there is no strong usecase for that anyway.

In both cases, to combine the need for on-demand initialisation of the mm
with the desire to transparently handle propagating mappings to userspace
under KPTI, the user and kernel pagetables are shared at the highest
level possible. For PAE that means the PTE table is shared, and for
64-bit the P4D/PUD. This is implemented by pre-allocating the first
shared table when the mm-local region is first initialised.

The PAE implementation of mm_local_map_to_user() does not allocate
pagetables; it assumes the PMD has been preallocated. To make that
assumption safer, expose PREALLOCATED_PMDS in the arch headers so that
mm_local_map_to_user() can have a BUILD_BUG_ON().
[0] https://lore.kernel.org/linux-mm/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@mail.gmail.com/
[1] https://linuxasi.dev/
[2] https://lore.kernel.org/all/20250924151101.2225820-1-patrick.roy@campus.lmu.de

Signed-off-by: Brendan Jackman
---
 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   2 +
 arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
 arch/x86/include/asm/page.h             |  32 ++++++++
 arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
 arch/x86/include/asm/pgtable_64_types.h |  12 ++-
 arch/x86/kernel/ldt.c                   | 130 +++++---------------------------
 arch/x86/mm/pgtable.c                   |  32 +-------
 include/linux/mm.h                      |  13 ++++
 include/linux/mm_types.h                |   2 +
 kernel/fork.c                           |   1 +
 mm/Kconfig                              |  11 +++
 12 files changed, 217 insertions(+), 150 deletions(-)

diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index a6cf05d51bd8c..fa2bb7bab6a42 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -53,7 +53,7 @@ Complete virtual memory map with 4-level page tables
   ____________________________________________________________|___________________________________________________________
                     |            |                  |         |
  ffff800000000000 |  -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 |  -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+ ffff880000000000 |  -120    TB | ffff887fffffffff |  0.5 TB | MM-local kernel data. Includes LDT remap for PTI
  ffff888000000000 | -119.5   TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
  ffffc88000000000 |  -55.5   TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
  ffffc90000000000 |  -55     TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
@@ -123,7 +123,7 @@ Complete virtual memory map with 5-level page tables
   ____________________________________________________________|___________________________________________________________
                     |            |                  |         |
  ff00000000000000 |  -64     PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 |  -60     PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff10000000000000 |  -60     PB | ff10ffffffffffff | 0.25 PB | MM-local kernel data. Includes LDT remap for PTI
  ff11000000000000 | -59.75   PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
  ff91000000000000 | -27.75   PB | ff9fffffffffffff | 3.75 PB | ... unused hole
  ffa0000000000000 |  -24     PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8038b26ae99e0..d7073b6077c62 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,7 @@ config X86
 	select ARCH_SUPPORTS_RT
 	select ARCH_SUPPORTS_AUTOFDO_CLANG
 	select ARCH_SUPPORTS_PROPELLER_CLANG	if X86_64
+	select ARCH_SUPPORTS_MM_LOCAL_REGION	if X86_64 || X86_PAE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if X86_CX8
 	select ARCH_USE_MEMTEST
@@ -2323,6 +2324,7 @@ config CMDLINE_OVERRIDE
 config MODIFY_LDT_SYSCALL
 	bool "Enable the LDT (local descriptor table)" if EXPERT
 	default y
+	select MM_LOCAL_REGION if MITIGATION_PAGE_TABLE_ISOLATION || X86_PAE
 	help
 	  Linux can allow user programs to install a per-process x86
 	  Local Descriptor Table (LDT) using the modify_ldt(2) system
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e2..14f75d1d7e28f 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -8,8 +8,10 @@
 
 #include
 
+#include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -59,7 +61,6 @@ static inline void init_new_context_ldt(struct mm_struct *mm) { }
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
-void ldt_arch_exit_mmap(struct mm_struct *mm);
 #else /* CONFIG_MODIFY_LDT_SYSCALL */
 static inline void init_new_context_ldt(struct mm_struct *mm) { }
 static inline int ldt_dup_context(struct mm_struct *oldmm,
@@ -68,7 +69,6 @@ static inline int ldt_dup_context(struct mm_struct *oldmm,
 	return 0;
 }
 static inline void destroy_context_ldt(struct mm_struct *mm) { }
-static inline void ldt_arch_exit_mmap(struct mm_struct *mm) { }
 #endif
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
@@ -223,10 +223,123 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 	return ldt_dup_context(oldmm, mm);
 }
 
+#ifdef CONFIG_MM_LOCAL_REGION
+static inline void mm_local_region_free(struct mm_struct *mm)
+{
+	if (mm_local_region_used(mm)) {
+		struct mmu_gather tlb;
+		unsigned long start = MM_LOCAL_BASE_ADDR;
+		unsigned long end = MM_LOCAL_END_ADDR;
+
+		/*
+		 * Although free_pgd_range() is intended for freeing user
+		 * page-tables, it also works out for kernel mappings on x86.
+		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
+		 * range-tracking logic in __tlb_adjust_range().
+		 */
+		tlb_gather_mmu_fullmm(&tlb, mm);
+		free_pgd_range(&tlb, start, end, start, end);
+		tlb_finish_mmu(&tlb);
+
+		mm_flags_clear(MMF_LOCAL_REGION_USED, mm);
+	}
+}
+
+#if defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) && defined(CONFIG_X86_PAE)
+static inline pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+
+	if (pgd->pgd == 0)
+		return NULL;
+
+	p4d = p4d_offset(pgd, va);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, va);
+	if (pud_none(*pud))
+		return NULL;
+
+	return pmd_offset(pud, va);
+}
+
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	BUILD_BUG_ON(!PREALLOCATED_PMDS);
+	pgd_t *k_pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	pmd_t *k_pmd, *u_pmd;
+	int err;
+
+	k_pmd = pgd_to_pmd_walk(k_pgd, MM_LOCAL_BASE_ADDR);
+	u_pmd = pgd_to_pmd_walk(u_pgd, MM_LOCAL_BASE_ADDR);
+
+	BUILD_BUG_ON(MM_LOCAL_END_ADDR - MM_LOCAL_BASE_ADDR > PMD_SIZE);
+
+	/* Preallocate the PTE table so it can be shared. */
+	err = pte_alloc(mm, k_pmd);
+	if (err)
+		return err;
+
+	/* Point the userspace PMD at the same PTE as the kernel PMD. */
+	set_pmd(u_pmd, *k_pmd);
+	return 0;
+}
+#elif defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	pgd_t *pgd;
+	int err;
+
+	err = preallocate_sub_pgd(mm, MM_LOCAL_BASE_ADDR);
+	if (err)
+		return err;
+
+	pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
+	set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+	return 0;
+}
+#else
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	WARN_ONCE(1, "mm_local_map_to_user() not implemented");
+	return -EINVAL;
+}
+#endif
+
+/*
+ * Do initial setup of the mm-local region. Call from process context.
+ *
+ * Under PTI, userspace shares the pagetables for the mm-local region with the
+ * kernel (if you map stuff here, it's immediately mapped into userspace too),
+ * as needed for the LDT remap. It's assumed nothing gets mapped in here that
+ * needs to be protected from Meltdown-type attacks from the current process.
+ */
+static inline int mm_local_region_init(struct mm_struct *mm)
+{
+	int err;
+
+	if (boot_cpu_has(X86_FEATURE_PTI)) {
+		err = mm_local_map_to_user(mm);
+		if (err)
+			return err;
+	}
+
+	mm_flags_set(MMF_LOCAL_REGION_USED, mm);
+
+	return 0;
+}
+
+#else
+static inline void mm_local_region_free(struct mm_struct *mm) { }
+#endif /* CONFIG_MM_LOCAL_REGION */
+
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	paravirt_arch_exit_mmap(mm);
-	ldt_arch_exit_mmap(mm);
+	mm_local_region_free(mm);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 416dc88e35c15..4de4715c3b40f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -78,6 +78,38 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
 	return __canonical_address(vaddr, vaddr_bits) == vaddr;
 }
 
+#ifdef CONFIG_X86_PAE
+
+/*
+ * In PAE mode, we need to do a cr3 reload (=tlb flush) when
+ * updating the top-level pagetable entries to guarantee the
+ * processor notices the update. Since this is expensive, and
+ * all 4 top-level entries are used almost immediately in a
+ * new process's life, we just pre-populate them here.
+ */
+#define PREALLOCATED_PMDS	PTRS_PER_PGD
+
+/*
+ * "USER_PMDS" are the PMDs for the user copy of the page tables when
+ * PTI is enabled. They do not exist when PTI is disabled. Note that
+ * this is distinct from the user _portion_ of the kernel page tables
+ * which always exists.
+ *
+ * We allocate separate PMDs for the kernel part of the user page-table
+ * when PTI is enabled. We need them to map the per-process LDT into the
+ * user-space page-table.
+ */
+#define PREALLOCATED_USER_PMDS	(boot_cpu_has(X86_FEATURE_PTI) ? \
+					KERNEL_PGD_PTRS : 0)
+#define MAX_PREALLOCATED_USER_PMDS	KERNEL_PGD_PTRS
+
+#else /* !CONFIG_X86_PAE */
+
+/* No need to prepopulate any pagetable entries in non-PAE modes. */
+#define PREALLOCATED_PMDS	0
+#define PREALLOCATED_USER_PMDS	0
+#define MAX_PREALLOCATED_USER_PMDS	0
+
+#endif /* CONFIG_X86_PAE */
+
 #endif /* __ASSEMBLER__ */
 
 #include
diff --git a/arch/x86/include/asm/pgtable_32_areas.h b/arch/x86/include/asm/pgtable_32_areas.h
index 921148b429676..7fccb887f8b33 100644
--- a/arch/x86/include/asm/pgtable_32_areas.h
+++ b/arch/x86/include/asm/pgtable_32_areas.h
@@ -30,9 +30,14 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define CPU_ENTRY_AREA_BASE	\
 	((FIXADDR_TOT_START - PAGE_SIZE*(CPU_ENTRY_AREA_PAGES+1)) & PMD_MASK)
 
-#define LDT_BASE_ADDR		\
-	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
+/*
+ * On 32-bit the mm-local region is currently completely consumed by the LDT
+ * remap.
+ */
+#define MM_LOCAL_BASE_ADDR	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
+#define MM_LOCAL_END_ADDR	(MM_LOCAL_BASE_ADDR + PMD_SIZE)
 
+#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
 
 #define PKMAP_BASE		\
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 7eb61ef6a185f..1181565966405 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -5,8 +5,11 @@
 #include
 
 #ifndef __ASSEMBLER__
+#include
 #include
 #include
+#include
+#include
 
 /*
  * These are used to make use of C type-checking..
@@ -100,9 +103,12 @@ extern unsigned int ptrs_per_p4d;
 #define GUARD_HOLE_BASE_ADDR	(GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
 #define GUARD_HOLE_END_ADDR	(GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)
 
-#define LDT_PGD_ENTRY		-240UL
-#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
-#define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
+#define MM_LOCAL_PGD_ENTRY	-240UL
+#define MM_LOCAL_BASE_ADDR	(MM_LOCAL_PGD_ENTRY << PGDIR_SHIFT)
+#define MM_LOCAL_END_ADDR	((MM_LOCAL_PGD_ENTRY + 1) << PGDIR_SHIFT)
+
+#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
+#define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
 
 #define __VMALLOC_BASE_L4	0xffffc90000000000UL
 #define __VMALLOC_BASE_L5	0xffa0000000000000UL
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 40c5bf97dd5cc..fb2a1914539f8 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -31,6 +31,8 @@
 
 #include
 
+/* LDTs are double-buffered, the buffers are called slots. */
+#define LDT_NUM_SLOTS		2
 /* This is a multiple of PAGE_SIZE. */
 #define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
 
@@ -186,100 +188,36 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 
 #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
 
-static void do_sanity_check(struct mm_struct *mm,
-			    bool had_kernel_mapping,
-			    bool had_user_mapping)
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
 {
+	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	unsigned int k_level, u_level;
+	bool had_kernel, had_user;
+
+	had_kernel = lookup_address_in_pgd(k_pgd, LDT_BASE_ADDR, &k_level);
+	had_user = lookup_address_in_pgd(u_pgd, LDT_BASE_ADDR, &u_level);
+
 	if (mm->context.ldt) {
 		/*
 		 * We already had an LDT. The top-level entry should already
 		 * have been allocated and synchronized with the usermode
 		 * tables.
 		 */
-		WARN_ON(!had_kernel_mapping);
+		WARN_ON(!had_kernel);
 		if (boot_cpu_has(X86_FEATURE_PTI))
-			WARN_ON(!had_user_mapping);
+			WARN_ON(!had_user);
 	} else {
 		/*
 		 * This is the first time we're mapping an LDT for this process.
 		 * Sync the pgd to the usermode tables.
 		 */
-		WARN_ON(had_kernel_mapping);
+		WARN_ON(had_kernel);
 		if (boot_cpu_has(X86_FEATURE_PTI))
-			WARN_ON(had_user_mapping);
+			WARN_ON(had_user);
 	}
 }
 
-#ifdef CONFIG_X86_PAE
-
-static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
-{
-	p4d_t *p4d;
-	pud_t *pud;
-
-	if (pgd->pgd == 0)
-		return NULL;
-
-	p4d = p4d_offset(pgd, va);
-	if (p4d_none(*p4d))
-		return NULL;
-
-	pud = pud_offset(p4d, va);
-	if (pud_none(*pud))
-		return NULL;
-
-	return pmd_offset(pud, va);
-}
-
-static void map_ldt_struct_to_user(struct mm_struct *mm)
-{
-	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
-	pmd_t *k_pmd, *u_pmd;
-
-	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
-	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
-
-	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
-		set_pmd(u_pmd, *k_pmd);
-}
-
-static void sanity_check_ldt_mapping(struct mm_struct *mm)
-{
-	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
-	bool had_kernel, had_user;
-	pmd_t *k_pmd, *u_pmd;
-
-	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
-	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
-	had_kernel = (k_pmd->pmd != 0);
-	had_user = (u_pmd->pmd != 0);
-
-	do_sanity_check(mm, had_kernel, had_user);
-}
-
-#else /* !CONFIG_X86_PAE */
-
-static void map_ldt_struct_to_user(struct mm_struct *mm)
-{
-	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
-
-	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
-		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
-}
-
-static void sanity_check_ldt_mapping(struct mm_struct *mm)
-{
-	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	bool had_kernel = (pgd->pgd != 0);
-	bool had_user = (kernel_to_user_pgdp(pgd)->pgd != 0);
-
-	do_sanity_check(mm, had_kernel, had_user);
-}
-
-#endif /* CONFIG_X86_PAE */
-
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
@@ -295,6 +233,8 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return 0;
 
+	mm_local_region_init(mm);
+
 	/*
 	 * Any given ldt_struct should have map_ldt_struct() called at most
 	 * once.
@@ -339,9 +279,6 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 		pte_unmap_unlock(ptep, ptl);
 	}
 
-	/* Propagate LDT mapping to the user page-table */
-	map_ldt_struct_to_user(mm);
-
 	ldt->slot = slot;
 	return 0;
 }
@@ -390,28 +327,6 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 }
 #endif /* CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
 
-static void free_ldt_pgtables(struct mm_struct *mm)
-{
-#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
-	struct mmu_gather tlb;
-	unsigned long start = LDT_BASE_ADDR;
-	unsigned long end = LDT_END_ADDR;
-
-	if (!boot_cpu_has(X86_FEATURE_PTI))
-		return;
-
-	/*
-	 * Although free_pgd_range() is intended for freeing user
-	 * page-tables, it also works out for kernel mappings on x86.
-	 * We use tlb_gather_mmu_fullmm() to avoid confusing the
-	 * range-tracking logic in __tlb_adjust_range().
-	 */
-	tlb_gather_mmu_fullmm(&tlb, mm);
-	free_pgd_range(&tlb, start, end, start, end);
-	tlb_finish_mmu(&tlb);
-#endif
-}
-
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -472,7 +387,6 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 
 	retval = map_ldt_struct(mm, new_ldt, 0);
 	if (retval) {
-		free_ldt_pgtables(mm);
 		free_ldt_struct(new_ldt);
 		goto out_unlock;
 	}
@@ -494,11 +408,6 @@ void destroy_context_ldt(struct mm_struct *mm)
 	mm->context.ldt = NULL;
 }
 
-void ldt_arch_exit_mmap(struct mm_struct *mm)
-{
-	free_ldt_pgtables(mm);
-}
-
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
 	struct mm_struct *mm = current->mm;
@@ -645,10 +554,9 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 		/*
 		 * This only can fail for the first LDT setup. If an LDT is
 		 * already installed then the PTE page is already
-		 * populated. Mop up a half populated page table.
+		 * populated.
 		 */
-		if (!WARN_ON_ONCE(old_ldt))
-			free_ldt_pgtables(mm);
+		WARN_ON_ONCE(!old_ldt);
 		free_ldt_struct(new_ldt);
 		goto out_unlock;
 	}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 2e5ecfdce73c3..e4132696c9ef2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -111,29 +111,6 @@ static void pgd_dtor(pgd_t *pgd)
  */
 
 #ifdef CONFIG_X86_PAE
-/*
- * In PAE mode, we need to do a cr3 reload (=tlb flush) when
- * updating the top-level pagetable entries to guarantee the
- * processor notices the update. Since this is expensive, and
- * all 4 top-level entries are used almost immediately in a
- * new process's life, we just pre-populate them here.
- */
-#define PREALLOCATED_PMDS	PTRS_PER_PGD
-
-/*
- * "USER_PMDS" are the PMDs for the user copy of the page tables when
- * PTI is enabled. They do not exist when PTI is disabled. Note that
- * this is distinct from the user _portion_ of the kernel page tables
- * which always exists.
- *
- * We allocate separate PMDs for the kernel part of the user page-table
- * when PTI is enabled. We need them to map the per-process LDT into the
- * user-space page-table.
- */
-#define PREALLOCATED_USER_PMDS	(boot_cpu_has(X86_FEATURE_PTI) ? \
-					KERNEL_PGD_PTRS : 0)
-#define MAX_PREALLOCATED_USER_PMDS	KERNEL_PGD_PTRS
-
 void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 {
 	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
@@ -150,12 +127,6 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 	 */
 	flush_tlb_mm(mm);
 }
-#else /* !CONFIG_X86_PAE */
-
-/* No need to prepopulate any pagetable entries in non-PAE modes. */
-#define PREALLOCATED_PMDS	0
-#define PREALLOCATED_USER_PMDS	0
-#define MAX_PREALLOCATED_USER_PMDS	0
 #endif /* CONFIG_X86_PAE */
 
 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
@@ -375,6 +346,9 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
+	/* Should be cleaned up in mmap exit path. */
+	VM_WARN_ON_ONCE(mm_local_region_used(mm));
+
 	pgd_mop_up_pmds(mm, pgd);
 	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 70747b53c7da9..413dc707cff9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,19 @@ static inline void mm_flags_clear_all(struct mm_struct *mm)
 	bitmap_zero(ACCESS_PRIVATE(&mm->flags, __mm_flags), NUM_MM_FLAG_BITS);
 }
 
+#ifdef CONFIG_MM_LOCAL_REGION
+static inline bool mm_local_region_used(struct mm_struct *mm)
+{
+	return mm_flags_test(MMF_LOCAL_REGION_USED, mm);
+}
+#else
+static inline bool mm_local_region_used(struct mm_struct *mm)
+{
+	VM_WARN_ON_ONCE(mm_flags_test(MMF_LOCAL_REGION_USED, mm));
+	return false;
+}
+#endif
+
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cee934c6e78ec..0ca7cb7da918f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1944,6 +1944,8 @@ enum {
 
 #define MMF_USER_HWCAP		32	/* user-defined HWCAPs */
 
+#define MMF_LOCAL_REGION_USED	33
+
 #define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
 				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
diff --git a/kernel/fork.c b/kernel/fork.c
index 68cf0109dde3c..ff075c74333fe 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1153,6 +1153,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 fail_nocontext:
 	mm_free_id(mm);
 fail_noid:
+	WARN_ON_ONCE(mm_local_region_used(mm));
 	mm_free_pgd(mm);
 fail_nopgd:
 	futex_hash_free(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687e..2813059df9c1c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1319,6 +1319,10 @@ config SECRETMEM
 	default y
 	bool "Enable memfd_secret() system call" if EXPERT
 	depends on ARCH_HAS_SET_DIRECT_MAP
+	# Soft dependency, for optimisation.
+	imply MM_LOCAL_REGION
+	imply MERMAP
+	imply PAGE_ALLOC_UNMAPPED
 	help
 	  Enable the memfd_secret() system call with the ability to create
 	  memory areas visible only in the context of the owning process and
@@ -1471,6 +1475,13 @@ config LAZY_MMU_MODE_KUNIT_TEST
 
 	  If unsure, say N.
 
+config ARCH_SUPPORTS_MM_LOCAL_REGION
+	def_bool n
+
+config MM_LOCAL_REGION
+	bool
+	depends on ARCH_SUPPORTS_MM_LOCAL_REGION
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.51.2