From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, catalin.marinas@arm.com,
	will@kernel.org
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, suzuki.poulose@arm.com, steven.price@arm.com,
	gshan@redhat.com, linux-arm-kernel@lists.infradead.org,
	yang@os.amperecomputing.com, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
Date: Tue, 10 Jun 2025 17:14:00 +0530
Message-Id: <20250610114401.7097-2-dev.jain@arm.com>
In-Reply-To: <20250610114401.7097-1-dev.jain@arm.com>
References: <20250610114401.7097-1-dev.jain@arm.com>

arm64 currently changes permissions on vmalloc objects locklessly, via
apply_to_page_range. Patch 2 moves away from this to the pagewalk API,
since a limitation of the former is that it refuses to change permissions
on block mappings. However, the pagewalk API currently requires
init_mm.mmap_lock to be held. To avoid turning the mmap_lock into an
unnecessary bottleneck for our usecase, this patch extends the generic API
so that it can be used locklessly, retaining the existing behaviour for
permission changes.

Apart from this, it is noted at [1] that KFENCE can manipulate kernel
pgtable entries during softirqs, by calling
set_memory_valid() -> __change_memory_common(). Since that is a
non-sleepable context, we cannot take the init_mm mmap lock there.
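To make the constraint concrete, the locked pattern the pagewalk API
demands of kernel-range walkers today is sketched below (illustration
only, not from this series; the callback and wrapper names are
hypothetical and the range is assumed to be mapped at PTE granularity).
Taking init_mm's mmap_lock may sleep, which is exactly what the KFENCE
softirq path cannot do:

#include <linux/mm.h>
#include <linux/pagewalk.h>

/* Count present PTEs in [start, end); assumes page-granular mappings. */
static int count_present_pte(pte_t *pte, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	if (pte_present(ptep_get(pte)))
		(*count)++;
	return 0;
}

static const struct mm_walk_ops count_ops = {
	.pte_entry	= count_present_pte,
	.walk_lock	= PGWALK_RDLOCK,
};

static unsigned long count_kernel_range(unsigned long start, unsigned long end)
{
	unsigned long count = 0;

	mmap_read_lock(&init_mm);	/* sleeping lock: the bottleneck */
	walk_kernel_page_table_range(start, end, &count_ops, NULL, &count);
	mmap_read_unlock(&init_mm);

	return count;
}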
Since such an extension can potentially be dangerous for other callers
consuming the pagewalk API, explicitly disallow lockless traversal of
userspace pagetables by returning -EINVAL. Add comments to highlight the
conditions under which we can use the API locklessly: there is no
underlying VMA, and the caller has exclusive control over the range, thus
guaranteeing no concurrent access.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 23 ++++++++++++++++++-----
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 8ac2f6d6d2a3..5efd6541239b 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -14,6 +14,13 @@ enum page_walk_lock {
 	PGWALK_WRLOCK = 1,
 	/* vma is expected to be already write-locked during the walk */
 	PGWALK_WRLOCK_VERIFY = 2,
+	/*
+	 * Walk without any lock. Use of this is only meant for the
+	 * case where there is no underlying VMA, and the user has
+	 * exclusive control over the range, guaranteeing no concurrent
+	 * access. For example, changing permissions of vmalloc objects.
+	 */
+	PGWALK_NOLOCK = 3,
 };
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ff5299eca687..d55d933f84ec 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 	return err;
 }
 
-static inline void process_mm_walk_lock(struct mm_struct *mm,
+static inline bool process_mm_walk_lock(struct mm_struct *mm,
 					enum page_walk_lock walk_lock)
 {
+	if (walk_lock == PGWALK_NOLOCK)
+		return 1;
+
 	if (walk_lock == PGWALK_RDLOCK)
 		mmap_assert_locked(mm);
 	else
 		mmap_assert_write_locked(mm);
+	return 0;
 }
 
 static inline void process_vma_walk_lock(struct vm_area_struct *vma,
@@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 	case PGWALK_RDLOCK:
 		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
 		break;
+	case PGWALK_NOLOCK:
+		break;
 	}
 #endif
 }
@@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 	if (!walk.mm)
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 
 	vma = find_vma(walk.mm, start);
 	do {
@@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
 	 * to prevent the intermediate kernel pages tables belonging to the
 	 * specified address range from being freed. The caller should take
	 * other actions to prevent this race.
+	 *
+	 * If the caller can guarantee that it has exclusive access to the
+	 * specified address range, only then it can use PGWALK_NOLOCK.
 	 */
-	mmap_assert_locked(mm);
+	if (ops->walk_lock != PGWALK_NOLOCK)
+		mmap_assert_locked(mm);
 
 	return walk_pgd_range(start, end, &walk);
 }
@@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 	if (!check_ops_valid(ops))
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(start, end, &walk);
 }
@@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 	if (!check_ops_valid(ops))
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
 }
-- 
2.30.2
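A usage illustration of the PGWALK_NOLOCK mode added by the patch above
(not part of the series; names other than the pagewalk API are
hypothetical). It mirrors the locked sketch under patch 1, with the lock
dropped: a caller that has exclusive control over a kernel VA range, such
as a vmalloc object it allocated itself, can now walk it without touching
init_mm.mmap_lock. The same ops passed to the VMA-based walkers
(walk_page_range() and friends) are instead rejected with -EINVAL, per the
process_mm_walk_lock() change:

#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/vmalloc.h>

/* Count present PTEs of a caller-owned, page-granular vmalloc object. */
static int vmalloc_count_pte(pte_t *pte, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	if (pte_present(ptep_get(pte)))
		(*count)++;
	return 0;
}

static const struct mm_walk_ops vmalloc_count_ops = {
	.pte_entry	= vmalloc_count_pte,
	.walk_lock	= PGWALK_NOLOCK,	/* no init_mm.mmap_lock taken */
};

static unsigned long vmalloc_count_present(void *obj, unsigned long size)
{
	unsigned long start = (unsigned long)obj;
	unsigned long count = 0;

	/* Lockless: the caller owns [start, start + size) exclusively. */
	walk_kernel_page_table_range(start, start + size, &vmalloc_count_ops,
				     NULL, &count);
	return count;
}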
From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, catalin.marinas@arm.com,
	will@kernel.org
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, suzuki.poulose@arm.com, steven.price@arm.com,
	gshan@redhat.com, linux-arm-kernel@lists.infradead.org,
	yang@os.amperecomputing.com, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
Date: Tue, 10 Jun 2025 17:14:01 +0530
Message-Id: <20250610114401.7097-3-dev.jain@arm.com>
In-Reply-To: <20250610114401.7097-1-dev.jain@arm.com>
References: <20250610114401.7097-1-dev.jain@arm.com>

Since apply_to_page_range does not support operations on block mappings,
use the generic pagewalk API to enable changing permissions for kernel
block mappings. This paves the way for enabling huge mappings by default
on kernel-space mappings, leading to more efficient TLB usage. We only
require that the start and end of a given range lie on leaf mapping
boundaries. Return -EINVAL in case a partial block mapping is detected;
add a corresponding comment in ___change_memory_common() to warn that
eliminating such a condition is the responsibility of the caller.

apply_to_page_range ultimately uses the lazy MMU hooks in its PTE-level
function (apply_to_pte_range); we want to retain this behaviour after this
patch too. Ryan says:

"The only reason we traditionally confine the lazy mmu mode to a single
page table is because we want to enclose it within the PTL. But that
requirement doesn't stand for kernel mappings. As long as the walker can
guarantee that it doesn't allocate any memory (because with certain debug
settings that can cause lazy mmu nesting) or try to sleep then I think we
can just bracket the entire call."

Therefore, wrap the call to walk_kernel_page_table_range() with the lazy
MMU helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
 1 file changed, 126 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 04d4a8f676db..2c118c0922ef 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include <linux/pagewalk.h>
 
 #include
 #include
@@ -20,6 +21,100 @@ struct page_change_data {
 	pgprot_t clear_mask;
 };
 
+static pteval_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
+{
+	struct page_change_data *masks = walk->private;
+
+	val &= ~(pgprot_val(masks->clear_mask));
+	val |= (pgprot_val(masks->set_mask));
+
+	return val;
+}
+
+static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pgd_t val = pgdp_get(pgd);
+
+	if (pgd_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
+			return -EINVAL;
+		val = __pgd(set_pageattr_masks(pgd_val(val), walk));
+		set_pgd(pgd, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	p4d_t val = p4dp_get(p4d);
+
+	if (p4d_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
+			return -EINVAL;
+		val = __p4d(set_pageattr_masks(p4d_val(val), walk));
+		set_p4d(p4d, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pud_t val = pudp_get(pud);
+
+	if (pud_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
+			return -EINVAL;
+		val = __pud(set_pageattr_masks(pud_val(val), walk));
+		set_pud(pud, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pmd_t val = pmdp_get(pmd);
+
+	if (pmd_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
+			return -EINVAL;
+		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
+		set_pmd(pmd, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pte_t val = __ptep_get(pte);
+
+	val = __pte(set_pageattr_masks(pte_val(val), walk));
+	__set_pte(pte, val);
+
+	return 0;
+}
+
+static const struct mm_walk_ops pageattr_ops = {
+	.pgd_entry	= pageattr_pgd_entry,
+	.p4d_entry	= pageattr_p4d_entry,
+	.pud_entry	= pageattr_pud_entry,
+	.pmd_entry	= pageattr_pmd_entry,
+	.pte_entry	= pageattr_pte_entry,
+	.walk_lock	= PGWALK_NOLOCK,
+};
+
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
 bool can_set_direct_map(void)
@@ -37,22 +132,7 @@ bool can_set_direct_map(void)
 		arm64_kfence_can_set_direct_map() || is_realm_world();
 }
 
-static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
-{
-	struct page_change_data *cdata = data;
-	pte_t pte = __ptep_get(ptep);
-
-	pte = clear_pte_bit(pte, cdata->clear_mask);
-	pte = set_pte_bit(pte, cdata->set_mask);
-
-	__set_pte(ptep, pte);
-	return 0;
-}
-
-/*
- * This function assumes that the range is mapped with PAGE_SIZE pages.
- */
-static int __change_memory_common(unsigned long start, unsigned long size,
+static int ___change_memory_common(unsigned long start, unsigned long size,
 				  pgprot_t set_mask, pgprot_t clear_mask)
 {
 	struct page_change_data data;
@@ -61,9 +141,28 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	data.set_mask = set_mask;
 	data.clear_mask = clear_mask;
 
-	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
-				  &data);
+	arch_enter_lazy_mmu_mode();
+
+	/*
+	 * The caller must ensure that the range we are operating on does not
+	 * partially overlap a block mapping. Any such case should either not
+	 * exist, or must be eliminated by splitting the mapping - which for
+	 * kernel mappings can be done only on BBML2 systems.
+	 *
+	 */
+	ret = walk_kernel_page_table_range(start, start + size, &pageattr_ops,
+					   NULL, &data);
+	arch_leave_lazy_mmu_mode();
+
+	return ret;
+}
 
+static int __change_memory_common(unsigned long start, unsigned long size,
+				  pgprot_t set_mask, pgprot_t clear_mask)
+{
+	int ret;
+
+	ret = ___change_memory_common(start, size, set_mask, clear_mask);
 	/*
 	 * If the memory is being made valid without changing any other bits
 	 * then a TLBI isn't required as a non-valid entry cannot be cached in
@@ -71,6 +170,7 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	 */
 	if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
 		flush_tlb_kernel_range(start, start + size);
+
 	return ret;
 }
 
@@ -174,32 +274,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
-	struct page_change_data data = {
-		.set_mask = __pgprot(0),
-		.clear_mask = __pgprot(PTE_VALID),
-	};
+	pgprot_t clear_mask = __pgprot(PTE_VALID);
+	pgprot_t set_mask = __pgprot(0);
 
 	if (!can_set_direct_map())
 		return 0;
 
-	return apply_to_page_range(&init_mm,
-				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+	return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
+				       set_mask, clear_mask);
 }
 
 int set_direct_map_default_noflush(struct page *page)
 {
-	struct page_change_data data = {
-		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
-		.clear_mask = __pgprot(PTE_RDONLY),
-	};
+	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
+	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
 
 	if (!can_set_direct_map())
 		return 0;
 
-	return apply_to_page_range(&init_mm,
-				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+	return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
+				       set_mask, clear_mask);
 }
 
 static int __set_memory_enc_dec(unsigned long addr,
-- 
2.30.2
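To make the caller-visible contract of the patch above concrete, a minimal
sketch follows (not part of the patch; it assumes 'addr' is the start of a
vmalloc region backed by a single PMD block mapping, something this series
paves the way for). A permission change covering the whole leaf entry is
applied in place, while one that only partially covers it is refused with
-EINVAL rather than being silently split:

#include <linux/bug.h>
#include <linux/pgtable.h>
#include <linux/set_memory.h>

/*
 * Sketch only: 'addr' is assumed to be the start of a vmalloc region that
 * is backed by a single PMD block mapping.
 */
static int demo_block_perm_change(unsigned long addr)
{
	int ret;

	/* Covers the whole leaf entry: the block PMD is updated in place. */
	ret = set_memory_ro(addr, PMD_SIZE >> PAGE_SHIFT);
	if (ret)
		return ret;

	/*
	 * Covers only part of the block: pageattr_pmd_entry() sees
	 * (next - addr) != PMD_SIZE, warns once and returns -EINVAL.
	 * Splitting the mapping beforehand is the caller's responsibility.
	 */
	ret = set_memory_ro(addr, 1);
	WARN_ON_ONCE(ret != -EINVAL);

	return set_memory_rw(addr, PMD_SIZE >> PAGE_SHIFT);
}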