From nobody Thu Apr 2 03:24:27 2026
From: Anshuman Khandual
To: linux-arm-kernel@lists.infradead.org
Cc: Anshuman Khandual, Catalin Marinas, Will Deacon, Ryan Roberts,
    David Hildenbrand, Yang Shi, Christoph Lameter,
    linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH V5 1/2] arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
Date: Mon, 9 Mar 2026 02:57:24 +0000
Message-Id: <20260309025725.455004-2-anshuman.khandual@arm.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20260309025725.455004-1-anshuman.khandual@arm.com>
References: <20260309025725.455004-1-anshuman.khandual@arm.com>

During a memory hot remove operation, both the linear and vmemmap mappings
for the memory range being removed get unmapped via unmap_hotplug_range(),
but mapped pages get freed only for the vmemmap mapping. This used to be a
simple sequential operation: each table entry was cleared, followed by a
leaf-specific TLB flush, followed by a memory free operation when
applicable. That approach was uniform for both vmemmap and linear
mappings.

But a linear mapping might contain CONT-marked block memory, where the
architecture requires that all entries in the range be cleared before a
TLB flush is issued. Hence batch all TLB flushes during the table
tear-down walk and finally issue a single flush in unmap_hotplug_range().

Prior to this fix, it was hypothetically possible for a speculative access
to a higher address in the contiguous block to fill the TLB with shattered
entries for the entire contiguous range after a lower address had already
been cleared and invalidated. Because the table entries had been
shattered, the subsequent TLB invalidation for the higher address would
not clear the TLB entries for the lower address, meaning stale TLB entries
could persist.

Besides the correctness fix, this also improves performance via a TLBI
range operation along with fewer synchronization instructions. The time
spent executing unmap_hotplug_range() improved by 97%, measured over a 2GB
memory hot removal in a KVM guest.
This scheme is not applicable during vmemmap mapping tear down, where
memory needs to be freed and hence a TLB flush is still required after
clearing each page table entry.

Cc: Catalin Marinas
Cc: Will Deacon
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Closes: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand (Arm)
Reviewed-by: Ryan Roberts
Signed-off-by: Ryan Roberts
Signed-off-by: Anshuman Khandual
---
 arch/arm64/mm/mmu.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6a00accf4f9..5dbf988120c8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1458,10 +1458,14 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 
 		WARN_ON(!pte_present(pte));
 		__pte_clear(&init_mm, addr, ptep);
-		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-		if (free_mapped)
+		if (free_mapped) {
+			/* CONT blocks are not supported in the vmemmap */
+			WARN_ON(pte_cont(pte));
+			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 			free_hotplug_page_range(pte_page(pte),
 						PAGE_SIZE, altmap);
+		}
+		/* unmap_hotplug_range() flushes TLB for !free_mapped */
 	} while (addr += PAGE_SIZE, addr < end);
 }
 
@@ -1482,15 +1486,14 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 		WARN_ON(!pmd_present(pmd));
 		if (pmd_sect(pmd)) {
 			pmd_clear(pmdp);
-
-			/*
-			 * One TLBI should be sufficient here as the PMD_SIZE
-			 * range is mapped with a single block entry.
-			 */
-			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-			if (free_mapped)
+			if (free_mapped) {
+				/* CONT blocks are not supported in the vmemmap */
+				WARN_ON(pmd_cont(pmd));
+				flush_tlb_kernel_range(addr, addr + PMD_SIZE);
 				free_hotplug_page_range(pmd_page(pmd),
 							PMD_SIZE, altmap);
+			}
+			/* unmap_hotplug_range() flushes TLB for !free_mapped */
 			continue;
 		}
 		WARN_ON(!pmd_table(pmd));
@@ -1515,15 +1518,12 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 		WARN_ON(!pud_present(pud));
 		if (pud_sect(pud)) {
 			pud_clear(pudp);
-
-			/*
-			 * One TLBI should be sufficient here as the PUD_SIZE
-			 * range is mapped with a single block entry.
-			 */
-			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-			if (free_mapped)
+			if (free_mapped) {
+				flush_tlb_kernel_range(addr, addr + PUD_SIZE);
 				free_hotplug_page_range(pud_page(pud),
 							PUD_SIZE, altmap);
+			}
+			/* unmap_hotplug_range() flushes TLB for !free_mapped */
 			continue;
 		}
 		WARN_ON(!pud_table(pud));
@@ -1553,6 +1553,7 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 				bool free_mapped, struct vmem_altmap *altmap)
 {
+	unsigned long start = addr;
 	unsigned long next;
 	pgd_t *pgdp, pgd;
 
@@ -1574,6 +1575,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 		WARN_ON(!pgd_present(pgd));
 		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
 	} while (addr = next, addr < end);
+
+	if (!free_mapped)
+		flush_tlb_kernel_range(start, end);
 }
 
 static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
-- 
2.30.2

From nobody Thu Apr 2 03:24:27 2026
From: Anshuman Khandual
To: linux-arm-kernel@lists.infradead.org
Cc: Anshuman Khandual, Catalin Marinas, Will Deacon, Ryan Roberts,
    David Hildenbrand, Yang Shi, Christoph Lameter,
    linux-kernel@vger.kernel.org
Subject: [PATCH V5 2/2] arm64/mm: Reject memory removal that splits a kernel leaf mapping
Date: Mon, 9 Mar 2026 02:57:25 +0000
Message-Id: <20260309025725.455004-3-anshuman.khandual@arm.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20260309025725.455004-1-anshuman.khandual@arm.com>
References: <20260309025725.455004-1-anshuman.khandual@arm.com>

Linear and vmemmap mappings that get torn down during a memory hot remove
operation might contain leaf-level entries at any page table level. If the
requested memory range's linear or vmemmap mappings fall within such leaf
entries, new mappings need to be created for the remaining memory that was
previously mapped by the leaf entry, following the standard
break-before-make (BBM) rules. But the kernel cannot tolerate BBM, and
hence remapping to finer-grained leaves would not be possible on systems
without BBML2_NOABORT. Currently the memory hot remove operation does not
perform such restructuring, so memory ranges whose removal would split a
kernel leaf-level mapping need to be rejected.

While memory_hotplug.c does appear to permit hot removing arbitrary ranges
of memory, the higher layers that drive memory_hotplug (e.g. ACPI, virtio,
...) all appear to treat memory as fixed-size devices. So it is impossible
to hot unplug a different amount than was previously hot plugged, and
hence we should never see a rejection in practice, but adding the check
makes us robust against a future change.
Cc: Catalin Marinas
Cc: Will Deacon
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
Reviewed-by: David Hildenbrand (Arm)
Reviewed-by: Ryan Roberts
Suggested-by: Ryan Roberts
Signed-off-by: Anshuman Khandual
---
 arch/arm64/mm/mmu.c | 120 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 114 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5dbf988120c8..5fb9a66f0754 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -2014,6 +2014,107 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 
+
+static bool addr_splits_kernel_leaf(unsigned long addr)
+{
+	pgd_t *pgdp, pgd;
+	p4d_t *p4dp, p4d;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	/*
+	 * If the given address points at the start address of
+	 * a possible leaf, we certainly won't split. Otherwise,
+	 * check if we would actually split a leaf by traversing
+	 * the page tables further.
+	 */
+	if (IS_ALIGNED(addr, PGDIR_SIZE))
+		return false;
+
+	pgdp = pgd_offset_k(addr);
+	pgd = pgdp_get(pgdp);
+	if (!pgd_present(pgd))
+		return false;
+
+	if (IS_ALIGNED(addr, P4D_SIZE))
+		return false;
+
+	p4dp = p4d_offset(pgdp, addr);
+	p4d = p4dp_get(p4dp);
+	if (!p4d_present(p4d))
+		return false;
+
+	if (IS_ALIGNED(addr, PUD_SIZE))
+		return false;
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	if (!pud_present(pud))
+		return false;
+
+	if (pud_leaf(pud))
+		return true;
+
+	if (IS_ALIGNED(addr, CONT_PMD_SIZE))
+		return false;
+
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get(pmdp);
+	if (!pmd_present(pmd))
+		return false;
+
+	if (pmd_cont(pmd))
+		return true;
+
+	if (IS_ALIGNED(addr, PMD_SIZE))
+		return false;
+
+	if (pmd_leaf(pmd))
+		return true;
+
+	if (IS_ALIGNED(addr, CONT_PTE_SIZE))
+		return false;
+
+	ptep = pte_offset_kernel(pmdp, addr);
+	pte = __ptep_get(ptep);
+	if (!pte_present(pte))
+		return false;
+
+	if (pte_cont(pte))
+		return true;
+
+	return !IS_ALIGNED(addr, PAGE_SIZE);
+}
+
+static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages)
+{
+	unsigned long phys_start, phys_end, start, end;
+
+	phys_start = PFN_PHYS(pfn);
+	phys_end = phys_start + nr_pages * PAGE_SIZE;
+
+	/* PFN range's linear map edges are leaf entry aligned */
+	start = __phys_to_virt(phys_start);
+	end = __phys_to_virt(phys_end);
+	if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
+		pr_warn("[%lx %lx] splits a leaf entry in linear map\n",
+			phys_start, phys_end);
+		return false;
+	}
+
+	/* PFN range's vmemmap edges are leaf entry aligned */
+	BUILD_BUG_ON(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP));
+	start = (unsigned long)pfn_to_page(pfn);
+	end = (unsigned long)pfn_to_page(pfn + nr_pages);
+	if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
+		pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n",
+			phys_start, phys_end);
+		return false;
+	}
+	return true;
+}
+
 /*
  * This memory hotplug notifier helps prevent boot memory from being
  * inadvertently removed as it blocks pfn range offlining process in
@@ -2022,8 +2123,11 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
  * In future if and when boot memory could be removed, this notifier
  * should be dropped and free_hotplug_page_range() should handle any
  * reserved pages allocated during boot.
+ *
+ * This also blocks any memory remove that would have caused a split
+ * in leaf entry in kernel linear or vmemmap mapping.
  */
-static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
+static int prevent_memory_remove_notifier(struct notifier_block *nb,
 					   unsigned long action, void *data)
 {
 	struct mem_section *ms;
@@ -2069,11 +2173,15 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
 			return NOTIFY_DONE;
 		}
 	}
+
+	if (!can_unmap_without_split(pfn, arg->nr_pages))
+		return NOTIFY_BAD;
+
 	return NOTIFY_OK;
 }
 
-static struct notifier_block prevent_bootmem_remove_nb = {
-	.notifier_call = prevent_bootmem_remove_notifier,
+static struct notifier_block prevent_memory_remove_nb = {
+	.notifier_call = prevent_memory_remove_notifier,
 };
 
 /*
@@ -2123,7 +2231,7 @@ static void validate_bootmem_online(void)
 	}
 }
 
-static int __init prevent_bootmem_remove_init(void)
+static int __init prevent_memory_remove_init(void)
 {
 	int ret = 0;
 
@@ -2131,13 +2239,13 @@ static int __init prevent_bootmem_remove_init(void)
 		return ret;
 
 	validate_bootmem_online();
-	ret = register_memory_notifier(&prevent_bootmem_remove_nb);
+	ret = register_memory_notifier(&prevent_memory_remove_nb);
 	if (ret)
 		pr_err("%s: Notifier registration failed %d\n", __func__, ret);
 
 	return ret;
 }
-early_initcall(prevent_bootmem_remove_init);
+early_initcall(prevent_memory_remove_init);
 #endif
 
 pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
-- 
2.30.2