From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Lu Baolu
Subject: [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables
Date: Fri, 19 Sep 2025 13:39:59 +0800
Message-ID: <20250919054007.472493-2-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Dave Hansen

The page tables used to map the kernel and userspace often have very
different handling rules. There are frequently *_kernel() variants of
functions just for kernel page tables. That's not great and has led
to code duplication.

Instead of having completely separate call paths, allow a 'ptdesc' to
be marked as being for kernel mappings. Introduce helpers to set and
clear this status.

Note: this uses the PG_referenced bit. Page flags are a great fit for
this since it is truly a single bit of information. Use PG_referenced
itself because it's a fairly benign flag (as opposed to things like
PG_lock). It's also (according to Willy) unlikely to go away any time
soon.

PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
be cleared before freeing the page, and pages coming out of the
allocator should have it cleared. Regardless, introduce an API to
clear it anyway. Having symmetry in the API makes it easier to change
the underlying implementation later, like if there was a need to move
to a PAGE_FLAGS_CHECK_AT_FREE bit.

Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 include/linux/page-flags.h | 46 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 8d3fa3a91ce4..1d82fb6fffe5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1244,6 +1244,52 @@ static inline int folio_has_private(const struct folio *folio)
 	return !!(folio->flags & PAGE_FLAGS_PRIVATE);
 }
 
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+	struct folio *folio = ptdesc_folio(ptdesc);
+
+	folio_set_referenced(folio);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+	struct folio *folio = ptdesc_folio(ptdesc);
+
+	/*
+	 * Note: the 'PG_referenced' bit does not strictly need to be
+	 * cleared before freeing the page. But this is nice for
+	 * symmetry.
+	 */
+	folio_clear_referenced(folio);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc is used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
+{
+	struct folio *folio = ptdesc_folio(ptdesc);
+
+	return folio_test_referenced(folio);
+}
+
 #undef PF_ANY
 #undef PF_HEAD
 #undef PF_NO_TAIL
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Lu Baolu
Subject: [PATCH v5 2/8] mm: Actually mark kernel page table pages
Date: Fri, 19 Sep 2025 13:40:00 +0800
Message-ID: <20250919054007.472493-3-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Dave Hansen

Now that the API is in place, mark kernel page table pages just
after they are allocated. Unmark them just before they are freed.

Note: Unconditionally clearing the 'kernel' marking (via
ptdesc_clear_kernel()) would be functionally identical to what
is here. But having the if() makes it logically clear that this
function can be used for kernel and non-kernel page tables.

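The note about conditional versus unconditional clearing can be checked
with a small userspace model. This is only a sketch, not kernel code:
struct toy_ptdesc, TOY_PT_KERNEL and every toy_* function are
hypothetical stand-ins for the ptdesc, the PG_referenced bit and the
ptdesc_*_kernel() helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for 'struct ptdesc'; only the flag word is modelled. */
struct toy_ptdesc {
	unsigned long flags;
};

/* Hypothetical bit standing in for PG_referenced. */
#define TOY_PT_KERNEL (1UL << 0)

static void toy_set_kernel(struct toy_ptdesc *pt)
{
	pt->flags |= TOY_PT_KERNEL;
}

static void toy_clear_kernel(struct toy_ptdesc *pt)
{
	pt->flags &= ~TOY_PT_KERNEL;
}

static bool toy_test_kernel(const struct toy_ptdesc *pt)
{
	return (pt->flags & TOY_PT_KERNEL) != 0;
}

/* Allocate a table and mark it only when it is a kernel table. */
static struct toy_ptdesc *toy_alloc(bool for_kernel)
{
	struct toy_ptdesc *pt = calloc(1, sizeof(*pt));

	if (pt && for_kernel)
		toy_set_kernel(pt);
	return pt;
}

/* Unmark the way the patch does: test first, then clear. */
static unsigned long toy_unmark_conditional(struct toy_ptdesc *pt)
{
	if (toy_test_kernel(pt))
		toy_clear_kernel(pt);
	return pt->flags;
}

/* The alternative from the note: clear without testing. */
static unsigned long toy_unmark_unconditional(struct toy_ptdesc *pt)
{
	toy_clear_kernel(pt);
	return pt->flags;
}
```

In this model both unmark variants leave identical flags for marked and
unmarked tables, so the if() only documents intent.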
Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
 include/linux/mm.h            |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..b9d2a7c79b93 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
 		return NULL;
 	}
 
+	ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pte_alloc_one_kernel(...)	alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define pmd_alloc_one(...)	alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
@@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
 		return NULL;
 
 	pagetable_pud_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pud_alloc_one(...)	alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
 		return NULL;
 
 	pagetable_p4d_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __p4d_alloc_one(...)	alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
 		return NULL;
 
 	pagetable_pgd_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pgd_alloc(...)	alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..f3db3a5ebefe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2895,6 +2895,9 @@ static inline void pagetable_free(struct ptdesc *pt)
 {
 	struct page *page = ptdesc_page(pt);
 
+	if (ptdesc_test_kernel(pt))
+		ptdesc_clear_kernel(pt);
+
 	__free_pages(page, compound_order(page));
 }
 
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Lu Baolu
Subject: [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
Date: Fri, 19 Sep 2025 13:40:01 +0800
Message-ID: <20250919054007.472493-4-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Dave Hansen

There are a billion ways to refer to a physical memory address.
One of the x86 PMD freeing code locations chooses to use a 'pte_t *'
to point to a PMD page and then call a PTE-specific freeing function
for it. That's a bit wonky.

Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
to page table pages. It also means being able to remove an explicit
cast.

Right now, pte_free_kernel() is a one-liner that calls
pagetable_dtor_free(). Effectively, all this patch does is
remove one superfluous __pa(__va(paddr)) conversion and then
call pagetable_dtor_free() directly instead of through a helper.

Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 arch/x86/mm/pgtable.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..2e5ecfdce73c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
 int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 {
 	pmd_t *pmd, *pmd_sv;
-	pte_t *pte;
+	struct ptdesc *pt;
 	int i;
 
 	pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		if (!pmd_none(pmd_sv[i])) {
-			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
-			pte_free_kernel(&init_mm, pte);
+			pt = page_ptdesc(pmd_page(pmd_sv[i]));
+			pagetable_dtor_free(pt);
 		}
 	}
 
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
  */
 int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 {
-	pte_t *pte;
+	struct ptdesc *pt;
 
-	pte = (pte_t *)pmd_page_vaddr(*pmd);
+	pt = page_ptdesc(pmd_page(*pmd));
 	pmd_clear(pmd);
 
 	/* INVLPG to clear all paging-structure caches */
 	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
 
-	pte_free_kernel(&init_mm, pte);
+	pagetable_dtor_free(pt);
 
 	return 1;
 }
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Lu Baolu
Subject: [PATCH v5 4/8] mm: Introduce pure page table freeing function
Date: Fri, 19 Sep 2025 13:40:02 +0800
Message-ID: <20250919054007.472493-5-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Dave Hansen

The pages used for ptdescs are currently freed back to the allocator
in a single location. They will shortly be freed from a second
location.

Create a simple helper that just frees them back to the allocator.

Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 include/linux/mm.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3db3a5ebefe..668d519edc0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2884,6 +2884,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
 }
 #define pagetable_alloc(...)	alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
 
+static inline void __pagetable_free(struct ptdesc *pt)
+{
+	struct page *page = ptdesc_page(pt);
+
+	__free_pages(page, compound_order(page));
+}
+
 /**
  * pagetable_free - Free pagetables
  * @pt: The page table descriptor
@@ -2893,12 +2900,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
  */
 static inline void pagetable_free(struct ptdesc *pt)
 {
-	struct page *page = ptdesc_page(pt);
-
 	if (ptdesc_test_kernel(pt))
 		ptdesc_clear_kernel(pt);
 
-	__free_pages(page, compound_order(page));
+	__pagetable_free(pt);
 }
 
 #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Lu Baolu
Subject: [PATCH v5 5/8] x86/mm: Use pagetable_free()
Date: Fri, 19 Sep 2025 13:40:03 +0800
Message-ID: <20250919054007.472493-6-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The kernel's memory management subsystem provides a dedicated
interface, pagetable_free(), for freeing page table pages. Update two
call sites to use pagetable_free() instead of the lower-level
__free_page() or free_pages(). This improves code consistency and
clarity, and ensures the correct freeing mechanism is used.

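Why routing every free through the dedicated interface matters can be
sketched with a toy userspace model. None of this is kernel code:
struct toy_ptdesc, toy_raw_free() and toy_pagetable_free() are
hypothetical stand-ins for the ptdesc, the lower-level page free calls
and pagetable_free().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page-table descriptor: a 'kernel' mark and a freed state. */
struct toy_ptdesc {
	bool kernel;	/* stands in for the ptdesc 'kernel' mark */
	bool freed;
};

/* Counts how often the kernel-specific handling actually ran. */
static int toy_kernel_frees;

/* Lower-level free: knows nothing about page-table policy. */
static void toy_raw_free(struct toy_ptdesc *pt)
{
	pt->freed = true;
}

/* Dedicated interface: applies page-table policy, then frees. */
static void toy_pagetable_free(struct toy_ptdesc *pt)
{
	if (pt->kernel) {
		pt->kernel = false;
		toy_kernel_frees++;
	}
	toy_raw_free(pt);
}
```

A call site that uses toy_raw_free() directly never reaches the
kernel-specific branch, while toy_pagetable_free() always does for
marked tables.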
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
---
 arch/x86/mm/init_64.c        | 2 +-
 arch/x86/mm/pat/set_memory.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b9426fce5f3e..3d9a5e4ccaa4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 		free_reserved_pages(page, nr_pages);
 #endif
 	} else {
-		free_pages((unsigned long)page_address(page), order);
+		pagetable_free(page_ptdesc(page));
 	}
 }
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8834c76f91c9..8b78a8855024 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -438,7 +438,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 
 	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
 		list_del(&ptdesc->pt_list);
-		__free_page(ptdesc_page(ptdesc));
+		pagetable_free(ptdesc);
 	}
 }
 
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
 Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
 Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Lu Baolu
Subject: [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
Date: Fri, 19 Sep 2025 13:40:04 +0800
Message-ID: <20250919054007.472493-7-baolu.lu@linux.intel.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Dave Hansen

This introduces a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing
of pages that are used as page tables for kernel address mappings. These
pages are now queued to a work struct instead of being freed immediately.

This deferred freeing provides a safe context for a future patch to add
an IOMMU-specific callback, which might be expensive on large-scale
systems. This ensures the necessary IOMMU cache invalidation is
performed before the page is finally returned to the page allocator
outside of any critical, non-sleepable path.


Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 include/linux/mm.h   | 16 +++++++++++++---
 mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 668d519edc0f..2d7b4af40442 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
 	__free_pages(page, compound_order(page));
 }
 
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+void pagetable_free_kernel(struct ptdesc *pt);
+#else
+static inline void pagetable_free_kernel(struct ptdesc *pt)
+{
+	__pagetable_free(pt);
+}
+#endif
 /**
  * pagetable_free - Free pagetables
  * @pt:	The page table descriptor
@@ -2900,10 +2908,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
  */
 static inline void pagetable_free(struct ptdesc *pt)
 {
-	if (ptdesc_test_kernel(pt))
+	if (ptdesc_test_kernel(pt)) {
 		ptdesc_clear_kernel(pt);
-
-	__pagetable_free(pt);
+		pagetable_free_kernel(pt);
+	} else {
+		__pagetable_free(pt);
+	}
 }
 
 #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..0279399d4910 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(pte, ptl);
 	goto again;
 }
+
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+	struct list_head list;
+	/* protect above ptdesc lists */
+	spinlock_t lock;
+	struct work_struct work;
+} kernel_pgtable_work = {
+	.list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+	.lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+	struct ptdesc *pt, *next;
+	LIST_HEAD(page_list);
+
+	spin_lock(&kernel_pgtable_work.lock);
+	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+	spin_unlock(&kernel_pgtable_work.lock);
+
+	list_for_each_entry_safe(pt, next, &page_list, pt_list)
+		__pagetable_free(pt);
+}
+
+void pagetable_free_kernel(struct ptdesc *pt)
+{
+	spin_lock(&kernel_pgtable_work.lock);
+	list_add(&pt->pt_list, &kernel_pgtable_work.list);
+	spin_unlock(&kernel_pgtable_work.lock);
+
+	schedule_work(&kernel_pgtable_work.work);
+}
+#endif
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
	Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alistair Popple, Peter Zijlstra,
	Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen,
	Lu Baolu
Subject: [PATCH v5 7/8] mm: Hook up Kconfig options for async page table freeing
Date: Fri, 19 Sep 2025 13:40:05 +0800
Message-ID: <20250919054007.472493-8-baolu.lu@linux.intel.com>
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>

From: Dave Hansen

The CONFIG_ASYNC_PGTABLE_FREE option controls whether an architecture
requires asynchronous page table freeing. On x86, this is selected if
IOMMU_SVA is enabled, because both Intel and AMD IOMMU architectures
could potentially cache kernel page table entries in their paging
structure cache, regardless of the permission.

Signed-off-by: Dave Hansen
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig       | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52c8910ba2ef..247caac65e22 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -281,6 +281,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select ASYNC_PGTABLE_FREE if IOMMU_SVA
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..1576409cec03 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -920,6 +920,9 @@ config PAGE_MAPCOUNT
 config PGTABLE_HAS_HUGE_LEAVES
 	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
 
+config ASYNC_PGTABLE_FREE
+	def_bool n
+
 # TODO: Allow to be enabled without THP
 config ARCH_SUPPORTS_HUGE_PFNMAP
 	def_bool n
-- 
2.43.0

From nobody Thu Oct 2 07:46:32 2025
From: Lu Baolu
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe,
	Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alistair Popple, Peter Zijlstra,
	Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu@lists.linux.dev, security@kernel.org, x86@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Lu Baolu,
	stable@vger.kernel.org
Subject: [PATCH v5 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space
Date: Fri, 19 Sep 2025 13:40:06 +0800
Message-ID: <20250919054007.472493-9-baolu.lu@linux.intel.com>
In-Reply-To: <20250919054007.472493-1-baolu.lu@linux.intel.com>
References: <20250919054007.472493-1-baolu.lu@linux.intel.com>

In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables.
The x86 architecture maps the kernel's virtual address space into the
upper portion of every process's page table. Consequently, in an SVA
context, the IOMMU hardware can walk and cache kernel page table entries.

The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.

A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a Write-After-Free
issue, as the IOMMU will potentially continue to write Accessed and Dirty
bits to the freed memory while attempting to walk the stale page tables.

Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.

To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically before any kernel page table page is freed and reused.

This addresses the main issue with vfree() which is a common occurrence
and can be triggered by unprivileged users.
While this resolves the primary problem, it doesn't address some extremely
rare case related to memory unplug of memory that was present as reserved
memory at boot, which cannot be triggered by unprivileged users. The
discussion can be found at the link below.

Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn
Co-developed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Vasant Hegde
Reviewed-by: Kevin Tian
Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/
---
 drivers/iommu/iommu-sva.c | 29 ++++++++++++++++++++++++++++-
 include/linux/iommu.h     |  4 ++++
 mm/pgtable-generic.c      |  2 ++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d236aef80a8d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
 #include "iommu-priv.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
 static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 						   struct mm_struct *mm);
 
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
 		return ERR_PTR(-ENOSPC);
 	}
 	iommu_mm->pasid = pasid;
+	iommu_mm->mm = mm;
 	INIT_LIST_HEAD(&iommu_mm->sva_domains);
 	/*
 	 * Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (ret)
 		goto out_free_domain;
 	domain->users = 1;
-	list_add(&domain->next, &mm->iommu_mm->sva_domains);
 
+	if (list_empty(&iommu_mm->sva_domains)) {
+		if (list_empty(&iommu_sva_mms))
+			iommu_sva_present = true;
+		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+	}
+	list_add(&domain->next, &iommu_mm->sva_domains);
 out:
 	refcount_set(&handle->users, 1);
 	mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
 		list_del(&domain->next);
 		iommu_domain_free(domain);
 	}
+
+	if (list_empty(&iommu_mm->sva_domains)) {
+		list_del(&iommu_mm->mm_list_elm);
+		if (list_empty(&iommu_sva_mms))
+			iommu_sva_present = false;
+	}
+
 	mutex_unlock(&iommu_sva_lock);
 	kfree(handle);
 }
@@ -312,3 +327,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 
 	return domain;
 }
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+	struct iommu_mm_data *iommu_mm;
+
+	guard(mutex)(&iommu_sva_lock);
+	if (!iommu_sva_present)
+		return;
+
+	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
 
 struct iommu_mm_data {
 	u32			pasid;
+	struct mm_struct	*mm;
 	struct list_head	sva_domains;
+	struct list_head	mm_list_elm;
 };
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
 #else
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
 }
 
 static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
 #endif /* CONFIG_IOMMU_SVA */
 
 #ifdef CONFIG_IOMMU_IOPF
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0279399d4910..2717dc9afff0 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -430,6 +431,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
 	spin_unlock(&kernel_pgtable_work.lock);
 
+	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
 		__pagetable_free(pt);
 }
-- 
2.43.0
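
For readers following the series, the ordering that patches 6/8 and 8/8
establish (queue the page table page, later detach the whole batch,
invalidate once, and only then free) can be sketched in plain userspace C.
This is an illustration under stated assumptions, not kernel code:
fake_ptdesc, defer_free(), drain_pending() and fake_invalidate_kva() are
hypothetical stand-ins for struct ptdesc, pagetable_free_kernel(),
kernel_pgtable_work_func() and iommu_sva_invalidate_kva_range(), and the
spinlock and workqueue are left out, so the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ptdesc: one page-table page awaiting free. */
struct fake_ptdesc {
	struct fake_ptdesc *next;
};

/* Pending list. The kernel version guards its list with a spinlock. */
static struct fake_ptdesc *pending;
static int flush_count;	/* how many times the fake invalidation ran */
static int freed_count;	/* how many descriptors were actually freed */

/* Hypothetical stand-in for iommu_sva_invalidate_kva_range(). */
static void fake_invalidate_kva(void)
{
	flush_count++;
}

/* Analogue of pagetable_free_kernel(): queue the page, do not free it yet. */
static void defer_free(struct fake_ptdesc *pt)
{
	assert(pt != NULL);
	pt->next = pending;
	pending = pt;
}

/*
 * Analogue of kernel_pgtable_work_func(): detach the batch, invalidate
 * once for the whole batch, and only then return the memory.
 */
static void drain_pending(void)
{
	struct fake_ptdesc *batch = pending;
	struct fake_ptdesc *next;

	pending = NULL;		/* later frees land on a fresh list */
	fake_invalidate_kva();	/* must happen before any page is reused */

	for (; batch != NULL; batch = next) {
		next = batch->next;
		free(batch);
		freed_count++;
	}
}
```

Calling defer_free() several times and then drain_pending() once runs a
single invalidation for the whole batch, which is the property the commit
message of 6/8 relies on when it says the callback "might be expensive on
large-scale systems".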