From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 72F93CA9EBD for ; Tue, 13 Jun 2023 00:12:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238885AbjFMAMf (ORCPT ); Mon, 12 Jun 2023 20:12:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51896 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232682AbjFMAMW (ORCPT ); Mon, 12 Jun 2023 20:12:22 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1F580118; Mon, 12 Jun 2023 17:12:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615130; x=1718151130; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=+8Ah+gwuuWVdk3MaZgJFhmVmdzU+WHC34h/afb7gaAY=; b=LQp277AQF+AhGUDJYeavNrVreevL5C/aPRLewBH6YDBgycnSvl3F6Zic UdKOSVFsHGWZfRHgpMraTnzzOGiu8tssta+tkrce18NOoDtE+d+RTdpMx rUOQaxVuzGX58Y9Bcc+eqJV7qdXY1fxFZXWYCcy3hj4MxnlA+RE5eTn9o pTq8M+EE+civ9APJKvQlQIouV2x1ThvFxQ5fOkSkKtS8jmpDDP2fGj/7r h/qW7YxUCQiAtc0DfCJvs4ev2i/zqM4FEJ/jaoyzKPOpQjBb1YQyOX3L9 YLHHbCIhacZOluir8aC086+WlAgAp2Cyus0tp8W1QcnMuLQqbmryQAmY9 Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556651" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556651" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:07 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670967" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670967" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:06 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, linux-alpha@vger.kernel.org, linux-snps-arc@lists.infradead.org, linux-arm-kernel@lists.infradead.org, linux-csky@vger.kernel.org, linux-hexagon@vger.kernel.org, linux-ia64@vger.kernel.org, loongarch@lists.linux.dev, linux-m68k@lists.linux-m68k.org, Michal Simek , Dinh Nguyen , linux-mips@vger.kernel.org, openrisc@lists.librecores.org, linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org, linux-sh@vger.kernel.org, sparclinux@vger.kernel.org, linux-um@lists.infradead.org, Linus Torvalds Subject: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Date: Mon, 12 Jun 2023 17:10:27 -0700 Message-Id: <20230613001108.3040476-2-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable memory or shadow stack memory. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(), create a new arch config HAS_HUGE_PAGE that can be used to tell if pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the compiler would not be able to find pmd_mkwrite_novma(). No functional change. Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-csky@vger.kernel.org Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: loongarch@lists.linux.dev Cc: linux-m68k@lists.linux-m68k.org Cc: Michal Simek Cc: Dinh Nguyen Cc: linux-mips@vger.kernel.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Suggested-by: Linus Torvalds Signed-off-by: Rick Edgecombe Link: https://lore.kernel.org/lkml/CAHk-=3DwiZjSu7c9sFYZb3q04108stgHff2wfbo= kGCCgW7riz+8Q@mail.gmail.com/ Acked-by: David Hildenbrand Acked-by: Geert Uytterhoeven Acked-by: Helge Deller # parisc Reviewed-by: Mike Rapoport (IBM) --- Hi Non-x86 Arch=E2=80=99s, x86 has a feature that allows for the creation of a special type of writable memory (shadow stack) that is only writable in limited specific ways. Previously, changes were proposed to core MM code to teach it to decide when to create normally writable memory or the special shadow stack writable memory, but David Hildenbrand suggested[0] to change pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be moved into x86 code. Later Linus suggested a less error-prone way[1] to go about this after the first attempt had a bug. Since pXX_mkwrite() is defined in every arch, it requires some tree-wide changes. So that is why you are seeing some patches out of a big x86 series pop up in your arch mailing list. There is no functional change. After this refactor, the shadow stack series goes on to use the arch helpers to push arch memory details inside arch/x86 and other arch's with upcoming shadow stack features. Testing was just 0-day build testing. Hopefully that is enough context. Thanks! [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redha= t.com/ [1] https://lore.kernel.org/lkml/CAHk-=3DwiZjSu7c9sFYZb3q04108stgHff2wfbokG= CCgW7riz+8Q@mail.gmail.com/ --- Documentation/mm/arch_pgtable_helpers.rst | 6 ++++++ arch/Kconfig | 3 +++ arch/alpha/include/asm/pgtable.h | 2 +- arch/arc/include/asm/hugepage.h | 2 +- arch/arc/include/asm/pgtable-bits-arcv2.h | 2 +- arch/arm/include/asm/pgtable-3level.h | 2 +- arch/arm/include/asm/pgtable.h | 2 +- arch/arm64/include/asm/pgtable.h | 4 ++-- arch/csky/include/asm/pgtable.h | 2 +- arch/hexagon/include/asm/pgtable.h | 2 +- arch/ia64/include/asm/pgtable.h | 2 +- arch/loongarch/include/asm/pgtable.h | 4 ++-- arch/m68k/include/asm/mcf_pgtable.h | 2 +- arch/m68k/include/asm/motorola_pgtable.h | 2 +- arch/m68k/include/asm/sun3_pgtable.h | 2 +- arch/microblaze/include/asm/pgtable.h | 2 +- arch/mips/include/asm/pgtable.h | 6 +++--- arch/nios2/include/asm/pgtable.h | 2 +- arch/openrisc/include/asm/pgtable.h | 2 +- arch/parisc/include/asm/pgtable.h | 2 +- arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++-- arch/powerpc/include/asm/nohash/32/pgtable.h | 4 ++-- arch/powerpc/include/asm/nohash/32/pte-8xx.h | 4 ++-- arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +- arch/riscv/include/asm/pgtable.h | 6 +++--- arch/s390/include/asm/hugetlb.h | 2 +- arch/s390/include/asm/pgtable.h | 4 ++-- arch/sh/include/asm/pgtable_32.h | 4 ++-- arch/sparc/include/asm/pgtable_32.h | 2 +- arch/sparc/include/asm/pgtable_64.h | 6 +++--- arch/um/include/asm/pgtable.h | 2 +- arch/x86/include/asm/pgtable.h | 4 ++-- arch/xtensa/include/asm/pgtable.h | 2 +- include/asm-generic/hugetlb.h | 2 +- include/linux/pgtable.h | 14 ++++++++++++++ 36 files changed, 70 insertions(+), 47 deletions(-) diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/a= rch_pgtable_helpers.rst index af3891f895b0..69ce1f2aa4d1 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -48,6 +48,9 @@ PTE Page Table Helpers +---------------------------+---------------------------------------------= -----+ | pte_mkwrite | Creates a writable PTE = | +---------------------------+---------------------------------------------= -----+ +| pte_mkwrite_novma | Creates a writable PTE, of the conventional = type | +| | of writable. = | ++---------------------------+---------------------------------------------= -----+ | pte_wrprotect | Creates a write protected PTE = | +---------------------------+---------------------------------------------= -----+ | pte_mkspecial | Creates a special PTE = | @@ -120,6 +123,9 @@ PMD Page Table Helpers +---------------------------+---------------------------------------------= -----+ | pmd_mkwrite | Creates a writable PMD = | +---------------------------+---------------------------------------------= -----+ +| pmd_mkwrite_novma | Creates a writable PMD, of the conventional = type | +| | of writable. = | ++---------------------------+---------------------------------------------= -----+ | pmd_wrprotect | Creates a write protected PMD = | +---------------------------+---------------------------------------------= -----+ | pmd_mkspecial | Creates a special PMD = | diff --git a/arch/Kconfig b/arch/Kconfig index 205fd23e0cad..3bc11c9a2ac1 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -919,6 +919,9 @@ config HAVE_ARCH_HUGE_VMALLOC config ARCH_WANT_HUGE_PMD_SHARE bool =20 +config HAS_HUGE_PAGE + def_bool HAVE_ARCH_HUGE_VMAP || TRANSPARENT_HUGEPAGE || HUGETLBFS + config HAVE_ARCH_SOFT_DIRTY bool =20 diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgta= ble.h index ba43cb841d19..af1a13ab3320 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -256,7 +256,7 @@ extern inline int pte_young(pte_t pte) { return pte_va= l(pte) & _PAGE_ACCESSED; extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |=3D _PAGE_FOW= ; return pte; } extern inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &=3D ~(__DIRTY_B= ITS); return pte; } extern inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &=3D ~(__ACCESS_BI= TS); return pte; } -extern inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &=3D ~_PAGE_FOW;= return pte; } +extern inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &=3D ~_PAGE= _FOW; return pte; } extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |=3D __DIRTY_BIT= S; return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |=3D __ACCESS_BI= TS; return pte; } =20 diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepag= e.h index 5001b796fb8d..ef8d4166370c 100644 --- a/arch/arc/include/asm/hugepage.h +++ b/arch/arc/include/asm/hugepage.h @@ -21,7 +21,7 @@ static inline pmd_t pte_pmd(pte_t pte) } =20 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd))) -#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd))) +#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd))) #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd))) #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd))) #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/a= sm/pgtable-bits-arcv2.h index 6e9f8ca6d6a1..5c073d9f41c2 100644 --- a/arch/arc/include/asm/pgtable-bits-arcv2.h +++ b/arch/arc/include/asm/pgtable-bits-arcv2.h @@ -87,7 +87,7 @@ =20 PTE_BIT_FUNC(mknotpresent, &=3D ~(_PAGE_PRESENT)); PTE_BIT_FUNC(wrprotect, &=3D ~(_PAGE_WRITE)); -PTE_BIT_FUNC(mkwrite, |=3D (_PAGE_WRITE)); +PTE_BIT_FUNC(mkwrite_novma, |=3D (_PAGE_WRITE)); PTE_BIT_FUNC(mkclean, &=3D ~(_PAGE_DIRTY)); PTE_BIT_FUNC(mkdirty, |=3D (_PAGE_DIRTY)); PTE_BIT_FUNC(mkold, &=3D ~(_PAGE_ACCESSED)); diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/p= gtable-3level.h index 106049791500..71c3add6417f 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -202,7 +202,7 @@ static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) = op; return pmd; } =20 PMD_BIT_FUNC(wrprotect, |=3D L_PMD_SECT_RDONLY); PMD_BIT_FUNC(mkold, &=3D ~PMD_SECT_AF); -PMD_BIT_FUNC(mkwrite, &=3D ~L_PMD_SECT_RDONLY); +PMD_BIT_FUNC(mkwrite_novma, &=3D ~L_PMD_SECT_RDONLY); PMD_BIT_FUNC(mkdirty, |=3D L_PMD_SECT_DIRTY); PMD_BIT_FUNC(mkclean, &=3D ~L_PMD_SECT_DIRTY); PMD_BIT_FUNC(mkyoung, |=3D PMD_SECT_AF); diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index a58ccbb406ad..f37ba2472eae 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -227,7 +227,7 @@ static inline pte_t pte_wrprotect(pte_t pte) return set_pte_bit(pte, __pgprot(L_PTE_RDONLY)); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY)); } diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgta= ble.h index 0bd18de9fd97..7a3d62cb9bee 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -180,7 +180,7 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t pro= t) return pmd; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte =3D set_pte_bit(pte, __pgprot(PTE_WRITE)); pte =3D clear_pte_bit(pte, __pgprot(PTE_RDONLY)); @@ -487,7 +487,7 @@ static inline int pmd_trans_huge(pmd_t pmd) #define pmd_cont(pmd) pte_cont(pmd_pte(pmd)) #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd))) #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd))) -#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd))) +#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd))) #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd))) #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd))) #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtabl= e.h index d4042495febc..aa0cce4fc02f 100644 --- a/arch/csky/include/asm/pgtable.h +++ b/arch/csky/include/asm/pgtable.h @@ -176,7 +176,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; if (pte_val(pte) & _PAGE_MODIFIED) diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/= pgtable.h index 59393613d086..fc2d2d83368d 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -300,7 +300,7 @@ static inline pte_t pte_wrprotect(pte_t pte) } =20 /* pte_mkwrite - mark page as writable */ -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; return pte; diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtabl= e.h index 21c97e31a28a..f80aba7cad99 100644 --- a/arch/ia64/include/asm/pgtable.h +++ b/arch/ia64/include/asm/pgtable.h @@ -268,7 +268,7 @@ ia64_phys_addr_valid (unsigned long addr) * access rights: */ #define pte_wrprotect(pte) (__pte(pte_val(pte) & ~_PAGE_AR_RW)) -#define pte_mkwrite(pte) (__pte(pte_val(pte) | _PAGE_AR_RW)) +#define pte_mkwrite_novma(pte) (__pte(pte_val(pte) | _PAGE_AR_RW)) #define pte_mkold(pte) (__pte(pte_val(pte) & ~_PAGE_A)) #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A)) #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/= asm/pgtable.h index d28fb9dbec59..8245cf367b31 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -390,7 +390,7 @@ static inline pte_t pte_mkdirty(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; if (pte_val(pte) & _PAGE_MODIFIED) @@ -490,7 +490,7 @@ static inline int pmd_write(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_WRITE); } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { pmd_val(pmd) |=3D _PAGE_WRITE; if (pmd_val(pmd) & _PAGE_MODIFIED) diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mc= f_pgtable.h index d97fbb812f63..42ebea0488e3 100644 --- a/arch/m68k/include/asm/mcf_pgtable.h +++ b/arch/m68k/include/asm/mcf_pgtable.h @@ -211,7 +211,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D CF_PAGE_WRITABLE; return pte; diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/a= sm/motorola_pgtable.h index ec0dc19ab834..ba28ca4d219a 100644 --- a/arch/m68k/include/asm/motorola_pgtable.h +++ b/arch/m68k/include/asm/motorola_pgtable.h @@ -155,7 +155,7 @@ static inline int pte_young(pte_t pte) { return pte_va= l(pte) & _PAGE_ACCESSED; static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |=3D _PAGE_RON= LY; return pte; } static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &=3D ~_PAGE_DIRT= Y; return pte; } static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &=3D ~_PAGE_ACCESS= ED; return pte; } -static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &=3D ~_PAGE_RONL= Y; return pte; } +static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &=3D ~_PAGE= _RONLY; return pte; } static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |=3D _PAGE_DIRTY= ; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |=3D _PAGE_ACCES= SED; return pte; } static inline pte_t pte_mknocache(pte_t pte) diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/s= un3_pgtable.h index e582b0484a55..4114eaff7404 100644 --- a/arch/m68k/include/asm/sun3_pgtable.h +++ b/arch/m68k/include/asm/sun3_pgtable.h @@ -143,7 +143,7 @@ static inline int pte_young(pte_t pte) { return pte_va= l(pte) & SUN3_PAGE_ACCESS static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &=3D ~SUN3_PAG= E_WRITEABLE; return pte; } static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &=3D ~SUN3_PAGE_= MODIFIED; return pte; } static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &=3D ~SUN3_PAGE_AC= CESSED; return pte; } -static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |=3D SUN3_PAGE_W= RITEABLE; return pte; } +static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) |=3D SUN3_P= AGE_WRITEABLE; return pte; } static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |=3D SUN3_PAGE_M= ODIFIED; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |=3D SUN3_PAGE_A= CCESSED; return pte; } static inline pte_t pte_mknocache(pte_t pte) { pte_val(pte) |=3D SUN3_PAGE= _NOCACHE; return pte; } diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/includ= e/asm/pgtable.h index d1b8272abcd9..9108b33a7886 100644 --- a/arch/microblaze/include/asm/pgtable.h +++ b/arch/microblaze/include/asm/pgtable.h @@ -266,7 +266,7 @@ static inline pte_t pte_mkread(pte_t pte) \ { pte_val(pte) |=3D _PAGE_USER; return pte; } static inline pte_t pte_mkexec(pte_t pte) \ { pte_val(pte) |=3D _PAGE_USER | _PAGE_EXEC; return pte; } -static inline pte_t pte_mkwrite(pte_t pte) \ +static inline pte_t pte_mkwrite_novma(pte_t pte) \ { pte_val(pte) |=3D _PAGE_RW; return pte; } static inline pte_t pte_mkdirty(pte_t pte) \ { pte_val(pte) |=3D _PAGE_DIRTY; return pte; } diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtabl= e.h index 574fa14ac8b2..40a54fd6e48d 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -309,7 +309,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte.pte_low |=3D _PAGE_WRITE; if (pte.pte_low & _PAGE_MODIFIED) { @@ -364,7 +364,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; if (pte_val(pte) & _PAGE_MODIFIED) @@ -627,7 +627,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd) return pmd; } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { pmd_val(pmd) |=3D _PAGE_WRITE; if (pmd_val(pmd) & _PAGE_MODIFIED) diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgta= ble.h index 0f5c2564e9f5..cf1ffbc1a121 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -129,7 +129,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; return pte; diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/as= m/pgtable.h index 3eb9b9555d0d..828820c74fc5 100644 --- a/arch/openrisc/include/asm/pgtable.h +++ b/arch/openrisc/include/asm/pgtable.h @@ -250,7 +250,7 @@ static inline pte_t pte_mkold(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE; return pte; diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pg= table.h index e715df5385d6..79d1cef2fd7c 100644 --- a/arch/parisc/include/asm/pgtable.h +++ b/arch/parisc/include/asm/pgtable.h @@ -331,7 +331,7 @@ static inline pte_t pte_mkold(pte_t pte) { pte_val(pte)= &=3D ~_PAGE_ACCESSED; retu static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &=3D ~_PAGE_WR= ITE; return pte; } static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |=3D _PAGE_DIRTY= ; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |=3D _PAGE_ACCES= SED; return pte; } -static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITE= ; return pte; } +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE= _WRITE; return pte; } static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |=3D _PAGE_SPE= CIAL; return pte; } =20 /* diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/in= clude/asm/book3s/32/pgtable.h index 7bf1fe7297c6..67dfb674a4c1 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -498,7 +498,7 @@ static inline pte_t pte_mkpte(pte_t pte) return pte; } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) | _PAGE_RW); } diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/in= clude/asm/book3s/64/pgtable.h index 4acc9690f599..0328d917494a 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -600,7 +600,7 @@ static inline pte_t pte_mkexec(pte_t pte) return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC)); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { /* * write implies read, hence set both @@ -1071,7 +1071,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd) #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd))) #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd))) #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) -#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd))) +#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd))) =20 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY #define pmd_soft_dirty(pmd) pte_soft_dirty(pmd_pte(pmd)) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/in= clude/asm/nohash/32/pgtable.h index fec56d965f00..33213b31fcbb 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -170,8 +170,8 @@ void unmap_kernel_page(unsigned long va); #define pte_clear(mm, addr, ptep) \ do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0) =20 -#ifndef pte_mkwrite -static inline pte_t pte_mkwrite(pte_t pte) +#ifndef pte_mkwrite_novma +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) | _PAGE_RW); } diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/in= clude/asm/nohash/32/pte-8xx.h index 1a89ebdc3acc..21f681ee535a 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h @@ -101,12 +101,12 @@ static inline int pte_write(pte_t pte) =20 #define pte_write pte_write =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_RO); } =20 -#define pte_mkwrite pte_mkwrite +#define pte_mkwrite_novma pte_mkwrite_novma =20 static inline bool pte_user(pte_t pte) { diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/in= clude/asm/nohash/64/pgtable.h index 287e25864ffa..abe4fd82721e 100644 --- a/arch/powerpc/include/asm/nohash/64/pgtable.h +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h @@ -85,7 +85,7 @@ #ifndef __ASSEMBLY__ /* pte_clear moved to later in this file */ =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) | _PAGE_RW); } diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgta= ble.h index 2258b27173b0..b38faec98154 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -379,7 +379,7 @@ static inline pte_t pte_wrprotect(pte_t pte) =20 /* static inline pte_t pte_mkread(pte_t pte) */ =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) | _PAGE_WRITE); } @@ -665,9 +665,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) return pte_pmd(pte_mkyoung(pmd_pte(pmd))); } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { - return pte_pmd(pte_mkwrite(pmd_pte(pmd))); + return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd))); } =20 static inline pmd_t pmd_wrprotect(pmd_t pmd) diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetl= b.h index ccdbccfde148..f07267875a19 100644 --- a/arch/s390/include/asm/hugetlb.h +++ b/arch/s390/include/asm/hugetlb.h @@ -104,7 +104,7 @@ static inline int huge_pte_dirty(pte_t pte) =20 static inline pte_t huge_pte_mkwrite(pte_t pte) { - return pte_mkwrite(pte); + return pte_mkwrite_novma(pte); } =20 static inline pte_t huge_pte_mkdirty(pte_t pte) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtabl= e.h index 6822a11c2c8a..699406036f30 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1005,7 +1005,7 @@ static inline pte_t pte_wrprotect(pte_t pte) return set_pte_bit(pte, __pgprot(_PAGE_PROTECT)); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte =3D set_pte_bit(pte, __pgprot(_PAGE_WRITE)); if (pte_val(pte) & _PAGE_DIRTY) @@ -1488,7 +1488,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd) return set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_PROTECT)); } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { pmd =3D set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_WRITE)); if (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY) diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable= _32.h index 21952b094650..165b4fd08152 100644 --- a/arch/sh/include/asm/pgtable_32.h +++ b/arch/sh/include/asm/pgtable_32.h @@ -359,11 +359,11 @@ static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h= op; return pte; } * kernel permissions), we attempt to couple them a bit more sanely here. */ PTE_BIT_FUNC(high, wrprotect, &=3D ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN= _WRITE)); -PTE_BIT_FUNC(high, mkwrite, |=3D _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRI= TE); +PTE_BIT_FUNC(high, mkwrite_novma, |=3D _PAGE_EXT_USER_WRITE | _PAGE_EXT_KE= RN_WRITE); PTE_BIT_FUNC(high, mkhuge, |=3D _PAGE_SZHUGE); #else PTE_BIT_FUNC(low, wrprotect, &=3D ~_PAGE_RW); -PTE_BIT_FUNC(low, mkwrite, |=3D _PAGE_RW); +PTE_BIT_FUNC(low, mkwrite_novma, |=3D _PAGE_RW); PTE_BIT_FUNC(low, mkhuge, |=3D _PAGE_SZHUGE); #endif =20 diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/p= gtable_32.h index d4330e3c57a6..a2d909446539 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -241,7 +241,7 @@ static inline pte_t pte_mkold(pte_t pte) return __pte(pte_val(pte) & ~SRMMU_REF); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return __pte(pte_val(pte) | SRMMU_WRITE); } diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/p= gtable_64.h index 5563efa1a19f..4dd4f6cdc670 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -517,7 +517,7 @@ static inline pte_t pte_mkclean(pte_t pte) return __pte(val); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { unsigned long val =3D pte_val(pte), mask; =20 @@ -772,11 +772,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) return __pmd(pte_val(pte)); } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { pte_t pte =3D __pte(pmd_val(pmd)); =20 - pte =3D pte_mkwrite(pte); + pte =3D pte_mkwrite_novma(pte); =20 return __pmd(pte_val(pte)); } diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h index a70d1618eb35..46f59a8bc812 100644 --- a/arch/um/include/asm/pgtable.h +++ b/arch/um/include/asm/pgtable.h @@ -207,7 +207,7 @@ static inline pte_t pte_mkyoung(pte_t pte) return(pte); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { if (unlikely(pte_get_bits(pte, _PAGE_RW))) return pte; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 15ae4d6ba476..112e6060eafa 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -352,7 +352,7 @@ static inline pte_t pte_mkyoung(pte_t pte) return pte_set_flags(pte, _PAGE_ACCESSED); } =20 -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { return pte_set_flags(pte, _PAGE_RW); } @@ -453,7 +453,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_ACCESSED); } =20 -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_RW); } diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pg= table.h index fc7a14884c6c..27e3ae38a5de 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -262,7 +262,7 @@ static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |=3D _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |=3D _PAGE_ACCESSED; return pte; } -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |=3D _PAGE_WRITABLE; return pte; } =20 #define pgprot_noncached(prot) \ diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h index d7f6335d3999..4da02798a00b 100644 --- a/include/asm-generic/hugetlb.h +++ b/include/asm-generic/hugetlb.h @@ -22,7 +22,7 @@ static inline unsigned long huge_pte_dirty(pte_t pte) =20 static inline pte_t huge_pte_mkwrite(pte_t pte) { - return pte_mkwrite(pte); + return pte_mkwrite_novma(pte); } =20 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c5a51481bbb9..ae271a307584 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -507,6 +507,20 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_stru= ct *vma, pud_t *pudp); #endif =20 +#ifndef pte_mkwrite +static inline pte_t pte_mkwrite(pte_t pte) +{ + return pte_mkwrite_novma(pte); +} +#endif + +#if defined(CONFIG_HAS_HUGE_PAGE) && !defined(pmd_mkwrite) +static inline pmd_t pmd_mkwrite(pmd_t pmd) +{ + return pmd_mkwrite_novma(pmd); +} +#endif + #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT struct mm_struct; static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long = address, pte_t *ptep) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) client-ip=192.237.175.120; envelope-from=xen-devel-bounces@lists.xenproject.org; helo=lists.xenproject.org; Authentication-Results: mx.zohomail.com; dkim=pass header.i=@intel.com; spf=pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) smtp.mailfrom=xen-devel-bounces@lists.xenproject.org; dmarc=pass(p=none dis=none) header.from=intel.com ARC-Seal: i=1; a=rsa-sha256; t=1686615174; cv=none; d=zohomail.com; s=zohoarc; b=WKrL/gl+Qjehti5H5oi1OX5GsnDdnT2prOiqF7xlRnGKPr1wGWrLet1L60/woIIugXhkvzsZjH1ecaRJ4aHVabGLbutRWTWWAvycNQoHOzTbIUkozKM+rq1Yox6jonrnCnNQFPa/KVAvlXhjB50NnKvdGL4j0rXzK9UKpHIP/HU= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1686615174; h=Content-Type:Content-Transfer-Encoding:Cc:Date:From:In-Reply-To:List-Subscribe:List-Post:List-Id:List-Help:List-Unsubscribe:MIME-Version:Message-ID:References:Sender:Subject:To; bh=bhEM/cU7s2w8AGabEsZ/KJXI1WiWuC9Y86R+7+JHEQk=; b=lDroSuBbQak3a1DRKPIQdIkXQ3r2XylQYxEIa/pnlnKqIG8Eumo2Ad3psD0J5egEmRzqAlGMa2bff4VGn/dYlhvsCB0rIqKFudLfVSOXlSFDpoB+gf++632FIsLI7uRuZ9JnFpADxApjTW0XZ6Gg1Z9okxpHNLSJM8VdlGGJHGA= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=@intel.com; spf=pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) smtp.mailfrom=xen-devel-bounces@lists.xenproject.org; dmarc=pass header.from= (p=none dis=none) Return-Path: Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) by mx.zohomail.com with SMTPS id 1686615174193612.11955769701; Mon, 12 Jun 2023 17:12:54 -0700 (PDT) Received: from list by lists.xenproject.org with outflank-mailman.547697.855265 (Exim 4.92) (envelope-from ) id 1q8rdx-0007jd-28; Tue, 13 Jun 2023 00:12:17 +0000 Received: by outflank-mailman (output) from mailman id 547697.855265; Tue, 13 Jun 2023 00:12:17 +0000 Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1q8rdw-0007jW-VR; Tue, 13 Jun 2023 00:12:16 +0000 Received: by outflank-mailman (input) for mailman id 547697; Tue, 13 Jun 2023 00:12:15 +0000 Received: from se1-gles-sth1-in.inumbo.com ([159.253.27.254] helo=se1-gles-sth1.inumbo.com) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1q8rdv-0007jQ-9m for xen-devel@lists.xenproject.org; Tue, 13 Jun 2023 00:12:15 +0000 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by se1-gles-sth1.inumbo.com (Halon) with ESMTPS id f08fc573-097e-11ee-b232-6b7b168915f2; Tue, 13 Jun 2023 02:12:12 +0200 (CEST) Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:08 -0700 Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:07 -0700 X-Outflank-Mailman: Message body and most headers restored to incoming version X-BeenThere: xen-devel@lists.xenproject.org List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" X-Inumbo-ID: f08fc573-097e-11ee-b232-6b7b168915f2 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615132; x=1718151132; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=3c77rqX6YUy6Cyz5r8ePlQOYiEkqAYIhDQVjx5oVoZw=; b=g+ZYXoAlr+ngok4F9bueFmrFBdo0jxaex0ZmEZJed+xR1ofr5h0nf485 WnWKuNjlAuF6oNqVWVs8tSKc3DS1bD2TMpsyzRsrzzns3ABqZgeTxSpYE QNB7mulmMPh57NMK1WLxrtb/hyAJ9/20uUSy+mMKKw+zDDVnwPRCeiBLX pxHylPw8z3BfYu52D/sNKlYsTFdnRr3zw0ipaQ4ep9DjSqrs7lrLWIaBb lepqbFnNWl7785sGMGIxfisGjct0I2pKOOFNxpQE2T2KFZ1nwX9iWG+JT l2hRaM86KUHcJ300P2QP6N7JdTdAqhiblaMfmLhAtAE0r3VALRPHU+dg7 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556695" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556695" X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670972" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670972" From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, linux-arm-kernel@lists.infradead.org, linux-s390@vger.kernel.org, xen-devel@lists.xenproject.org Subject: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Date: Mon, 12 Jun 2023 17:10:28 -0700 Message-Id: <20230613001108.3040476-3-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-ZohoMail-DKIM: pass (identity @intel.com) X-ZM-MESSAGEID: 1686615175275100001 The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable memory or shadow stack memory. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous patches have done the first step, so next move the callers that don't have a VMA to pte_mkwrite_novma(). Also do the same for pmd_mkwrite(). This will be ok for the shadow stack feature, as these callers are on kernel memory which will not need to be made shadow stack, and the other architectures only currently support one type of memory in pte_mkwrite() Cc: linux-doc@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: xen-devel@lists.xenproject.org Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Rick Edgecombe Acked-by: David Hildenbrand Reviewed-by: Mike Rapoport (IBM) --- Hi Non-x86 Arch=E2=80=99s, x86 has a feature that allows for the creation of a special type of writable memory (shadow stack) that is only writable in limited specific ways. Previously, changes were proposed to core MM code to teach it to decide when to create normally writable memory or the special shadow stack writable memory, but David Hildenbrand suggested[0] to change pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be moved into x86 code. Later Linus suggested a less error-prone way[1] to go about this after the first attempt had a bug. Since pXX_mkwrite() is defined in every arch, it requires some tree-wide changes. So that is why you are seeing some patches out of a big x86 series pop up in your arch mailing list. There is no functional change. After this refactor, the shadow stack series goes on to use the arch helpers to push arch memory details inside arch/x86 and other arch's with upcoming shadow stack features. Testing was just 0-day build testing. Hopefully that is enough context. Thanks! [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redha= t.com/ [1] https://lore.kernel.org/lkml/CAHk-=3DwiZjSu7c9sFYZb3q04108stgHff2wfbokG= CCgW7riz+8Q@mail.gmail.com/ --- arch/arm64/mm/trans_pgd.c | 4 ++-- arch/s390/mm/pageattr.c | 4 ++-- arch/x86/xen/mmu_pv.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c index 4ea2eefbc053..a01493f3a06f 100644 --- a/arch/arm64/mm/trans_pgd.c +++ b/arch/arm64/mm/trans_pgd.c @@ -40,7 +40,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, u= nsigned long addr) * read only (code, rodata). Clear the RDONLY bit from * the temporary mappings we use during restore. */ - set_pte(dst_ptep, pte_mkwrite(pte)); + set_pte(dst_ptep, pte_mkwrite_novma(pte)); } else if (debug_pagealloc_enabled() && !pte_none(pte)) { /* * debug_pagealloc will removed the PTE_VALID bit if @@ -53,7 +53,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, u= nsigned long addr) */ BUG_ON(!pfn_valid(pte_pfn(pte))); =20 - set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte))); + set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte))); } } =20 diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c index 5ba3bd8a7b12..6931d484d8a7 100644 --- a/arch/s390/mm/pageattr.c +++ b/arch/s390/mm/pageattr.c @@ -97,7 +97,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr= , unsigned long end, if (flags & SET_MEMORY_RO) new =3D pte_wrprotect(new); else if (flags & SET_MEMORY_RW) - new =3D pte_mkwrite(pte_mkdirty(new)); + new =3D pte_mkwrite_novma(pte_mkdirty(new)); if (flags & SET_MEMORY_NX) new =3D set_pte_bit(new, __pgprot(_PAGE_NOEXEC)); else if (flags & SET_MEMORY_X) @@ -155,7 +155,7 @@ static void modify_pmd_page(pmd_t *pmdp, unsigned long = addr, if (flags & SET_MEMORY_RO) new =3D pmd_wrprotect(new); else if (flags & SET_MEMORY_RW) - new =3D pmd_mkwrite(pmd_mkdirty(new)); + new =3D pmd_mkwrite_novma(pmd_mkdirty(new)); if (flags & SET_MEMORY_NX) new =3D set_pmd_bit(new, __pgprot(_SEGMENT_ENTRY_NOEXEC)); else if (flags & SET_MEMORY_X) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index b3b8d289b9ab..63fced067057 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -150,7 +150,7 @@ void make_lowmem_page_readwrite(void *vaddr) if (pte =3D=3D NULL) return; /* vaddr missing */ =20 - ptev =3D pte_mkwrite(*pte); + ptev =3D pte_mkwrite_novma(*pte); =20 if (HYPERVISOR_update_va_mapping(address, ptev, 0)) BUG(); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 19A11CA9EAB for ; Tue, 13 Jun 2023 00:12:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238785AbjFMAM1 (ORCPT ); Mon, 12 Jun 2023 20:12:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51758 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229454AbjFMAMN (ORCPT ); Mon, 12 Jun 2023 20:12:13 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D90FE197; Mon, 12 Jun 2023 17:12:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615132; x=1718151132; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=p3wIDF2hCv3eUQbed5EBHWb3M5fOtv++6fox2OA9QBk=; b=WEe03L+dip8DerGbMMAGDK4Ow/Xw3dy+cm3Ti1gMWsu0r8R69pWiUPgo yHvJlcXbTuopsQg2EcCGEDgz70Kdl9QU+6uBcXFqur2ZIud2W2/HSF2L8 GJF34qaIjUY1hH2/iOaPR2Ra5fqOnTyKPQD7My8Al5sqDtO6B/U5kiZ2p br+WLfaTzzdY3gmGGdF3C4RicDik8cpHfEjxNDgBRUMWq7J2sye/PfL4+ wBzTIn1f9weZPypHBlSu1KwlfZ8/6I2kqDTZ04buk+VO2LbNe0/fopnEX Ev26Em8f0ik2wR0lsoFrX0bOhClFj9zptlCu3RsbTBmwGt9GpsHX/3lyi Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556717" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556717" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:09 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670975" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670975" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:08 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com Subject: [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Date: Mon, 12 Jun 2023 17:10:29 -0700 Message-Id: <20230613001108.3040476-4-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable memory or shadow stack memory. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. In a previous patches, pte_mkwrite() was renamed pte_mkwrite_novma() and callers that don't have a VMA were changed to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand Signed-off-by: Rick Edgecombe Acked-by: David Hildenbrand Reviewed-by: Mike Rapoport (IBM) --- Documentation/mm/arch_pgtable_helpers.rst | 6 ++++-- include/linux/mm.h | 2 +- include/linux/pgtable.h | 4 ++-- mm/debug_vm_pgtable.c | 12 ++++++------ mm/huge_memory.c | 10 +++++----- mm/memory.c | 4 ++-- mm/migrate.c | 2 +- mm/migrate_device.c | 2 +- mm/mprotect.c | 2 +- mm/userfaultfd.c | 2 +- 10 files changed, 24 insertions(+), 22 deletions(-) diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/a= rch_pgtable_helpers.rst index 69ce1f2aa4d1..c82e3ee20e51 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -46,7 +46,8 @@ PTE Page Table Helpers +---------------------------+---------------------------------------------= -----+ | pte_mkclean | Creates a clean PTE = | +---------------------------+---------------------------------------------= -----+ -| pte_mkwrite | Creates a writable PTE = | +| pte_mkwrite | Creates a writable PTE of the type specified= by | +| | the VMA. = | +---------------------------+---------------------------------------------= -----+ | pte_mkwrite_novma | Creates a writable PTE, of the conventional = type | | | of writable. = | @@ -121,7 +122,8 @@ PMD Page Table Helpers +---------------------------+---------------------------------------------= -----+ | pmd_mkclean | Creates a clean PMD = | +---------------------------+---------------------------------------------= -----+ -| pmd_mkwrite | Creates a writable PMD = | +| pmd_mkwrite | Creates a writable PMD of the type specified= by | +| | the VMA. = | +---------------------------+---------------------------------------------= -----+ | pmd_mkwrite_novma | Creates a writable PMD, of the conventional = type | | | of writable. = | diff --git a/include/linux/mm.h b/include/linux/mm.h index 27ce77080c79..43701bf223d3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1284,7 +1284,7 @@ void free_compound_page(struct page *page); static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) { if (likely(vma->vm_flags & VM_WRITE)) - pte =3D pte_mkwrite(pte); + pte =3D pte_mkwrite(pte, vma); return pte; } =20 diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index ae271a307584..0f3cf726812a 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -508,14 +508,14 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_str= uct *vma, #endif =20 #ifndef pte_mkwrite -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) { return pte_mkwrite_novma(pte); } #endif =20 #if defined(CONFIG_HAS_HUGE_PAGE) && !defined(pmd_mkwrite) -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { return pmd_mkwrite_novma(pmd); } diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index c54177aabebd..107e293904d3 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -109,10 +109,10 @@ static void __init pte_basic_tests(struct pgtable_deb= ug_args *args, int idx) WARN_ON(!pte_same(pte, pte)); WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte)))); WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte)))); - WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte)))); + WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma))); WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte)))); WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte)))); - WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte)))); + WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma)))); WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte)))); WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte)))); } @@ -153,7 +153,7 @@ static void __init pte_advanced_tests(struct pgtable_de= bug_args *args) pte =3D pte_mkclean(pte); set_pte_at(args->mm, args->vaddr, args->ptep, pte); flush_dcache_page(page); - pte =3D pte_mkwrite(pte); + pte =3D pte_mkwrite(pte, args->vma); pte =3D pte_mkdirty(pte); ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1); pte =3D ptep_get(args->ptep); @@ -199,10 +199,10 @@ static void __init pmd_basic_tests(struct pgtable_deb= ug_args *args, int idx) WARN_ON(!pmd_same(pmd, pmd)); WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd)))); WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd)))); - WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd)))); + WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma))); WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd)))); WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd)))); - WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd)))); + WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma)))); WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd)))); WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd)))); /* @@ -253,7 +253,7 @@ static void __init pmd_advanced_tests(struct pgtable_de= bug_args *args) pmd =3D pmd_mkclean(pmd); set_pmd_at(args->mm, vaddr, args->pmdp, pmd); flush_dcache_page(page); - pmd =3D pmd_mkwrite(pmd); + pmd =3D pmd_mkwrite(pmd, args->vma); pmd =3D pmd_mkdirty(pmd); pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1); pmd =3D READ_ONCE(*args->pmdp); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 624671aaa60d..37dd56b7b3d1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -551,7 +551,7 @@ __setup("transparent_hugepage=3D", setup_transparent_hu= gepage); pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { if (likely(vma->vm_flags & VM_WRITE)) - pmd =3D pmd_mkwrite(pmd); + pmd =3D pmd_mkwrite(pmd, vma); return pmd; } =20 @@ -1572,7 +1572,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) pmd =3D pmd_modify(oldpmd, vma->vm_page_prot); pmd =3D pmd_mkyoung(pmd); if (writable) - pmd =3D pmd_mkwrite(pmd); + pmd =3D pmd_mkwrite(pmd, vma); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); spin_unlock(vmf->ptl); @@ -1924,7 +1924,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, /* See change_pte_range(). */ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) && can_change_pmd_writable(vma, addr, entry)) - entry =3D pmd_mkwrite(entry); + entry =3D pmd_mkwrite(entry, vma); =20 ret =3D HPAGE_PMD_NR; set_pmd_at(mm, addr, pmd, entry); @@ -2234,7 +2234,7 @@ static void __split_huge_pmd_locked(struct vm_area_st= ruct *vma, pmd_t *pmd, } else { entry =3D mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); if (write) - entry =3D pte_mkwrite(entry); + entry =3D pte_mkwrite(entry, vma); if (anon_exclusive) SetPageAnonExclusive(page + i); if (!young) @@ -3271,7 +3271,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk= *pvmw, struct page *new) if (pmd_swp_soft_dirty(*pvmw->pmd)) pmde =3D pmd_mksoft_dirty(pmde); if (is_writable_migration_entry(entry)) - pmde =3D pmd_mkwrite(pmde); + pmde =3D pmd_mkwrite(pmde, vma); if (pmd_swp_uffd_wp(*pvmw->pmd)) pmde =3D pmd_mkuffd_wp(pmde); if (!is_migration_entry_young(entry)) diff --git a/mm/memory.c b/mm/memory.c index f69fbc251198..c1b6fe944c20 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4100,7 +4100,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *= vmf) entry =3D mk_pte(&folio->page, vma->vm_page_prot); entry =3D pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) - entry =3D pte_mkwrite(pte_mkdirty(entry)); + entry =3D pte_mkwrite(pte_mkdirty(entry), vma); =20 vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); @@ -4796,7 +4796,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) pte =3D pte_modify(old_pte, vma->vm_page_prot); pte =3D pte_mkyoung(pte); if (writable) - pte =3D pte_mkwrite(pte); + pte =3D pte_mkwrite(pte, vma); ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); update_mmu_cache(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); diff --git a/mm/migrate.c b/mm/migrate.c index 01cac26a3127..8b46b722f1a4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -219,7 +219,7 @@ static bool remove_migration_pte(struct folio *folio, if (folio_test_dirty(folio) && is_migration_entry_dirty(entry)) pte =3D pte_mkdirty(pte); if (is_writable_migration_entry(entry)) - pte =3D pte_mkwrite(pte); + pte =3D pte_mkwrite(pte, vma); else if (pte_swp_uffd_wp(*pvmw.pte)) pte =3D pte_mkuffd_wp(pte); =20 diff --git a/mm/migrate_device.c b/mm/migrate_device.c index d30c9de60b0d..df3f5e9d5f76 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -646,7 +646,7 @@ static void migrate_vma_insert_page(struct migrate_vma = *migrate, } entry =3D mk_pte(page, vma->vm_page_prot); if (vma->vm_flags & VM_WRITE) - entry =3D pte_mkwrite(pte_mkdirty(entry)); + entry =3D pte_mkwrite(pte_mkdirty(entry), vma); } =20 ptep =3D pte_offset_map_lock(mm, pmdp, addr, &ptl); diff --git a/mm/mprotect.c b/mm/mprotect.c index 92d3d3ca390a..afdb6723782e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -198,7 +198,7 @@ static long change_pte_range(struct mmu_gather *tlb, if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pte_write(ptent) && can_change_pte_writable(vma, addr, ptent)) - ptent =3D pte_mkwrite(ptent); + ptent =3D pte_mkwrite(ptent, vma); =20 ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); if (pte_needs_flush(oldpte, ptent)) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e97a0b4889fc..6dea7f57026e 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -72,7 +72,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, if (page_in_cache && !vm_shared) writable =3D false; if (writable) - _dst_pte =3D pte_mkwrite(_dst_pte); + _dst_pte =3D pte_mkwrite(_dst_pte, dst_vma); if (flags & MFILL_ATOMIC_WP) _dst_pte =3D pte_mkuffd_wp(_dst_pte); =20 --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21BA5C88CBB for ; Tue, 13 Jun 2023 00:12:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232473AbjFMAMa (ORCPT ); Mon, 12 Jun 2023 20:12:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51768 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229869AbjFMAMO (ORCPT ); Mon, 12 Jun 2023 20:12:14 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F095B19B; Mon, 12 Jun 2023 17:12:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615132; x=1718151132; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tDE8EspiBAdPvpd4BkDoYawjxYWlY+HjRp9EjAaWS/U=; b=fCT1br1C/pf5ZkfVJejD1weJCG6FHypSMBy2efglxfDp/2Ck+CUXiu0w BWztVo6lEsqrwDqrk6NdNd+hEJ7eVUyqGQzMROCLsYGWAAxVEEscz8mFP W4xqPTRhtdBZbCsdi46QAwc6rqDCaMDVtVAtQKH9qLR/uL9of1knlk6jw AFInAMvK7XEM8JvWuo8TwEdq5JMArbChQGmUB35QPR75h+FUNeANvB+12 jo5riqGU+yD3LAS0LtDs4Q9tBqYd2FmYczfnBueV4otLx9Wa5q4aZJWhm To1xgjw/mRRudqf14CYQU1Zzgg9rwl1w8MWktgr2J5lgxwBJAOtVjubYR g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556728" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556728" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:09 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670979" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670979" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:08 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Peter Collingbourne , Pengfei Xu Subject: [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Date: Mon, 12 Jun 2023 17:10:30 -0700 Message-Id: <20230613001108.3040476-5-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Yu-cheng Yu There was no more caller passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by: commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_p= goff()"). There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap(). Signed-off-by: Yu-cheng Yu Co-developed-by: Rick Edgecombe Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Peter Collingbourne Reviewed-by: Kees Cook Reviewed-by: Kirill A. Shutemov Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Acked-by: David Hildenbrand Reviewed-by: Mark Brown Tested-by: Mark Brown --- fs/aio.c | 2 +- include/linux/mm.h | 3 ++- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- mm/util.c | 2 +- 6 files changed, 12 insertions(+), 11 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index b0b17bd098bb..4a7576989719 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -558,7 +558,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned = int nr_events) =20 ctx->mmap_base =3D do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, - MAP_SHARED, 0, &unused, NULL); + MAP_SHARED, 0, 0, &unused, NULL); mmap_write_unlock(mm); if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size =3D 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 43701bf223d3..9ec20cbb20c1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3133,7 +3133,8 @@ extern unsigned long mmap_region(struct file *file, u= nsigned long addr, struct list_head *uf); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, - unsigned long pgoff, unsigned long *populate, struct list_head *uf); + vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, + struct list_head *uf); extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool downgrade); diff --git a/ipc/shm.c b/ipc/shm.c index 60e45e7045d4..576a543b7cff 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int sh= mflg, goto invalid; } =20 - addr =3D do_mmap(file, addr, size, prot, flags, 0, &populate, NULL); + addr =3D do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL); *raddr =3D addr; err =3D 0; if (IS_ERR_VALUE(addr)) diff --git a/mm/mmap.c b/mm/mmap.c index 13678edaa22c..afdf5f78432b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1221,11 +1221,11 @@ static inline bool file_mmap_ok(struct file *file, = struct inode *inode, */ unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - unsigned long flags, unsigned long pgoff, - unsigned long *populate, struct list_head *uf) + unsigned long flags, vm_flags_t vm_flags, + unsigned long pgoff, unsigned long *populate, + struct list_head *uf) { struct mm_struct *mm =3D current->mm; - vm_flags_t vm_flags; int pkey =3D 0; =20 validate_mm(mm); @@ -1286,7 +1286,7 @@ unsigned long do_mmap(struct file *file, unsigned lon= g addr, * to. we assume access permissions have been handled by the open * of the memory object, so we don't do any here. */ - vm_flags =3D calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | + vm_flags |=3D calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; =20 if (flags & MAP_LOCKED) @@ -2903,7 +2903,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, star= t, unsigned long, size, =20 file =3D get_file(vma->vm_file); ret =3D do_mmap(vma->vm_file, start, size, - prot, flags, pgoff, &populate, NULL); + prot, flags, 0, pgoff, &populate, NULL); fput(file); out: mmap_write_unlock(mm); diff --git a/mm/nommu.c b/mm/nommu.c index f670d9979a26..138826c4a872 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1002,6 +1002,7 @@ unsigned long do_mmap(struct file *file, unsigned long len, unsigned long prot, unsigned long flags, + vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf) @@ -1009,7 +1010,6 @@ unsigned long do_mmap(struct file *file, struct vm_area_struct *vma; struct vm_region *region; struct rb_node *rb; - vm_flags_t vm_flags; unsigned long capabilities, result; int ret; VMA_ITERATOR(vmi, current->mm, 0); @@ -1029,7 +1029,7 @@ unsigned long do_mmap(struct file *file, =20 /* we've determined that we can make the mapping, now translate what we * now know into VMA flags */ - vm_flags =3D determine_vm_flags(file, prot, flags, capabilities); + vm_flags |=3D determine_vm_flags(file, prot, flags, capabilities); =20 =20 /* we're going to need to record the mapping */ diff --git a/mm/util.c b/mm/util.c index dd12b9531ac4..8e7fc6cacab4 100644 --- a/mm/util.c +++ b/mm/util.c @@ -540,7 +540,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned= long addr, if (!ret) { if (mmap_write_lock_killable(mm)) return -EINTR; - ret =3D do_mmap(file, addr, len, prot, flag, pgoff, &populate, + ret =3D do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate, &uf); mmap_write_unlock(mm); userfaultfd_unmap_complete(mm, &uf); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B544BCA9EA3 for ; Tue, 13 Jun 2023 00:12:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238657AbjFMAMl (ORCPT ); Mon, 12 Jun 2023 20:12:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237807AbjFMAMW (ORCPT ); Mon, 12 Jun 2023 20:12:22 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B9A6A19F; Mon, 12 Jun 2023 17:12:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615133; x=1718151133; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HsIyEUzZjyBgsbaXIFC6qpwhsPOWh1g+KPlkVPDuH8M=; b=U8wH8LhsGtKZUHBcEkduZ4nQxDxYGp/RRA3NNcR0f538h25SL7Kazd8w W1LeUkIXENSJmrPyyf4P6GPVl4Nvr7mlFLUAQ4ENyzwhEK0QKP4GHEnbx duiqB68RpY9NYuyBMPKn8eOVu7k1axQoe0+lINjv7xq0re9LIYB5CMASC YNjB4pSNEUWN3DgYyaZXSBb+6eRTjA8TDAx0MIQ9hvBEH4KM25KaHrLXC jck7ynSAZwsWBoOpIEIOx+4M8VaeN3tmAAqRrCna8Z1DMFuzGGCv9xQDN eUnd7gK8NlW20M06T0Qj6VGl5rHUh2kD/yn33VZ/BTS+CFdUJKPYTa9Kn A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556752" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556752" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:10 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670984" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670984" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:09 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Axel Rasmussen , Peter Xu , Pengfei Xu Subject: [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Date: Mon, 12 Jun 2023 17:10:31 -0700 Message-Id: <20230613001108.3040476-6-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Yu-cheng Yu The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. Future patches will introduce a new VM flag VM_SHADOW_STACK that will be VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_1 through VM_HIGH_ARCH_BIT_4 are bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake of order, make all VM_HIGH_ARCH_BITs stay together by moving VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be introduced as 37. Signed-off-by: Yu-cheng Yu Co-developed-by: Rick Edgecombe Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Axel Rasmussen Acked-by: Mike Rapoport (IBM) Acked-by: Peter Xu Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Reviewed-by: David Hildenbrand --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 9ec20cbb20c1..6f52c1e7c640 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -370,7 +370,7 @@ extern unsigned int kobjsize(const void *objp); #endif =20 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR -# define VM_UFFD_MINOR_BIT 37 +# define VM_UFFD_MINOR_BIT 38 # define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */ #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ # define VM_UFFD_MINOR VM_NONE --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2400CC7EE43 for ; Tue, 13 Jun 2023 00:13:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239000AbjFMANS (ORCPT ); Mon, 12 Jun 2023 20:13:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51904 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238475AbjFMAMW (ORCPT ); Mon, 12 Jun 2023 20:12:22 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 43A79171C; Mon, 12 Jun 2023 17:12:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615134; x=1718151134; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VIRIy8IvN5ZfY5u3MVh2ThurjhiZOJC7RcXup7TqiYM=; b=J5dsVrvsQoEiGXjU2/F2mZxHX9Wtgd8eR1tZAx8sTiWaU7g0Nl12dqYK WC3qIDwv9pTqoHO0SUKRtQbFVgJJXVmXRdQq776vDZ4XBTtUinn7yXkpX ERI5enbupURzIofkxsAzQJggymoC2YwNLTDDS+h4d2p1nfMoSJCBZc/Kg eavVqt6vEYuEaRlEVK9WIqidIFBhMoFKjW197xXJdbTN5Yq4FpZ4taULP CCWpLI0rtXeHOqMTe1e4WkxVbWxAAihvoerV/Tg1OhCah5+fCvni+h7rs 76q2oZ2nMw/rCfj7YwiGzBlGan0ehh2rHw5r0hr9f0ABIGqfzZ0iSoLJv Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556774" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556774" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:11 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670988" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670988" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:10 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 06/42] x86/shstk: Add Kconfig option for shadow stack Date: Mon, 12 Jun 2023 17:10:32 -0700 Message-Id: <20230613001108.3040476-7-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Shadow stack provides protection for applications against function return address corruption. It is active when the processor supports it, the kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built for the feature. This is only implemented for the 64-bit kernel. When it is enabled, legacy non-shadow stack applications continue to work, but without protection. Since there is another feature that utilizes CET (Kernel IBT) that will share implementation with shadow stacks, create CONFIG_CET to signify that at least one CET feature is configured. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/Kconfig | 24 ++++++++++++++++++++++++ arch/x86/Kconfig.assembler | 5 +++++ 2 files changed, 29 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 53bab123a8ee..ce460d6b4e25 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1852,6 +1852,11 @@ config CC_HAS_IBT (CC_IS_CLANG && CLANG_VERSION >=3D 140000)) && \ $(as-instr,endbr64) =20 +config X86_CET + def_bool n + help + CET features configured (Shadow stack or IBT) + config X86_KERNEL_IBT prompt "Indirect Branch Tracking" def_bool y @@ -1859,6 +1864,7 @@ config X86_KERNEL_IBT # https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc2= 31f9e579f2d0f depends on !LD_IS_LLD || LLD_VERSION >=3D 140000 select OBJTOOL + select X86_CET help Build the kernel with support for Indirect Branch Tracking, a hardware support course-grain forward-edge Control Flow Integrity @@ -1952,6 +1958,24 @@ config X86_SGX =20 If unsure, say N. =20 +config X86_USER_SHADOW_STACK + bool "X86 userspace shadow stack" + depends on AS_WRUSS + depends on X86_64 + select ARCH_USES_HIGH_VMA_FLAGS + select X86_CET + help + Shadow stack protection is a hardware feature that detects function + return address corruption. This helps mitigate ROP attacks. + Applications must be enabled to use it, and old userspace does not + get protection "for free". + + CPUs supporting shadow stacks were first released in 2020. + + See Documentation/x86/shstk.rst for more information. + + If unsure, say N. + config EFI bool "EFI runtime service support" depends on ACPI diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler index b88f784cb02e..8ad41da301e5 100644 --- a/arch/x86/Kconfig.assembler +++ b/arch/x86/Kconfig.assembler @@ -24,3 +24,8 @@ config AS_GFNI def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2) help Supported by binutils >=3D 2.30 and LLVM integrated assembler + +config AS_WRUSS + def_bool $(as-instr,wrussq %rax$(comma)(%rbx)) + help + Supported by binutils >=3D 2.31 and LLVM integrated assembler --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9794C88CB5 for ; Tue, 13 Jun 2023 00:12:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238621AbjFMAMo (ORCPT ); Mon, 12 Jun 2023 20:12:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51916 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238496AbjFMAMX (ORCPT ); Mon, 12 Jun 2023 20:12:23 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 08EED1726; Mon, 12 Jun 2023 17:12:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615136; x=1718151136; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=1T/qgZrUQhYBEmir0bddzuZOwE3WskkI/v/mGEi34h4=; b=R3/b+BSDam0NqDlAT1HhuUrsopGL40H9zZf3quXKnqZM8STTQFcNglZ/ 9GVzOcJb/LA1ZIXm2HoBRfGZP69oQEndy38eukFBmj5Xw22YEHG3fWV2G V5Grxd7Uqp7qmOALzhjSEIIEKr0Kod1F0p3vGVgK22hn/fZ7UDhYX+wJI RsI+XOuVWRl5V+fPTnY9sYcNONw0hBPOSXta27XwFUrM3qvTHfebL/DXu EOq5y0Y5MPn25nC3rAQWSNogSRfcxNYmDQsiE9OyTagnSCXOLMyidMPHH wnMwndWjUKzgkuW/uaaketD1g4GXc2dRny1y/ZokKMefHTXN/l7Uxw9y7 w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556815" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556815" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:15 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670993" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670993" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:11 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 07/42] x86/traps: Move control protection handler to separate file Date: Mon, 12 Jun 2023 17:10:33 -0700 Message-Id: <20230613001108.3040476-8-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Today the control protection handler is defined in traps.c and used only for the kernel IBT feature. To reduce ifdeffery, move it to it's own file. In future patches, functionality will be added to make this handler also handle user shadow stack faults. So name the file cet.c. No functional change. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/kernel/Makefile | 2 ++ arch/x86/kernel/cet.c | 76 ++++++++++++++++++++++++++++++++++++++++ arch/x86/kernel/traps.c | 75 --------------------------------------- 3 files changed, 78 insertions(+), 75 deletions(-) create mode 100644 arch/x86/kernel/cet.c diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 4070a01c11b7..abee0564b750 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -145,6 +145,8 @@ obj-$(CONFIG_CFI_CLANG) +=3D cfi.o =20 obj-$(CONFIG_CALL_THUNKS) +=3D callthunks.o =20 +obj-$(CONFIG_X86_CET) +=3D cet.o + ### # 64 bit specific files ifeq ($(CONFIG_X86_64),y) diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c new file mode 100644 index 000000000000..7ad22b705b64 --- /dev/null +++ b/arch/x86/kernel/cet.c @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include + +static __ro_after_init bool ibt_fatal =3D true; + +extern void ibt_selftest_ip(void); /* code label defined in asm below */ + +enum cp_error_code { + CP_EC =3D (1 << 15) - 1, + + CP_RET =3D 1, + CP_IRET =3D 2, + CP_ENDBR =3D 3, + CP_RSTRORSSP =3D 4, + CP_SETSSBSY =3D 5, + + CP_ENCL =3D 1 << 15, +}; + +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection) +{ + if (!cpu_feature_enabled(X86_FEATURE_IBT)) { + pr_err("Unexpected #CP\n"); + BUG(); + } + + if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) !=3D CP_ENDBR)) + return; + + if (unlikely(regs->ip =3D=3D (unsigned long)&ibt_selftest_ip)) { + regs->ax =3D 0; + return; + } + + pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs)); + if (!ibt_fatal) { + printk(KERN_DEFAULT CUT_HERE); + __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL); + return; + } + BUG(); +} + +/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */ +noinline bool ibt_selftest(void) +{ + unsigned long ret; + + asm (" lea ibt_selftest_ip(%%rip), %%rax\n\t" + ANNOTATE_RETPOLINE_SAFE + " jmp *%%rax\n\t" + "ibt_selftest_ip:\n\t" + UNWIND_HINT_FUNC + ANNOTATE_NOENDBR + " nop\n\t" + + : "=3Da" (ret) : : "memory"); + + return !ret; +} + +static int __init ibt_setup(char *str) +{ + if (!strcmp(str, "off")) + setup_clear_cpu_cap(X86_FEATURE_IBT); + + if (!strcmp(str, "warn")) + ibt_fatal =3D false; + + return 1; +} + +__setup("ibt=3D", ibt_setup); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 58b1f208eff5..6f666dfa97de 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -213,81 +213,6 @@ DEFINE_IDTENTRY(exc_overflow) do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL); } =20 -#ifdef CONFIG_X86_KERNEL_IBT - -static __ro_after_init bool ibt_fatal =3D true; - -extern void ibt_selftest_ip(void); /* code label defined in asm below */ - -enum cp_error_code { - CP_EC =3D (1 << 15) - 1, - - CP_RET =3D 1, - CP_IRET =3D 2, - CP_ENDBR =3D 3, - CP_RSTRORSSP =3D 4, - CP_SETSSBSY =3D 5, - - CP_ENCL =3D 1 << 15, -}; - -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection) -{ - if (!cpu_feature_enabled(X86_FEATURE_IBT)) { - pr_err("Unexpected #CP\n"); - BUG(); - } - - if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) !=3D CP_ENDBR)) - return; - - if (unlikely(regs->ip =3D=3D (unsigned long)&ibt_selftest_ip)) { - regs->ax =3D 0; - return; - } - - pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs)); - if (!ibt_fatal) { - printk(KERN_DEFAULT CUT_HERE); - __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL); - return; - } - BUG(); -} - -/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */ -noinline bool ibt_selftest(void) -{ - unsigned long ret; - - asm (" lea ibt_selftest_ip(%%rip), %%rax\n\t" - ANNOTATE_RETPOLINE_SAFE - " jmp *%%rax\n\t" - "ibt_selftest_ip:\n\t" - UNWIND_HINT_FUNC - ANNOTATE_NOENDBR - " nop\n\t" - - : "=3Da" (ret) : : "memory"); - - return !ret; -} - -static int __init ibt_setup(char *str) -{ - if (!strcmp(str, "off")) - setup_clear_cpu_cap(X86_FEATURE_IBT); - - if (!strcmp(str, "warn")) - ibt_fatal =3D false; - - return 1; -} - -__setup("ibt=3D", ibt_setup); - -#endif /* CONFIG_X86_KERNEL_IBT */ - #ifdef CONFIG_X86_F00F_BUG void handle_invalid_op(struct pt_regs *regs) #else --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D91FEC88CB5 for ; Tue, 13 Jun 2023 00:12:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238277AbjFMAM5 (ORCPT ); Mon, 12 Jun 2023 20:12:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238629AbjFMAMY (ORCPT ); Mon, 12 Jun 2023 20:12:24 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BC0B01981; Mon, 12 Jun 2023 17:12:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615138; x=1718151138; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=sARsUThnVYVYYqaKHA46gpgv7OakP1T/BusMzOfelkg=; b=YwIeoSXkzRjb3keq9ckb0SycQRpJv0oOyV6yZB2MS/xN57lRQ2dT7MC0 0rAty46XsY3JhLO22NCzJQrDT/gwW4qJgh6o6XAeGNjUA1JpJobNMwnnk hGARQxAjxORBPNQlyUdD/uQM9Q9yc34v/SD1tadyB2kTp5nEKN3NokHXC wGCkkgYsrb5QxjBKyPAECjnfOZeuIriytnqUxemsd2mdH7ZE0wz3YFmSd zh8bONtzu7dOVuZe2e2EzqIKCgxZPNJbfs0JJ3+xgJRsL+qegCZ42oZB2 ubFwRFS0w0hjc89e18lJ7ytUU61X6oIyOgBraGc/OddEiFhxt+58QlXXv A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556836" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556836" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:15 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835670996" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835670996" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:13 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 08/42] x86/cpufeatures: Add CPU feature flags for shadow stacks Date: Mon, 12 Jun 2023 17:10:34 -0700 Message-Id: <20230613001108.3040476-9-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The Control-Flow Enforcement Technology contains two related features, one of which is Shadow Stacks. Future patches will utilize this feature for shadow stack support in KVM, so add a CPU feature flags for Shadow Stacks (CPUID.(EAX=3D7,ECX=3D0):ECX[bit 7]). To protect shadow stack state from malicious modification, the registers are only accessible in supervisor mode. This implementation context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend on XSAVES. The shadow stack feature, enumerated by the CPUID bit described above, encompasses both supervisor and userspace support for shadow stack. In near future patches, only userspace shadow stack will be enabled. In expectation of future supervisor shadow stack support, create a software CPU capability to enumerate kernel utilization of userspace shadow stack support. This user shadow stack bit should depend on the HW "shstk" capability and that logic will be implemented in future patches. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/cpufeatures.h | 2 ++ arch/x86/include/asm/disabled-features.h | 8 +++++++- arch/x86/kernel/cpu/cpuid-deps.c | 1 + 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpuf= eatures.h index cb8ca46213be..d7215c8b7923 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -308,6 +308,7 @@ #define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel= ) implemented */ #define X86_FEATURE_SMBA (11*32+21) /* "" Slow Memory Bandwidth Allocatio= n */ #define X86_FEATURE_BMEC (11*32+22) /* "" Bandwidth Monitoring Event Conf= iguration */ +#define X86_FEATURE_USER_SHSTK (11*32+23) /* Shadow stack support for use= r mode applications */ =20 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */ @@ -380,6 +381,7 @@ #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */ #define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instruct= ions */ #define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bi= t Manipulation Instructions */ +#define X86_FEATURE_SHSTK (16*32+ 7) /* "" Shadow stack */ #define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */ #define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */ #define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Do= uble Quadword */ diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/as= m/disabled-features.h index fafe9be7a6f4..b9c7eae2e70f 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -105,6 +105,12 @@ # define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31)) #endif =20 +#ifdef CONFIG_X86_USER_SHADOW_STACK +#define DISABLE_USER_SHSTK 0 +#else +#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31)) +#endif + /* * Make sure to add features to the correct mask */ @@ -120,7 +126,7 @@ #define DISABLED_MASK9 (DISABLE_SGX) #define DISABLED_MASK10 0 #define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \ - DISABLE_CALL_DEPTH_TRACKING) + DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK) #define DISABLED_MASK12 (DISABLE_LAM) #define DISABLED_MASK13 0 #define DISABLED_MASK14 0 diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-d= eps.c index f6748c8bd647..e462c1d3800a 100644 --- a/arch/x86/kernel/cpu/cpuid-deps.c +++ b/arch/x86/kernel/cpu/cpuid-deps.c @@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] =3D { { X86_FEATURE_XFD, X86_FEATURE_XSAVES }, { X86_FEATURE_XFD, X86_FEATURE_XGETBV1 }, { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD }, + { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES }, {} }; =20 --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 638D6C88CB9 for ; Tue, 13 Jun 2023 00:12:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233768AbjFMAMx (ORCPT ); Mon, 12 Jun 2023 20:12:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51902 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238592AbjFMAMX (ORCPT ); Mon, 12 Jun 2023 20:12:23 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A8371173F; Mon, 12 Jun 2023 17:12:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615138; x=1718151138; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YP96Yn7FdNN68fPWKmeUZZw0SU1yEnE0kYU9/y2Lvpc=; b=QefwEUnAhkJ5xF6fSIomS1+EFsRVdKxpgTDF6YIqTUjTxp1QLWoEDZgU Ut4C46TN409YUgx2wYW23j+eUexqKg+lgZzfvmE1z5fR36mxceMNqFHHq a2IPzXlEkksvNKvkuKDC69SHpzFYFoubFjXwZ8oFHQ6hxiR+Eto7Vlws5 cyhcqXPmQa+ztulNzwkrRbGsymXV79MYeiJ67J1jNEe0DyXMyOM14UddW 2NzNIrPVGdzrsC+Uaed8ONPncvdhQMqVyByAWkI041ZXXYVwxDBbVq20V 7JMZsORUM1yMQGT07MmegA2/NprzxMZBvlwfKQrtsPbdEjWycVyJPuelV w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556844" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556844" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:15 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671000" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671000" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:14 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 09/42] x86/mm: Move pmd_write(), pud_write() up in the file Date: Mon, 12 Jun 2023 17:10:35 -0700 Message-Id: <20230613001108.3040476-10-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" To prepare the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and pud_write() up in the file, so that they can be used by other helpers below. No functional changes. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Kirill A. Shutemov Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 112e6060eafa..768ee46782c9 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -160,6 +160,18 @@ static inline int pte_write(pte_t pte) return pte_flags(pte) & _PAGE_RW; } =20 +#define pmd_write pmd_write +static inline int pmd_write(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_RW; +} + +#define pud_write pud_write +static inline int pud_write(pud_t pud) +{ + return pud_flags(pud) & _PAGE_RW; +} + static inline int pte_huge(pte_t pte) { return pte_flags(pte) & _PAGE_PSE; @@ -1120,12 +1132,6 @@ extern int pmdp_clear_flush_young(struct vm_area_str= uct *vma, unsigned long address, pmd_t *pmdp); =20 =20 -#define pmd_write pmd_write -static inline int pmd_write(pmd_t pmd) -{ - return pmd_flags(pmd) & _PAGE_RW; -} - #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned= long addr, pmd_t *pmdp) @@ -1155,12 +1161,6 @@ static inline void pmdp_set_wrprotect(struct mm_stru= ct *mm, clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp); } =20 -#define pud_write pud_write -static inline int pud_write(pud_t pud) -{ - return pud_flags(pud) & _PAGE_RW; -} - #ifndef pmdp_establish #define pmdp_establish pmdp_establish static inline pmd_t pmdp_establish(struct vm_area_struct *vma, --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DEF7EC7EE2E for ; Tue, 13 Jun 2023 00:13:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238993AbjFMANP (ORCPT ); Mon, 12 Jun 2023 20:13:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51940 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238728AbjFMAM0 (ORCPT ); Mon, 12 Jun 2023 20:12:26 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BF43F199C; Mon, 12 Jun 2023 17:12:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615140; x=1718151140; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=n6G5K0m9WwVLI5yH5jOOTWumHDLO47Q1gA3TW71PVjE=; b=loB2opO6O87gwVNZFycyLg4y6ZdtdfzU/qRjxnCAsaar9EBTCJhlX1Tp OnROA2J8qh+hxo9aqhakfYp4Yw1MC2URHudZd6tFjsmthGKQ1GO0khw5v VAvOscEm1j67n7YylYi/qBbnDLsREinH9Tkd57skPB7xdmeLnXCO3A3kc LI2jHOVokAd6dBHwQaZkvizEUdSWXc5Sbo+Y1W/MO2KgrfvYaUj2x//tA pRczdsFGjVGUH6ELMBo7LAKKNjJT1pA+4glbEwwrngOMKttjp9IjdXYg9 lK7GrcDS8cc+OEQMyd6iV+q/TPG7KBWgQC5OLk3Cs8h5awfz9OvLGwke+ g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556869" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556869" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:16 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671005" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671005" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:15 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Date: Mon, 12 Jun 2023 17:10:36 -0700 Message-Id: <20230613001108.3040476-11-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Some OSes have a greater dependence on software available bits in PTEs than Linux. That left the hardware architects looking for a way to represent a new memory type (shadow stack) within the existing bits. They chose to repurpose a lightly-used state: Write=3D0,Dirty=3D1. So in order to support shadow stack memory, Linux should avoid creating memory with this PTE bit combination unless it intends for it to be shadow stack. The reason it's lightly used is that Dirty=3D1 is normally set by HW _before_ a write. A write with a Write=3D0 PTE would typically only generate a fault, not set Dirty=3D1. Hardware can (rarely) both set Dirty=3D1 *and* generate the fault, resulting in a Write=3D0,Dirty=3D1 PTE. Hardware which supports shadow stacks will no longer exhibit this oddity. So that leaves Write=3D0,Dirty=3D1 PTEs created in software. To avoid inadvertently created shadow stack memory, in places where Linux normally creates Write=3D0,Dirty=3D1, it can use the software-defined _PAGE_SAVED_DI= RTY in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs to create Write=3D0,Dirty=3D1, it instead creates Write=3D0,SavedDirty=3D1 = except for shadow stack, which is Write=3D0,Dirty=3D1. There are six bits left available to software in the 64-bit PTE after consuming a bit for _PAGE_SAVED_DIRTY. For 32 bit, the same bit as _PAGE_BIT_UFFD_WP is used, since user fault fd is not supported on 32 bit. This leaves one unused software bit on 32 bit (_PAGE_BIT_SOFT_DIRTY, as this is also not supported on 32 bit). Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place. Since this SavedDirty shifting is done for all x86 CPUs, this leaves the possibility for the hardware oddity to still create Write=3D0,Dirty=3D1 PTEs in rare cases. Since these CPUs also don't support shadow stack, this will be harmless as it was before the introduction of SavedDirty. Implement the shifting logic to be branchless. Embed the logic of whether to do the shifting (including checking the Write bits) so that it can be called by future callers that would otherwise need additional branching logic. This efficiency allows the logic of when to do the shifting to be centralized, making the code easier to reason about. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Use bit shifting instead of conditionals (Linus) - Make saved dirty bit unconditional (Linus) - Add 32 bit support to make it extra unconditional - Don't re-order PAGE flags (Dave) --- arch/x86/include/asm/pgtable.h | 83 ++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable_types.h | 38 +++++++++++-- arch/x86/include/asm/tlbflush.h | 3 +- 3 files changed, 119 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 768ee46782c9..a95f872c7429 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -301,6 +301,53 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_= t clear) return native_make_pte(v & ~clear); } =20 +/* + * Write protection operations can result in Dirty=3D1,Write=3D0 PTEs. But= in the + * case of X86_FEATURE_USER_SHSTK, these PTEs denote shadow stack memory. = So + * when creating dirty, write-protected memory, a software bit is used: + * _PAGE_BIT_SAVED_DIRTY. The following functions take a PTE and transitio= n the + * Dirty bit to SavedDirty, and vice-vesra. + * + * This shifting is only done if needed. In the case of shifting + * Dirty->SavedDirty, the condition is if the PTE is Write=3D0. In the cas= e of + * shifting SavedDirty->Dirty, the condition is Write=3D1. + */ +static inline unsigned long mksaveddirty_shift(unsigned long v) +{ + unsigned long cond =3D !(v & (1 << _PAGE_BIT_RW)); + + v |=3D ((v >> _PAGE_BIT_DIRTY) & cond) << _PAGE_BIT_SAVED_DIRTY; + v &=3D ~(cond << _PAGE_BIT_DIRTY); + + return v; +} + +static inline unsigned long clear_saveddirty_shift(unsigned long v) +{ + unsigned long cond =3D !!(v & (1 << _PAGE_BIT_RW)); + + v |=3D ((v >> _PAGE_BIT_SAVED_DIRTY) & cond) << _PAGE_BIT_DIRTY; + v &=3D ~(cond << _PAGE_BIT_SAVED_DIRTY); + + return v; +} + +static inline pte_t pte_mksaveddirty(pte_t pte) +{ + pteval_t v =3D native_pte_val(pte); + + v =3D mksaveddirty_shift(v); + return native_make_pte(v); +} + +static inline pte_t pte_clear_saveddirty(pte_t pte) +{ + pteval_t v =3D native_pte_val(pte); + + v =3D clear_saveddirty_shift(v); + return native_make_pte(v); +} + static inline pte_t pte_wrprotect(pte_t pte) { return pte_clear_flags(pte, _PAGE_RW); @@ -413,6 +460,24 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_= t clear) return native_make_pmd(v & ~clear); } =20 +/* See comments above mksaveddirty_shift() */ +static inline pmd_t pmd_mksaveddirty(pmd_t pmd) +{ + pmdval_t v =3D native_pmd_val(pmd); + + v =3D mksaveddirty_shift(v); + return native_make_pmd(v); +} + +/* See comments above mksaveddirty_shift() */ +static inline pmd_t pmd_clear_saveddirty(pmd_t pmd) +{ + pmdval_t v =3D native_pmd_val(pmd); + + v =3D clear_saveddirty_shift(v); + return native_make_pmd(v); +} + static inline pmd_t pmd_wrprotect(pmd_t pmd) { return pmd_clear_flags(pmd, _PAGE_RW); @@ -484,6 +549,24 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_= t clear) return native_make_pud(v & ~clear); } =20 +/* See comments above mksaveddirty_shift() */ +static inline pud_t pud_mksaveddirty(pud_t pud) +{ + pudval_t v =3D native_pud_val(pud); + + v =3D mksaveddirty_shift(v); + return native_make_pud(v); +} + +/* See comments above mksaveddirty_shift() */ +static inline pud_t pud_clear_saveddirty(pud_t pud) +{ + pudval_t v =3D native_pud_val(pud); + + v =3D clear_saveddirty_shift(v); + return native_make_pud(v); +} + static inline pud_t pud_mkold(pud_t pud) { return pud_clear_flags(pud, _PAGE_ACCESSED); diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pg= table_types.h index 447d4bee25c4..ee6f8e57e115 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -21,7 +21,8 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ @@ -34,6 +35,13 @@ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 =20 +#ifdef CONFIG_X86_64 +#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit */ +#else +/* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */ +#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW2 /* Saved Dirty bit */ +#endif + /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL @@ -117,6 +125,22 @@ #define _PAGE_SOFTW4 (_AT(pteval_t, 0)) #endif =20 +/* + * The hardware requires shadow stack to be Write=3D0,Dirty=3D1. However, + * there are valid cases where the kernel might create read-only PTEs that + * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In + * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bi= t, + * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have + * (Write=3D0,SavedDirty=3D1,Dirty=3D0) set. + */ +#ifdef CONFIG_X86_64 +#define _PAGE_SAVED_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY) +#else +#define _PAGE_SAVED_DIRTY (_AT(pteval_t, 0)) +#endif + +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY) + #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) =20 /* @@ -125,9 +149,9 @@ * instance, and is *not* included in this mask since * pte_modify() does modify it. */ -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ - _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ +#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ + _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_BITS | \ + _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ _PAGE_UFFD_WP) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) =20 @@ -188,10 +212,16 @@ enum page_cache_mode { =20 #define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) #define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) + +/* + * Page tables needs to have Write=3D1 in order for any lower PTEs to be + * writable. This includes shadow stack memory (Write=3D0, Dirty=3D1) + */ #define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0) #define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC) #define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0) #define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC) + #define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G) #define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G) #define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| _= _NC) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflus= h.h index 75bfaa421030..965659d2c965 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -293,7 +293,8 @@ static inline bool pte_flags_need_flush(unsigned long o= ldflags, const pteval_t flush_on_clear =3D _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED; const pteval_t software_flags =3D _PAGE_SOFTW1 | _PAGE_SOFTW2 | - _PAGE_SOFTW3 | _PAGE_SOFTW4; + _PAGE_SOFTW3 | _PAGE_SOFTW4 | + _PAGE_SAVED_DIRTY; const pteval_t flush_on_change =3D _PAGE_RW | _PAGE_USER | _PAGE_PWT | _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT | _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 | --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 13782C88CB2 for ; Tue, 13 Jun 2023 00:13:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238964AbjFMANC (ORCPT ); Mon, 12 Jun 2023 20:13:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51934 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238695AbjFMAMZ (ORCPT ); Mon, 12 Jun 2023 20:12:25 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 33AAE1996; Mon, 12 Jun 2023 17:12:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615139; x=1718151139; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=m52H3JZHS5dKvCQFGYyapfckCiGOGTRhvdDeSntBBxk=; b=QY28iBE1VRwIzr0w1+d1sPUyhwOeLbtzUqxmISgNndXNT8KXB+rmz4MG nEpBMDa1hMXmF0tm/naPI4x8811U9wPlZzw/w1R1q7TZnn4fb1vbT9HDl 0Fr/MY8kSvDWeJnFrPMrgivTbTcVrI4/QVo5WafbE/II+DSA/rUvi1LJF ARapMg4l5qGppyWqsVO1mnDEOrLWIrGxmEPGdRKRwJagiVWDjDUHk/qHk uOzeWvVb9VPd0qgsSXkCRKqQgU15spXPqPIh34NCpo4+dEinnJyog9jVh oNco9BeRWA0AF/Om/StG6V+JDnEZp2uwHKHi/JgFNDmmT85cp5P9x8QV5 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556890" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556890" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:17 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671009" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671009" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:16 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Date: Mon, 12 Jun 2023 17:10:37 -0700 Message-Id: <20230613001108.3040476-12-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" When shadow stack is in use, Write=3D0,Dirty=3D1 PTE are preserved for shadow stack. Copy-on-write PTEs then have Write=3D0,SavedDirty=3D1. When a PTE goes from Write=3D1,Dirty=3D1 to Write=3D0,SavedDirty=3D1, it co= uld become a transient shadow stack PTE in two cases: 1. Some processors can start a write but end up seeing a Write=3D0 PTE by the time they get to the Dirty bit, creating a transient shadow stack PTE. However, this will not occur on processors supporting shadow stack, and a TLB flush is not necessary. 2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a transient shadow stack PTE can be created as a result. Prevent the second case when doing a write protection and Dirty->SavedDirty shift at the same time with a CMPXCHG loop. The first case Note, in the PAE case CMPXCHG will need to operate on 8 byte, but try_cmpxchg() will not use CMPXCHG8B, so it cannot operate on a full PAE PTE. However the exiting logic is not operating on a full 8 byte region either, and relies on the fact that the Write bit is in the first 4 bytes when doing the clear_bit(). Since both the Dirty, SavedDirty and Write bits are in the first 4 bytes, casting to a long will be similar to the existing behavior which also casts to a long. Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many insights to the issue. Jann Horn provided the CMPXCHG solution. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Use bit shifting helpers that don't need any extra conditional logic. (Linus) - Always do the SavedDirty shifting (Linus) --- arch/x86/include/asm/pgtable.h | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a95f872c7429..99b54ab0a919 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1189,7 +1189,17 @@ static inline pte_t ptep_get_and_clear_full(struct m= m_struct *mm, static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { - clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte); + /* + * Avoid accidentally creating shadow stack PTEs + * (Write=3D0,Dirty=3D1). Use cmpxchg() to prevent races with + * the hardware setting Dirty=3D1. + */ + pte_t old_pte, new_pte; + + old_pte =3D READ_ONCE(*ptep); + do { + new_pte =3D pte_wrprotect(old_pte); + } while (!try_cmpxchg((long *)&ptep->pte, (long *)&old_pte, *(long *)&new= _pte)); } =20 #define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) @@ -1241,7 +1251,17 @@ static inline pud_t pudp_huge_get_and_clear(struct m= m_struct *mm, static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp) { - clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp); + /* + * Avoid accidentally creating shadow stack PTEs + * (Write=3D0,Dirty=3D1). Use cmpxchg() to prevent races with + * the hardware setting Dirty=3D1. + */ + pmd_t old_pmd, new_pmd; + + old_pmd =3D READ_ONCE(*pmdp); + do { + new_pmd =3D pmd_wrprotect(old_pmd); + } while (!try_cmpxchg((long *)pmdp, (long *)&old_pmd, *(long *)&new_pmd)); } =20 #ifndef pmdp_establish --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 27522C88CB2 for ; Tue, 13 Jun 2023 00:13:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238976AbjFMANH (ORCPT ); Mon, 12 Jun 2023 20:13:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51950 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238803AbjFMAM2 (ORCPT ); Mon, 12 Jun 2023 20:12:28 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9DF31197; Mon, 12 Jun 2023 17:12:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615143; x=1718151143; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7eVLFaUcP7DtfYYGhz2IcxmOiAMntCwFG2k86EQ1hPY=; b=goZUV0P6gFKBYGIKAOm/P0F/uIFo+IlIBLC+CXnTZAKZTtfIaxNeMohx 4tz0GWKQ+EVG3QOAW9wWITARrbKuhMNG7ESA1iOIoDDNSa8YJNFnYnRsC r3EyEpyZiimgCLOXcgaUoOhdhZlLBft3QayzYw2AEXABMCUpSkUEC+9Y7 SIfHSFCpQ6pB5O7wngp8uqidh3HioAcL+B+VIMlI7vqBFk4eC/g0K0p/z I7RSv3/BjQFHzw3fJuYNhy46j8fl79eSiTTQOD/uUY4Dt2QqXIm+9KwHi 2hzm9LCaHCzBcNOZ5IGzKZIMtZwZrcoYHMJJofZoYNab2LUTKsB2D/P24 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556918" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556918" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:18 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671013" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671013" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:17 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 12/42] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Date: Mon, 12 Jun 2023 17:10:38 -0700 Message-Id: <20230613001108.3040476-13-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The recently introduced _PAGE_SAVED_DIRTY should be used instead of the HW Dirty bit whenever a PTE is Write=3D0, in order to not inadvertently create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply the same changes to pmd and pud. Since there is no x86 version of pte_mkwrite() to hold this arch specific logic, create one. Add it to x86/mm/pgtable.c instead of x86/asm/include/pgtable.h as future patches will require it to live in pgtable.c and it will make the diff easier for reviewers. Since CPUs without shadow stack support could create Write=3D0,Dirty=3D1 PTEs, only return true for pte_shstk() if the CPU also supports shadow stack. This will prevent these HW creates PTEs as showing as true for pte_write(). For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which was not necessarily created with any of the existing PTE bit helpers. That means that it can return a pte_t with Write=3D0,Dirty=3D1, a shadow stack PTE, when it did not intend to create one. Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid creating Write=3D0,Dirty=3D1 PTEs, pte_modify() needs to avoid: 1. Marking Write=3D0 PTEs Dirty=3D1 2. Marking Dirty=3D1 PTEs Write=3D0 The first case cannot happen as the existing behavior of pte_modify() is to filter out any Dirty bit passed in newprot. Handle the second case by shifting _PAGE_DIRTY=3D1 to _PAGE_SAVED_DIRTY=3D1 if the PTE was write protected by the pte_modify() call. Apply the same changes to pmd_modify(). Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Use bit shifting helpers that don't need any extra conditional logic. (Linus) - Handle the harmless Write=3D0->Write=3D1 pte_modify() case with the shifting helpers. - Don't ever return true for pte_shstk() if the CPU does not support shadow stack. --- arch/x86/include/asm/pgtable.h | 151 ++++++++++++++++++++++++++++----- arch/x86/mm/pgtable.c | 14 +++ 2 files changed, 144 insertions(+), 21 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 99b54ab0a919..d8724f5b1202 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -124,9 +124,15 @@ extern pmdval_t early_pmd_flags; * The following only work if pte_present() is true. * Undefined behaviour if not.. */ -static inline int pte_dirty(pte_t pte) +static inline bool pte_dirty(pte_t pte) { - return pte_flags(pte) & _PAGE_DIRTY; + return pte_flags(pte) & _PAGE_DIRTY_BITS; +} + +static inline bool pte_shstk(pte_t pte) +{ + return cpu_feature_enabled(X86_FEATURE_SHSTK) && + (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) =3D=3D _PAGE_DIRTY; } =20 static inline int pte_young(pte_t pte) @@ -134,9 +140,16 @@ static inline int pte_young(pte_t pte) return pte_flags(pte) & _PAGE_ACCESSED; } =20 -static inline int pmd_dirty(pmd_t pmd) +static inline bool pmd_dirty(pmd_t pmd) { - return pmd_flags(pmd) & _PAGE_DIRTY; + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; +} + +static inline bool pmd_shstk(pmd_t pmd) +{ + return cpu_feature_enabled(X86_FEATURE_SHSTK) && + (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) =3D=3D + (_PAGE_DIRTY | _PAGE_PSE); } =20 #define pmd_young pmd_young @@ -145,9 +158,9 @@ static inline int pmd_young(pmd_t pmd) return pmd_flags(pmd) & _PAGE_ACCESSED; } =20 -static inline int pud_dirty(pud_t pud) +static inline bool pud_dirty(pud_t pud) { - return pud_flags(pud) & _PAGE_DIRTY; + return pud_flags(pud) & _PAGE_DIRTY_BITS; } =20 static inline int pud_young(pud_t pud) @@ -157,13 +170,21 @@ static inline int pud_young(pud_t pud) =20 static inline int pte_write(pte_t pte) { - return pte_flags(pte) & _PAGE_RW; + /* + * Shadow stack pages are logically writable, but do not have + * _PAGE_RW. Check for them separately from _PAGE_RW itself. + */ + return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte); } =20 #define pmd_write pmd_write static inline int pmd_write(pmd_t pmd) { - return pmd_flags(pmd) & _PAGE_RW; + /* + * Shadow stack pages are logically writable, but do not have + * _PAGE_RW. Check for them separately from _PAGE_RW itself. + */ + return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd); } =20 #define pud_write pud_write @@ -350,7 +371,14 @@ static inline pte_t pte_clear_saveddirty(pte_t pte) =20 static inline pte_t pte_wrprotect(pte_t pte) { - return pte_clear_flags(pte, _PAGE_RW); + pte =3D pte_clear_flags(pte, _PAGE_RW); + + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PTE (Write=3D0,Dirty=3D1). Move the hardware + * dirty value to the software bit, if present. + */ + return pte_mksaveddirty(pte); } =20 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP @@ -388,7 +416,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) =20 static inline pte_t pte_mkclean(pte_t pte) { - return pte_clear_flags(pte, _PAGE_DIRTY); + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); } =20 static inline pte_t pte_mkold(pte_t pte) @@ -403,7 +431,16 @@ static inline pte_t pte_mkexec(pte_t pte) =20 static inline pte_t pte_mkdirty(pte_t pte) { - return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + pte =3D pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + + return pte_mksaveddirty(pte); +} + +static inline pte_t pte_mkwrite_shstk(pte_t pte) +{ + pte =3D pte_clear_flags(pte, _PAGE_RW); + + return pte_set_flags(pte, _PAGE_DIRTY); } =20 static inline pte_t pte_mkyoung(pte_t pte) @@ -416,6 +453,10 @@ static inline pte_t pte_mkwrite_novma(pte_t pte) return pte_set_flags(pte, _PAGE_RW); } =20 +struct vm_area_struct; +pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma); +#define pte_mkwrite pte_mkwrite + static inline pte_t pte_mkhuge(pte_t pte) { return pte_set_flags(pte, _PAGE_PSE); @@ -480,7 +521,14 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd) =20 static inline pmd_t pmd_wrprotect(pmd_t pmd) { - return pmd_clear_flags(pmd, _PAGE_RW); + pmd =3D pmd_clear_flags(pmd, _PAGE_RW); + + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PMD (RW=3D0, Dirty=3D1). Move the hardware + * dirty value to the software bit. + */ + return pmd_mksaveddirty(pmd); } =20 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP @@ -507,12 +555,21 @@ static inline pmd_t pmd_mkold(pmd_t pmd) =20 static inline pmd_t pmd_mkclean(pmd_t pmd) { - return pmd_clear_flags(pmd, _PAGE_DIRTY); + return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS); } =20 static inline pmd_t pmd_mkdirty(pmd_t pmd) { - return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + pmd =3D pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + + return pmd_mksaveddirty(pmd); +} + +static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) +{ + pmd =3D pmd_clear_flags(pmd, _PAGE_RW); + + return pmd_set_flags(pmd, _PAGE_DIRTY); } =20 static inline pmd_t pmd_mkdevmap(pmd_t pmd) @@ -535,6 +592,9 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_RW); } =20 +pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); +#define pmd_mkwrite pmd_mkwrite + static inline pud_t pud_set_flags(pud_t pud, pudval_t set) { pudval_t v =3D native_pud_val(pud); @@ -574,17 +634,26 @@ static inline pud_t pud_mkold(pud_t pud) =20 static inline pud_t pud_mkclean(pud_t pud) { - return pud_clear_flags(pud, _PAGE_DIRTY); + return pud_clear_flags(pud, _PAGE_DIRTY_BITS); } =20 static inline pud_t pud_wrprotect(pud_t pud) { - return pud_clear_flags(pud, _PAGE_RW); + pud =3D pud_clear_flags(pud, _PAGE_RW); + + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PUD (RW=3D0, Dirty=3D1). Move the hardware + * dirty value to the software bit. + */ + return pud_mksaveddirty(pud); } =20 static inline pud_t pud_mkdirty(pud_t pud) { - return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + pud =3D pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + + return pud_mksaveddirty(pud); } =20 static inline pud_t pud_mkdevmap(pud_t pud) @@ -604,7 +673,9 @@ static inline pud_t pud_mkyoung(pud_t pud) =20 static inline pud_t pud_mkwrite(pud_t pud) { - return pud_set_flags(pud, _PAGE_RW); + pud =3D pud_set_flags(pud, _PAGE_RW); + + return pud_clear_saveddirty(pud); } =20 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY @@ -721,6 +792,7 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 v= al, u64 mask); static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pteval_t val =3D pte_val(pte), oldval =3D val; + pte_t pte_result; =20 /* * Chop off the NX bit (if present), and add the NX portion of @@ -729,17 +801,54 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t ne= wprot) val &=3D _PAGE_CHG_MASK; val |=3D check_pgprot(newprot) & ~_PAGE_CHG_MASK; val =3D flip_protnone_guard(oldval, val, PTE_PFN_MASK); - return __pte(val); + + pte_result =3D __pte(val); + + /* + * To avoid creating Write=3D0,Dirty=3D1 PTEs, pte_modify() needs to avoi= d: + * 1. Marking Write=3D0 PTEs Dirty=3D1 + * 2. Marking Dirty=3D1 PTEs Write=3D0 + * + * The first case cannot happen because the _PAGE_CHG_MASK will filter + * out any Dirty bit passed in newprot. Handle the second case by + * going through the mksaveddirty exercise. Only do this if the old + * value was Write=3D1 to avoid doing this on Shadow Stack PTEs. + */ + if (oldval & _PAGE_RW) + pte_result =3D pte_mksaveddirty(pte_result); + else + pte_result =3D pte_clear_saveddirty(pte_result); + + return pte_result; } =20 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { pmdval_t val =3D pmd_val(pmd), oldval =3D val; + pmd_t pmd_result; =20 - val &=3D _HPAGE_CHG_MASK; + val &=3D (_HPAGE_CHG_MASK & ~_PAGE_DIRTY); val |=3D check_pgprot(newprot) & ~_HPAGE_CHG_MASK; val =3D flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK); - return __pmd(val); + + pmd_result =3D __pmd(val); + + /* + * To avoid creating Write=3D0,Dirty=3D1 PMDs, pte_modify() needs to avoi= d: + * 1. Marking Write=3D0 PMDs Dirty=3D1 + * 2. Marking Dirty=3D1 PMDs Write=3D0 + * + * The first case cannot happen because the _PAGE_CHG_MASK will filter + * out any Dirty bit passed in newprot. Handle the second case by + * going through the mksaveddirty exercise. Only do this if the old + * value was Write=3D1 to avoid doing this on Shadow Stack PTEs. + */ + if (oldval & _PAGE_RW) + pmd_result =3D pmd_mksaveddirty(pmd_result); + else + pmd_result =3D pmd_clear_saveddirty(pmd_result); + + return pmd_result; } =20 /* diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index e4f499eb0f29..0ad2c62ac0a8 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -880,3 +880,17 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) =20 #endif /* CONFIG_X86_64 */ #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ + +pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + pte =3D pte_mkwrite_novma(pte); + + return pte_clear_saveddirty(pte); +} + +pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) +{ + pmd =3D pmd_mkwrite_novma(pmd); + + return pmd_clear_saveddirty(pmd); +} --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EC346C7EE2E for ; Tue, 13 Jun 2023 00:13:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238980AbjFMANM (ORCPT ); Mon, 12 Jun 2023 20:13:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51946 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238738AbjFMAM0 (ORCPT ); Mon, 12 Jun 2023 20:12:26 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5C6C9F0; Mon, 12 Jun 2023 17:12:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615142; x=1718151142; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qno7FNbVCOlU0l2tsMlcvtzg5iSqFMH8Q3FgqO3yA3s=; b=HQRuUlSox7Nv4w8nTyOBqx3/ZuiQHHppNSrNzzabwGyn30BldrP+ftmG kg+hWfBNLPIXsfkxqD55ZDOnylyC9LjyIGgnwwcfraLeNgRscX9h6KCno +mt/7Tq2xLyrKoK8bgA+iyqq2znz8MGvf/oU174MrggH4Ab2c8lIQtoYc TvcqvENX8sEdZZ5nnjECZgeg5biSBxZiJjBJXK+fjbU/+NRiq7enCkbEq 8oLJliAvQz6yeERjjLOOvfD1dcBK4viqdFyv3/UaRyF1iVAFvdDBm86RM h2f6WSpDWvFYM8CL8iUQqHNjYthEU6SvO1xtmtUu9K7/F+c9i3TrN3UCp Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556946" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556946" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:19 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671016" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671016" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:18 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 13/42] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Date: Mon, 12 Jun 2023 17:10:39 -0700 Message-Id: <20230613001108.3040476-14-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" New processors that support Shadow Stack regard Write=3D0,Dirty=3D1 PTEs as shadow stack pages. In normal cases, it can be helpful to create Write=3D1 PTEs as also Dirty= =3D1 if HW dirty tracking is not needed, because if the Dirty bit is not already set the CPU has to set Dirty=3D1 when the memory gets written to. This creates additional work for the CPU. So traditional wisdom was to simply set the Dirty bit whenever you didn't care about it. However, it was never really very helpful for read-only kernel memory. When CR4.CET=3D1 and IA32_S_CET.SH_STK_EN=3D1, some instructions can write = to such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so avoiding kernel Write=3D0,Dirty=3D1 memory is not strictly needed for any functional reason. But having Write=3D0,Dirty=3D1 kernel memory doesn't have any functional benefit either, so to reduce ambiguity between shadow stack and regular Write=3D0 pages, remove Dirty=3D1 from any kernel Write=3D0 PTE= s. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/pgtable_types.h | 8 +++++--- arch/x86/mm/pat/set_memory.c | 4 ++-- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pg= table_types.h index ee6f8e57e115..26f07d6d5758 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -222,10 +222,12 @@ enum page_cache_mode { #define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0) #define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC) =20 -#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G) -#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G) +#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G) +#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G) +#define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) +#define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) #define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| _= _NC) -#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G) +#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G) #define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G) #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G) #define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP) diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 7159cf787613..fc627acfe40e 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2073,12 +2073,12 @@ int set_memory_nx(unsigned long addr, int numpages) =20 int set_memory_ro(unsigned long addr, int numpages) { - return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0); + return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_= DIRTY), 0); } =20 int set_memory_rox(unsigned long addr, int numpages) { - pgprot_t clr =3D __pgprot(_PAGE_RW); + pgprot_t clr =3D __pgprot(_PAGE_RW | _PAGE_DIRTY); =20 if (__supported_pte_mask & _PAGE_NX) clr.pgprot |=3D _PAGE_NX; --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 92592C7EE2E for ; Tue, 13 Jun 2023 00:13:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239006AbjFMANW (ORCPT ); Mon, 12 Jun 2023 20:13:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52140 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238845AbjFMAMe (ORCPT ); Mon, 12 Jun 2023 20:12:34 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B1FC51730; Mon, 12 Jun 2023 17:12:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615146; x=1718151146; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=svMWn5R9eThoiar+k0qdORAiS7Vwyk8B+BTeN3eLytc=; b=n8if8qdQukN6W8TFmTvv6muKDj4494uFDuN6u1F5bMSfjuF5qAXnnQTP ucPwzsXPUCkf/zNUJ/SubFkseQj/EOOSA1UaozanhV5uAp/yJukoXs6ha mnAjNh4Fw0yRlMxyzgKtEw0vmz3/QQxLIRsAUrelYmH+XNIH9W+gx6Ja6 BUbd26SHCizexRqcNBjfg/vpzbfjZ/NeuIuKlUV3bGLj8GYEiCCOGdH3L uJfwABlj6Fj9E6OX7umvycvJ+eD0Qp86f5lYriCkbUrqmekNIZjAO1yFT zM5j3yvLZ40VVMPqFe6poR5uhFC/MHWQadrCOEnwjns+FcE21gQQ4GdoU Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556974" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556974" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:20 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671022" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671022" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:19 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Date: Mon, 12 Jun 2023 17:10:40 -0700 Message-Id: <20230613001108.3040476-15-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Yu-cheng Yu New hardware extensions implement support for shadow stack memory, such as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to identify these areas, for example, to be used to properly indicate shadow stack PTEs to the hardware. Shadow stack VMA creation will be tightly controlled and limited to anonymous memory to make the implementation simpler and since that is all that is required. The solution will rely on pte_mkwrite() to create the shadow stack PTEs, so it will not be required for vm_get_page_prot() to learn how to create shadow stack memory. For this reason document that VM_SHADOW_STACK should not be mixed with VM_SHARED. Signed-off-by: Yu-cheng Yu Co-developed-by: Rick Edgecombe Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Kirill A. Shutemov Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Acked-by: David Hildenbrand Reviewed-by: Mark Brown Tested-by: Mark Brown --- Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 3 +++ include/linux/mm.h | 8 ++++++++ 3 files changed, 12 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems= /proc.rst index 7897a7dafcbc..6ccb57089a06 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -566,6 +566,7 @@ encoded manner. The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking + ss shadow stack page =3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 Note that there is no guarantee that every flag and associated mnemonic wi= ll diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 420510f6a545..38b19a757281 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, str= uct vm_area_struct *vma) #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR [ilog2(VM_UFFD_MINOR)] =3D "ui", #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ +#ifdef CONFIG_X86_USER_SHADOW_STACK + [ilog2(VM_SHADOW_STACK)] =3D "ss", +#endif }; size_t i; =20 diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f52c1e7c640..fb17cbd531ac 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -319,11 +319,13 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ =20 #ifdef CONFIG_ARCH_HAS_PKEYS @@ -339,6 +341,12 @@ extern unsigned int kobjsize(const void *objp); #endif #endif /* CONFIG_ARCH_HAS_PKEYS */ =20 +#ifdef CONFIG_X86_USER_SHADOW_STACK +# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARE= D */ +#else +# define VM_SHADOW_STACK VM_NONE +#endif + #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95891C88CB2 for ; Tue, 13 Jun 2023 00:13:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239017AbjFMAN2 (ORCPT ); Mon, 12 Jun 2023 20:13:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52162 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238870AbjFMAMf (ORCPT ); Mon, 12 Jun 2023 20:12:35 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CD5A81728; Mon, 12 Jun 2023 17:12:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615146; x=1718151146; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VxsR8A3vgFZdU2Uya7vVGhJBClJyO0tAkA+BqmUO8Zs=; b=MAkibPwx6HjXO50QeclLoNDH6D2TGh73wITGile4ozU9mHcxnH+PheEm qOjOV8daMbTI/KuLiGHq4DTFRyRxFm676WzWhjn4w9iO9g845IDe9z26b fLqErGBXzqHHP/1PSPelFvMbtbKi/g/NPubKjam5QPWQAuZRmOJWDXiDi DjWXx/G850tGLUankXiSV3sx/AD+jF9uujteIqChg2rfcgOqef46XIszU jmuYjA/lcME4NaA6VS+8A3AViz/M9lJ7dOR1FP9CBYfFijfQsY8GfK1O3 lk4CwotlRYtaR+WONZpp2wd+jlQh8t2dx/ZYGlXPs7xRyo1an1mbNDzBZ g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361556998" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361556998" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671026" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671026" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:20 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 15/42] x86/mm: Check shadow stack page fault errors Date: Mon, 12 Jun 2023 17:10:41 -0700 Message-Id: <20230613001108.3040476-16-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The CPU performs "shadow stack accesses" when it expects to encounter shadow stack mappings. These accesses can be implicit (via CALL/RET instructions) or explicit (instructions like WRSS). Shadow stack accesses to shadow-stack mappings can result in faults in normal, valid operation just like regular accesses to regular mappings. Shadow stacks need some of the same features like delayed allocation, swap and copy-on-write. The kernel needs to use faults to implement those features. The architecture has concepts of both shadow stack reads and shadow stack writes. Any shadow stack access to non-shadow stack memory will generate a fault with the shadow stack error code bit set. This means that, unlike normal write protection, the fault handler needs to create a type of memory that can be written to (with instructions that generate shadow stack writes), even to fulfill a read access. So in the case of COW memory, the COW needs to take place even with a shadow stack read. Otherwise the page will be left (shadow stack) writable in userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE for shadow stack accesses, even if the access was a shadow stack read. For the purpose of making this clearer, consider the following example. If a process has a shadow stack, and forks, the shadow stack PTEs will become read-only due to COW. If the CPU in one process performs a shadow stack read access to the shadow stack, for example executing a RET and causing the CPU to read the shadow stack copy of the return address, then in order for the fault to be resolved the PTE will need to be set with shadow stack permissions. But then the memory would be changeable from userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger COW, otherwise the shared page would be changeable from both processes. Shadow stack accesses can also result in errors, such as when a shadow stack overflows, or if a shadow stack access occurs to a non-shadow-stack mapping. Also, generate the errors for invalid shadow stack accesses. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/trap_pf.h | 2 ++ arch/x86/mm/fault.c | 22 ++++++++++++++++++++++ 2 files changed, 24 insertions(+) diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h index 10b1de500ab1..afa524325e55 100644 --- a/arch/x86/include/asm/trap_pf.h +++ b/arch/x86/include/asm/trap_pf.h @@ -11,6 +11,7 @@ * bit 3 =3D=3D 1: use of reserved bit detected * bit 4 =3D=3D 1: fault was an instruction fetch * bit 5 =3D=3D 1: protection keys block access + * bit 6 =3D=3D 1: shadow stack access fault * bit 15 =3D=3D 1: SGX MMU page-fault */ enum x86_pf_error_code { @@ -20,6 +21,7 @@ enum x86_pf_error_code { X86_PF_RSVD =3D 1 << 3, X86_PF_INSTR =3D 1 << 4, X86_PF_PK =3D 1 << 5, + X86_PF_SHSTK =3D 1 << 6, X86_PF_SGX =3D 1 << 15, }; =20 diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e4399983c50c..fe68119ce2cc 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1118,8 +1118,22 @@ access_error(unsigned long error_code, struct vm_are= a_struct *vma) (error_code & X86_PF_INSTR), foreign)) return 1; =20 + /* + * Shadow stack accesses (PF_SHSTK=3D1) are only permitted to + * shadow stack VMAs. All other accesses result in an error. + */ + if (error_code & X86_PF_SHSTK) { + if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK))) + return 1; + if (unlikely(!(vma->vm_flags & VM_WRITE))) + return 1; + return 0; + } + if (error_code & X86_PF_WRITE) { /* write, present and write, not present: */ + if (unlikely(vma->vm_flags & VM_SHADOW_STACK)) + return 1; if (unlikely(!(vma->vm_flags & VM_WRITE))) return 1; return 0; @@ -1311,6 +1325,14 @@ void do_user_addr_fault(struct pt_regs *regs, =20 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); =20 + /* + * Read-only permissions can not be expressed in shadow stack PTEs. + * Treat all shadow stack accesses as WRITE faults. This ensures + * that the MM will prepare everything (e.g., break COW) such that + * maybe_mkwrite() can create a proper shadow stack PTE. + */ + if (error_code & X86_PF_SHSTK) + flags |=3D FAULT_FLAG_WRITE; if (error_code & X86_PF_WRITE) flags |=3D FAULT_FLAG_WRITE; if (error_code & X86_PF_INSTR) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99122C7EE2E for ; Tue, 13 Jun 2023 00:14:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239042AbjFMAOD (ORCPT ); Mon, 12 Jun 2023 20:14:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52218 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238907AbjFMAMh (ORCPT ); Mon, 12 Jun 2023 20:12:37 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C89F918C; Mon, 12 Jun 2023 17:12:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615148; x=1718151148; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=oZdKPqbrR7Qk+U72in77o4gzn1lGdOYqXWRfhhqI1mg=; b=lgizI1y9RJzdU8F/vZYEjcUSqeNizgbKGojD/5jhzbPP3PcuX5KnK5ii d/2nrGuUDxwfwyVKQUDkAbyv7TKi8qxI/vYQ2b/n99dj7A/zOXjLrImpZ lfG9XL7xBRXCqA1RJWGow0wK5i5z6koEgmo4WpKvdszPcFKWDjAc3Hkqs KCjqo1bzNVoDPWIyAdRtbYYdcTUPyu/SdPJmRN0jjhpx+0BgsPieHRGkQ YYibl6hyWXN8w302fN7uD9GD+Ht1vCJGpaum2tCC8HD058UT+e+hmoACy JOFWG9QIsSVGqUO1pEV9V4eI55Mqn3cF7AGnyUiqWxM9Rd/D8BWm+u3Ed w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557020" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557020" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671031" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671031" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:20 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 16/42] mm: Add guard pages around a shadow stack. Date: Mon, 12 Jun 2023 17:10:42 -0700 Message-Id: <20230613001108.3040476-17-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. The architecture of shadow stack constrains the ability of userspace to move the shadow stack pointer (SSP) in order to prevent corrupting or switching to other shadow stacks. The RSTORSSP instruction can move the SSP to different shadow stacks, but it requires a specially placed token in order to do this. However, the architecture does not prevent incrementing the stack pointer to wander onto an adjacent shadow stack. To prevent this in software, enforce guard pages at the beginning of shadow stack VMAs, such that there will always be a gap between adjacent shadow stacks. Make the gap big enough so that no userspace SSP changing operations (besides RSTORSSP), can move the SSP from one stack to the next. The SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and RET can move the SSP by a maximum of 8 bytes, at which point the shadow stack would be accessed. The INCSSP instruction can also increment the shadow stack pointer. It is the shadow stack analog of an instruction like: addq $0x80, %rsp However, there is one important difference between an ADD on %rsp and INCSSP. In addition to modifying SSP, INCSSP also reads from the memory of the first and last elements that were "popped". It can be thought of as acting like this: READ_ONCE(ssp); // read+discard top element on stack ssp +=3D nr_to_pop * 8; // move the shadow stack READ_ONCE(ssp-8); // read+discard last popped stack element The maximum distance INCSSP can move the SSP is 2040 bytes, before it would read the memory. Therefore, a single page gap will be enough to prevent any operation from shifting the SSP to an adjacent stack, since it would have to land in the gap at least once, causing a fault. This could be accomplished by using VM_GROWSDOWN, but this has a downside. The behavior would allow shadow stacks to grow, which is unneeded and adds a strange difference to how most regular stacks work. In the maple tree code, there is some logic for retrying the unmapped area search if a guard gap is violated. This retry should happen for shadow stack guard gap violations as well. This logic currently only checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have a start gap as well, create an new define VM_STARTGAP_FLAGS to hold all the VM flag bits that have start gaps, and make mmap use it. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Reviewed-by: Mark Brown --- v9: - Add logic needed to still have guard gaps with maple tree. --- include/linux/mm.h | 54 ++++++++++++++++++++++++++++++++++++++++------ mm/mmap.c | 4 ++-- 2 files changed, 50 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fb17cbd531ac..535c58d3b2e4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void *objp); #endif /* CONFIG_ARCH_HAS_PKEYS */ =20 #ifdef CONFIG_X86_USER_SHADOW_STACK -# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARE= D */ +/* + * This flag should not be set with VM_SHARED because of lack of support + * core mm. It will also get a guard page. This helps userspace protect + * itself from attacks. The reasoning is as follows: + * + * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The + * INCSSP instruction can increment the shadow stack pointer. It is the + * shadow stack analog of an instruction like: + * + * addq $0x80, %rsp + * + * However, there is one important difference between an ADD on %rsp + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the + * memory of the first and last elements that were "popped". It can be + * thought of as acting like this: + * + * READ_ONCE(ssp); // read+discard top element on stack + * ssp +=3D nr_to_pop * 8; // move the shadow stack + * READ_ONCE(ssp-8); // read+discard last popped stack element + * + * The maximum distance INCSSP can move the SSP is 2040 bytes, before + * it would read the memory. Therefore a single page gap will be enough + * to prevent any operation from shifting the SSP to an adjacent stack, + * since it would have to land in the gap at least once, causing a + * fault. + * + * Prevent using INCSSP to move the SSP between shadow stacks by + * having a PAGE_SIZE guard gap. + */ +# define VM_SHADOW_STACK VM_HIGH_ARCH_5 #else # define VM_SHADOW_STACK VM_NONE #endif @@ -405,6 +434,8 @@ extern unsigned int kobjsize(const void *objp); #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS #endif =20 +#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK) + #ifdef CONFIG_STACK_GROWSUP #define VM_STACK VM_GROWSUP #else @@ -3235,15 +3266,26 @@ struct vm_area_struct *vma_lookup(struct mm_struct = *mm, unsigned long addr) return mtree_load(&mm->mm_mt, addr); } =20 +static inline unsigned long stack_guard_start_gap(struct vm_area_struct *v= ma) +{ + if (vma->vm_flags & VM_GROWSDOWN) + return stack_guard_gap; + + /* See reasoning around the VM_SHADOW_STACK definition */ + if (vma->vm_flags & VM_SHADOW_STACK) + return PAGE_SIZE; + + return 0; +} + static inline unsigned long vm_start_gap(struct vm_area_struct *vma) { + unsigned long gap =3D stack_guard_start_gap(vma); unsigned long vm_start =3D vma->vm_start; =20 - if (vma->vm_flags & VM_GROWSDOWN) { - vm_start -=3D stack_guard_gap; - if (vm_start > vma->vm_start) - vm_start =3D 0; - } + vm_start -=3D gap; + if (vm_start > vma->vm_start) + vm_start =3D 0; return vm_start; } =20 diff --git a/mm/mmap.c b/mm/mmap.c index afdf5f78432b..d4793600a8d4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1570,7 +1570,7 @@ static unsigned long unmapped_area(struct vm_unmapped= _area_info *info) gap =3D mas.index; gap +=3D (info->align_offset - gap) & info->align_mask; tmp =3D mas_next(&mas, ULONG_MAX); - if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possi= ble */ + if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if = possible */ if (vm_start_gap(tmp) < gap + length - 1) { low_limit =3D tmp->vm_end; mas_reset(&mas); @@ -1622,7 +1622,7 @@ static unsigned long unmapped_area_topdown(struct vm_= unmapped_area_info *info) gap -=3D (gap - info->align_offset) & info->align_mask; gap_end =3D mas.last; tmp =3D mas_next(&mas, ULONG_MAX); - if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possi= ble */ + if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if = possible */ if (vm_start_gap(tmp) <=3D gap_end) { high_limit =3D vm_start_gap(tmp); mas_reset(&mas); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A61BDC88CB8 for ; Tue, 13 Jun 2023 00:15:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239134AbjFMAPI (ORCPT ); Mon, 12 Jun 2023 20:15:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52892 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239019AbjFMAN3 (ORCPT ); Mon, 12 Jun 2023 20:13:29 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC2F91BD4; Mon, 12 Jun 2023 17:12:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615154; x=1718151154; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=dhZWE6YTdZ+FD9PI/U6uJC2MaFgGrtV9CcOxkJ3S4o4=; b=nKs6GDAfAmXL3lLEMiSLw4dNNNs7s5A0Stx5EeIE6X9n4SO805ioHHFN eFUI+pVLjs79BDazntQy/EVRVkoXCZsQXPey8yNjshgQxvR8UJTzwrbJO Ug4HNzoqjKRYyB2akgYKDOzPKJPxdvNEL6vMEdPTlKs/nv+4KxDu6IuxP aHrC41y37agZj2bNv7iNNls1mG0Bw2Xxtdi873dw8akSgzFl6dhDDUhWx jTikhY5vmUOvOyOPnYoOPRbSh+CagAtYlIdcB/vd3Sg46JvauC+3ycJYZ C30ZZJrDXSYoBb4IVhQs6I8z12z07fgXjHPw7LOplyOal1i72UTeoQ5kT g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557047" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557047" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671035" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671035" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:21 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma Date: Mon, 12 Jun 2023 17:10:43 -0700 Message-Id: <20230613001108.3040476-18-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One sharp edge is that PTEs that are both Write=3D0 and Dirty=3D1 are treated as shadow by the CPU, but this combination used to be created by the kernel on x86. Previous patches have changed the kernel to now avoid creating these PTEs unless they are for shadow stack memory. In case any missed corners of the kernel are still creating PTEs like this for non-shadow stack memory, and to catch any re-introductions of the logic, warn if any shadow stack PTEs (Write=3D0, Dirty=3D1) are found in non-shadow stack VMAs when they are being zapped. This won't catch transient cases but should have decent coverage. In order to check if a PTE is shadow stack in core mm code, add two arch breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack specific code to be kept in arch/x86. Only do the check if shadow stack is supported by the CPU and configured because in rare cases older CPUs may write Dirty=3D1 to a Write=3D0 CPU on older CPUs. This check is handled in pte_shstk()/pmd_shstk(). Signed-off-by: Rick Edgecombe Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Reviewed-by: Mark Brown --- v9: - Add comments about not doing the check on non-shadow stack CPUs --- arch/x86/include/asm/pgtable.h | 6 ++++++ arch/x86/mm/pgtable.c | 20 ++++++++++++++++++++ include/linux/pgtable.h | 14 ++++++++++++++ mm/huge_memory.c | 1 + mm/memory.c | 1 + 5 files changed, 42 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index d8724f5b1202..89cfa93d0ad6 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1664,6 +1664,12 @@ static inline bool arch_has_hw_pte_young(void) return true; } =20 +#define arch_check_zapped_pte arch_check_zapped_pte +void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); + +#define arch_check_zapped_pmd arch_check_zapped_pmd +void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd); + #ifdef CONFIG_XEN_PV #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young static inline bool arch_has_hw_nonleaf_pmd_young(void) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 0ad2c62ac0a8..101e721d74aa 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -894,3 +894,23 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vm= a) =20 return pmd_clear_saveddirty(pmd); } + +void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte) +{ + /* + * Hardware before shadow stack can (rarely) set Dirty=3D1 + * on a Write=3D0 PTE. So the below condition + * only indicates a software bug when shadow stack is + * supported by the HW. This checking is covered in + * pte_shstk(). + */ + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && + pte_shstk(pte)); +} + +void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd) +{ + /* See note in arch_check_zapped_pte() */ + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && + pmd_shstk(pmd)); +} diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 0f3cf726812a..feb1fd2c814f 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -291,6 +291,20 @@ static inline bool arch_has_hw_pte_young(void) } #endif =20 +#ifndef arch_check_zapped_pte +static inline void arch_check_zapped_pte(struct vm_area_struct *vma, + pte_t pte) +{ +} +#endif + +#ifndef arch_check_zapped_pmd +static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, + pmd_t pmd) +{ +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 37dd56b7b3d1..c3cc20c1b26c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1681,6 +1681,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_ar= ea_struct *vma, */ orig_pmd =3D pmdp_huge_get_and_clear_full(vma, addr, pmd, tlb->fullmm); + arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); if (vma_is_special_huge(vma)) { if (arch_needs_pgtable_deposit()) diff --git a/mm/memory.c b/mm/memory.c index c1b6fe944c20..40c0b233b61d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1412,6 +1412,7 @@ static unsigned long zap_pte_range(struct mmu_gather = *tlb, continue; ptent =3D ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7680C88CB2 for ; Tue, 13 Jun 2023 00:15:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238907AbjFMAPV (ORCPT ); Mon, 12 Jun 2023 20:15:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53206 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239022AbjFMAN4 (ORCPT ); Mon, 12 Jun 2023 20:13:56 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 43BA81BDC; Mon, 12 Jun 2023 17:12:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615155; x=1718151155; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ioNZyjx88vkgqQos5HBjwuXN8Qld1af278Y586V7kGQ=; b=jrL0WHpcBkHj3QYgXv2hBY/Ru0gbYbl9aOhT0ffF8Yxx9A7RpOEezA7d 4drBnPu33Qj9l1tAypE4TJl0cAru7b1OxQckh31A5sYUja3AJyEnmipMG CzyT2JOxbh3IMStaROCNNwRVCQb4hWAyx99qG6MypQ2RsRRK2/lZGsd+Z f8jbOvNVBPoRgNMJ7E9C4PL2qZWhM22glD3JFWBywa6eznOzM6tDHSN5Q jsmnxb3Z2loQFm+wCVhCgY93pErg05k6vDxfcqSJrtEgWKXpfO5McKxd/ 3nnGLFPOEwTByd6Yy9785wKchaGjP0F3nU/TWWUzGyEMNmn4bjp8bcsHo g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557071" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557071" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:23 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671040" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671040" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:22 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 18/42] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Date: Mon, 12 Jun 2023 17:10:44 -0700 Message-Id: <20230613001108.3040476-19-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" When user shadow stack is in use, Write=3D0,Dirty=3D1 is treated by the CPU= as shadow stack memory. So for shadow stack memory this bit combination is valid, but when Dirty=3D1,Write=3D1 (conventionally writable) memory is bei= ng write protected, the kernel has been taught to transition the Dirty=3D1 bit to SavedDirty=3D1, to avoid inadvertently creating shadow stack memory. It does this inside pte_wrprotect() because it knows the PTE is not intended to be a writable shadow stack entry, it is supposed to be write protected. However, when a PTE is created by a raw prot using mk_pte(), mk_pte() can't know whether to adjust Dirty=3D1 to SavedDirty=3D1. It can't distinguish between the caller intending to create a shadow stack PTE or needing the SavedDirty shift. The kernel has been updated to not do this, and so Write=3D0,Dirty=3D1 memory should only be created by the pte_mkfoo() helpers. Add a warning to make sure no new mk_pte() start doing this, like, for example, set_memory_rox() did. Signed-off-by: Rick Edgecombe Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Always do the check since 32 bit now supports SavedDirty --- arch/x86/include/asm/pgtable.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 89cfa93d0ad6..5383f7282f89 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1032,7 +1032,14 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) * (Currently stuck as a macro because of indirect forward reference * to linux/mm.h:page_to_nid()) */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) +#define mk_pte(page, pgprot) \ +({ \ + pgprot_t __pgprot =3D pgprot; \ + \ + WARN_ON_ONCE((pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) =3D=3D \ + _PAGE_DIRTY); \ + pfn_pte(page_to_pfn(page), __pgprot); \ +}) =20 static inline int pmd_bad(pmd_t pmd) { --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CB52CC88CB5 for ; Tue, 13 Jun 2023 00:15:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238710AbjFMAPY (ORCPT ); Mon, 12 Jun 2023 20:15:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52288 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239064AbjFMAOF (ORCPT ); Mon, 12 Jun 2023 20:14:05 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 747A01BE5; Mon, 12 Jun 2023 17:12:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615157; x=1718151157; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0w0eM3eQB07Ass1qJGPwxV0mmwv4IZmjBMBIbi82ENo=; b=ReZ5iDxtnRq/j9yOwTVVm8jfsnwBm5XBzlh2zJrrkvlhQ4Sw4nW3aQ0x FZQtGGk3lOA9NIbXbS7iz6esqfttDuhVz05nhygtkjTnQDYBEsSYcVrla gO/7pVN5RpeiK1tH42ldhPb3FBiDpz8gBe8WonZbzFy1fBUC8jf+tIn9G 9nTbtQo4hC4p+VZjDST0dKlQ3OJKzQo7ZwDcCyLkJzrtZ+wxV3Frhn0v+ OKhtfCOhz0i32BNjTL1smFYtrCD/KE+yjLF8BifgWD5xZabAkZ1W7+LIS 9FOX/bOIDh90mmht5l41dCvjRxos7UlYE4QqUfiNQql7zYHZGYmmenc+K w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557093" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557093" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:24 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671045" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671045" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:23 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 19/42] mm/mmap: Add shadow stack pages to memory accounting Date: Mon, 12 Jun 2023 17:10:45 -0700 Message-Id: <20230613001108.3040476-20-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Acked-by: David Hildenbrand Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- mm/internal.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 68410c6d97ac..dd2ded32d3d5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -535,14 +535,14 @@ static inline bool is_exec_mapping(vm_flags_t flags) } =20 /* - * Stack area - automatically grows in one direction + * Stack area (including shadow stacks) * * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous: * do_mmap() forbids all other combinations. */ static inline bool is_stack_mapping(vm_flags_t flags) { - return (flags & VM_STACK) =3D=3D VM_STACK; + return ((flags & VM_STACK) =3D=3D VM_STACK) || (flags & VM_SHADOW_STACK); } =20 /* --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 491D1C7EE43 for ; Tue, 13 Jun 2023 00:15:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239275AbjFMAPy (ORCPT ); Mon, 12 Jun 2023 20:15:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53522 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235018AbjFMAOY (ORCPT ); Mon, 12 Jun 2023 20:14:24 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3AA1F173C; Mon, 12 Jun 2023 17:12:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615175; x=1718151175; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=EOoWyIFrsGRkjg2/O3hP5P9SjQgryxLlQpl2aAV/598=; b=cDL23HLNne7N8bc6doKlwnM2fFAxzlxsDwkJ0dTESoLpYtOeUl39B27B 7QxMUmcJrDITpgoSyBmEG2SHEecLMu4A2u7e9jNQeT9qRhN5aga283Ag5 h4rRsN3T0ikADWQtc/iRdzzNPm4ndRkx75rxCGBzWL+PBB6YDqyApWuXd YI7809pftOFgxdNHxdi8/u+0GzxyOCJSkZANewLhQ9phzux8v2VmlENsW 8FOeh2ewdf8Vv+Jh4McLcAOHAIzgSgxu/aa2Mc/14xbFLOVB0jSbtUCZS XyIpXhFt+pgU0Qsqj8ylieTeN6SrO7B39hOJqXmuvbKVhX05X+KrO13EY g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557117" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557117" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:25 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671048" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671048" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:24 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 20/42] x86/mm: Introduce MAP_ABOVE4G Date: Mon, 12 Jun 2023 17:10:46 -0700 Message-Id: <20230613001108.3040476-21-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which require some core mm changes to function properly. One of the properties is that the shadow stack pointer (SSP), which is a CPU register that points to the shadow stack like the stack pointer points to the stack, can't be pointing outside of the 32 bit address space when the CPU is executing in 32 bit mode. It is desirable to prevent executing in 32 bit mode when shadow stack is enabled because the kernel can't easily support 32 bit signals. On x86 it is possible to transition to 32 bit mode without any special interaction with the kernel, by doing a "far call" to a 32 bit segment. So the shadow stack implementation can use this address space behavior as a feature, by enforcing that shadow stack memory is always mapped outside of the 32 bit address space. This way userspace will trigger a general protection fault which will in turn trigger a segfault if it tries to transition to 32 bit mode with shadow stack enabled. This provides a clean error generating border for the user if they try attempt to do 32 bit mode shadow stack, rather than leave the kernel in a half working state for userspace to be surprised by. So to allow future shadow stack enabling patches to map shadow stacks out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior is pretty much like MAP_32BIT, except that it has the opposite address range. The are a few differences though. If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit syscall. Since the default search behavior is top down, the normal kaslr base can be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its own randomization in the bottom up case. For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G both are potentially valid, so both are used. In the bottomup search path, the default behavior is already consistent with MAP_ABOVE4G since mmap base should be above 4GB. Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB. So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode with shadow stack enabled would usually segfault anyway. This is already pretty decent guard rails. But the addition of MAP_ABOVE4G is some small complexity spent to make it make it more complete. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/uapi/asm/mman.h | 1 + arch/x86/kernel/sys_x86_64.c | 6 +++++- include/linux/mman.h | 4 ++++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/m= man.h index 775dbd3aff73..5a0256e73f1e 100644 --- a/arch/x86/include/uapi/asm/mman.h +++ b/arch/x86/include/uapi/asm/mman.h @@ -3,6 +3,7 @@ #define _ASM_X86_MMAN_H =20 #define MAP_32BIT 0x40 /* only give out 32bit addresses */ +#define MAP_ABOVE4G 0x80 /* only map above 4GB */ =20 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS #define arch_calc_vm_prot_bits(prot, key) ( \ diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 8cc653ffdccd..c783aeb37dce 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, cons= t unsigned long addr0, =20 info.flags =3D VM_UNMAPPED_AREA_TOPDOWN; info.length =3D len; - info.low_limit =3D PAGE_SIZE; + if (!in_32bit_syscall() && (flags & MAP_ABOVE4G)) + info.low_limit =3D SZ_4G; + else + info.low_limit =3D PAGE_SIZE; + info.high_limit =3D get_mmap_base(0); =20 /* diff --git a/include/linux/mman.h b/include/linux/mman.h index cee1e4b566d8..40d94411d492 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -15,6 +15,9 @@ #ifndef MAP_32BIT #define MAP_32BIT 0 #endif +#ifndef MAP_ABOVE4G +#define MAP_ABOVE4G 0 +#endif #ifndef MAP_HUGE_2MB #define MAP_HUGE_2MB 0 #endif @@ -50,6 +53,7 @@ | MAP_STACK \ | MAP_HUGETLB \ | MAP_32BIT \ + | MAP_ABOVE4G \ | MAP_HUGE_2MB \ | MAP_HUGE_1GB) =20 --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F2A2C88CBB for ; Tue, 13 Jun 2023 00:15:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239256AbjFMAPu (ORCPT ); Mon, 12 Jun 2023 20:15:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52424 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233336AbjFMAOY (ORCPT ); Mon, 12 Jun 2023 20:14:24 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 17AA41FD4; Mon, 12 Jun 2023 17:12:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615176; x=1718151176; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=c7sunRbrNdSwt/1KmOuIzzKZkXv73NcNzDYIVaoBldo=; b=G03JU2S0Evi/fePKI8tofmFf6jCfbNuseYNk3pBFLsEjtODG0wjoLGX+ fTtSibEDt9OAP17er1I4Y2zOUV+gLWqXTiu7hsS6hUTYyhxkDSadT8Ihk C5CDBUZgCNPaY5h5CR/sbJg/4LKePeZkxee12wEJJxmjKh4tipSR+a9Qc w5tc+YhEQtoNcTjqPXbqzwshaVvGDWwcJzqPt9OrfXmTvLLO3tyz384MH 2E5LkLl5e2e07zNSx8KSt4z1kHIfUHdPl8zbXgCZdRIuKlsGxnubCYOtY P3ghlZM2+r1GM7YyFY84GurNIdlWtTMvT2smfRg87Qwe8bKcd9xoMsz4a w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557144" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557144" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:26 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671052" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671052" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:25 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 21/42] x86/mm: Teach pte_mkwrite() about stack memory Date: Mon, 12 Jun 2023 17:10:47 -0700 Message-Id: <20230613001108.3040476-22-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So when it is made writable with pte_mkwrite(), it should create shadow stack memory, not conventionally writable memory. Now that all the places where shadow stack memory might be created pass a VMA into pte_mkwrite(), it can know when it should do this. So make pte_mkwrite() create shadow stack memory when the VMA has the VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite(). This requires referencing VM_SHADOW_STACK in these functions, which are currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is located) can't be pulled in without causing problems for files that reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where they can safely reference VM_SHADOW_STACK. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Acked-by: Deepak Gupta Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/mm/pgtable.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 101e721d74aa..c4b222d3b1b4 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -883,6 +883,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) =20 pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) { + if (vma->vm_flags & VM_SHADOW_STACK) + return pte_mkwrite_shstk(pte); + pte =3D pte_mkwrite_novma(pte); =20 return pte_clear_saveddirty(pte); @@ -890,6 +893,9 @@ pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) =20 pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { + if (vma->vm_flags & VM_SHADOW_STACK) + return pmd_mkwrite_shstk(pmd); + pmd =3D pmd_mkwrite_novma(pmd); =20 return pmd_clear_saveddirty(pmd); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 156D6C7EE2E for ; Tue, 13 Jun 2023 00:16:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239291AbjFMAP6 (ORCPT ); Mon, 12 Jun 2023 20:15:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238630AbjFMAPO (ORCPT ); Mon, 12 Jun 2023 20:15:14 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 236E81FE5; Mon, 12 Jun 2023 17:12:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615179; x=1718151179; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nq8FP2aUY00edE8aipvtg7A+Bog2Zlh3+3MlR+vKuX8=; b=VUA03PHEsKf7BGSuEYVz5E8l/PobAViMN6h/hju3XYNwYiRTtNZCV5AC x3LNNCJ13NUScqLV9UND5F2QFHvBf2/0O77HQrsBq7NF83s1B7jzlviKA G7yIGHLE/rXXSvj46HeXcd0JTRv8xmdNR8Rlq3md4QHaj3n0aFnzAzm+Y tgBDIPCs45SkOkTewLEkJHWRKWJU1nAY746E6y2ff/x46bFujzHEknfoh KdVA4n3jA1bkInpanNHtTZodo6mB1SrouubvfrLEArDgRgSMMpzX7MRAt aWp3omlLb1hLmZEf+7Eg7Qu4SfPLZP4rvp0FYOxggwz1OQp8jozOGkCJu g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557169" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557169" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:27 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671056" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671056" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:26 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 22/42] mm: Don't allow write GUPs to shadow stack memory Date: Mon, 12 Jun 2023 17:10:48 -0700 Message-Id: <20230613001108.3040476-23-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. In userspace, shadow stack memory is writable only in very specific, controlled ways. However, since userspace can, even in the limited ways, modify shadow stack contents, the kernel treats it as writable memory. As a result, without additional work there would remain many ways for userspace to trigger the kernel to write arbitrary data to shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To help userspace protect their shadow stacks, make this a little less exposed by blocking writable get_user_pages() operations for shadow stack VMAs. Still allow FOLL_FORCE to write through shadow stack protections, as it does for read-only protections. This is required for debugging use cases. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Acked-by: David Hildenbrand Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/pgtable.h | 5 +++++ mm/gup.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 5383f7282f89..fce35f5d4a4e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1630,6 +1630,11 @@ static inline bool __pte_access_permitted(unsigned l= ong pteval, bool write) { unsigned long need_pte_bits =3D _PAGE_PRESENT|_PAGE_USER; =20 + /* + * Write=3D0,Dirty=3D1 PTEs are shadow stack, which the kernel + * shouldn't generally allow access to, but since they + * are already Write=3D0, the below logic covers both cases. + */ if (write) need_pte_bits |=3D _PAGE_RW; =20 diff --git a/mm/gup.c b/mm/gup.c index bbe416236593..cc0dd5267509 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -978,7 +978,7 @@ static int check_vma_flags(struct vm_area_struct *vma, = unsigned long gup_flags) return -EFAULT; =20 if (write) { - if (!(vm_flags & VM_WRITE)) { + if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */ --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7FAEAC88CBB for ; Tue, 13 Jun 2023 00:16:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239356AbjFMAQZ (ORCPT ); Mon, 12 Jun 2023 20:16:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52406 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239231AbjFMAPg (ORCPT ); Mon, 12 Jun 2023 20:15:36 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 116F82681; Mon, 12 Jun 2023 17:13:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615199; x=1718151199; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QMw9iVo5bgNcqwrdr19MOUpW+koohCohKti/401XfIw=; b=lysIj8lU3aMzLbUK29LV7QLkk/p0zxuNXCKzD8t3kBMUIZkY4ydgR0aH ZtYsKNAOcJB1cYIYPCyMDx9Rkz8gjC5EcGt3pvXl+JyaovqGGRBg1KN/m jJ+PCDL0Fkii6uPyoAZWDKJY9vK2Nw4YASS1mlToAscsvao34SyqsjD1F bfFlmRQjmSDu188/t9iXq/6iQtk8Zlv+mDlsCHrRzf8iArLL3+P8xiNen XodN2SvT+g18Xj5rO9Xxufww8VyN3SAJa7xmJXD9VFTXq5aujanviF5RQ 8EHK4jo8eUoyjxqKBIGZQr1R2MJIxGM+agsWwWrFo4ftkRN0D9OX5Q7F+ g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557195" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557195" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:28 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671062" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671062" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:27 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Date: Mon, 12 Jun 2023 17:10:49 -0700 Message-Id: <20230613001108.3040476-24-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Introduce a new document on Control-flow Enforcement Technology (CET). Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Reviewed-by: Szabolcs Nagy Tested-bys coming for arm64 anyway. --- Documentation/arch/x86/index.rst | 1 + Documentation/arch/x86/shstk.rst | 169 +++++++++++++++++++++++++++++++ 2 files changed, 170 insertions(+) create mode 100644 Documentation/arch/x86/shstk.rst diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/inde= x.rst index c73d133fd37c..8ac64d7de4dc 100644 --- a/Documentation/arch/x86/index.rst +++ b/Documentation/arch/x86/index.rst @@ -22,6 +22,7 @@ x86-specific Documentation mtrr pat intel-hfi + shstk iommu intel_txt amd-memory-encryption diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shst= k.rst new file mode 100644 index 000000000000..f09afa504ec0 --- /dev/null +++ b/Documentation/arch/x86/shstk.rst @@ -0,0 +1,169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D +Control-flow Enforcement Technology (CET) Shadow Stack +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D + +CET Background +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Control-flow Enforcement Technology (CET) covers several related x86 proce= ssor +features that provide protection against control flow hijacking attacks. C= ET +can protect both applications and the kernel. + +CET introduces shadow stack and indirect branch tracking (IBT). A shadow s= tack +is a secondary stack allocated from memory which cannot be directly modifi= ed by +applications. When executing a CALL instruction, the processor pushes the +return address to both the normal stack and the shadow stack. Upon +function return, the processor pops the shadow stack copy and compares it +to the normal stack copy. If the two differ, the processor raises a +control-protection fault. IBT verifies indirect CALL/JMP targets are inten= ded +as marked by the compiler with 'ENDBR' opcodes. Not all CPU's have both Sh= adow +Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only users= pace +shadow stack and kernel IBT are supported. + +Requirements to use Shadow Stack +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D + +To use userspace shadow stack you need HW that supports it, a kernel +configured with it and userspace libraries compiled with it. + +The kernel Kconfig option is X86_USER_SHADOW_STACK. When compiled in, sha= dow +stacks can be disabled at runtime with the kernel parameter: nousershstk. + +To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or = later +are required. + +At run time, /proc/cpuinfo shows CET features if the processor supports +CET. "user_shstk" means that userspace shadow stack is supported on the cu= rrent +kernel and HW. + +Application Enabling +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +An application's CET capability is marked in its ELF note and can be verif= ied +from readelf/llvm-readelf output:: + + readelf -n | grep -a SHSTK + properties: x86 feature: SHSTK + +The kernel does not process these applications markers directly. Applicati= ons +or loaders must enable CET features using the interface described in secti= on 4. +Typically this would be done in dynamic loader or static runtime objects, = as is +the case in GLIBC. + +Enabling arch_prctl()'s +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Elf features should be enabled by the loader using the below arch_prctl's.= They +are only supported in 64 bit user applications. These operate on the featu= res +on a per-thread basis. The enablement status is inherited on clone, so if = the +feature is enabled on the first thread, it will propagate to all the threa= d's +in an app. + +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature) + Enable a single feature specified in 'feature'. Can only operate on + one feature at a time. + +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature) + Disable a single feature specified in 'feature'. Can only operate on + one feature at a time. + +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features) + Lock in features at their current enabled or disabled status. 'feature= s' + is a mask of all features to lock. All bits set are processed, unset b= its + are ignored. The mask is ORed with the existing value. So any feature = bits + set here cannot be enabled or disabled afterwards. + +The return values are as follows. On success, return 0. On error, errno can +be:: + + -EPERM if any of the passed feature are locked. + -ENOTSUPP if the feature is not supported by the hardware or + kernel. + -EINVAL arguments (non existing feature, etc) + +The feature's bits supported are:: + + ARCH_SHSTK_SHSTK - Shadow stack + ARCH_SHSTK_WRSS - WRSS + +Currently shadow stack and WRSS are supported via this interface. WRSS +can only be enabled with shadow stack, and is automatically disabled +if shadow stack is disabled. + +Proc Status +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +To check if an application is actually running with shadow stack, the +user can read the /proc/$PID/status. It will report "wrss" or "shstk" +depending on what is enabled. The lines look like this:: + + x86_Thread_features: shstk wrss + x86_Thread_features_locked: shstk wrss + +Implementation of the Shadow Stack +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Shadow Stack Size +----------------- + +A task's shadow stack is allocated from memory to a fixed size of +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to +the maximum size of the normal stack, but capped to 4 GB. In the case +of the clone3 syscall, there is a stack size passed in and shadow stack +uses this instead of the rlimit. + +Signal +------ + +The main program and its signal handlers use the same shadow stack. Because +the shadow stack stores only return addresses, a large shadow stack covers +the condition that both the program stack and the signal alternate stack r= un +out. + +When a signal happens, the old pre-signal state is pushed on the stack. Wh= en +shadow stack is enabled, the shadow stack specific state is pushed onto the +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed +in a special format with bit 63 set. On sigreturn this old SSP token is +verified and restored by the kernel. The kernel will also push the normal +restorer address to the shadow stack to help userspace avoid a shadow stack +violation on the sigreturn path that goes through the restorer. + +So the shadow stack signal frame format is as follows:: + + |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format + (bit 63 set to 1) + | ...| - Other state may be added in the future + + +32 bit ABI signals are not supported in shadow stack processes. Linux prev= ents +32 bit execution while shadow stack is enabled by the allocating shadow st= acks +outside of the 32 bit address space. When execution enters 32 bit mode, ei= ther +via far call or returning to userspace, a #GP is generated by the hardware +which, will be delivered to the process as a segfault. When transitioning = to +userspace the register's state will be as if the userspace ip being return= ed to +caused the segfault. + +Fork +---- + +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a +shadow access triggers a page fault with the shadow stack access bit set +in the page fault error code. + +When a task forks a child, its shadow stack PTEs are copied and both the +parent's and the child's shadow stack PTEs are cleared of the dirty bit. +Upon the next shadow stack access, the resulting shadow stack page fault +is handled by page copy/re-use. + +When a pthread child is created, the kernel allocates a new shadow stack +for the new thread. New shadow stack creation behaves like mmap() with res= pect +to ASLR behavior. Similarly, on thread exit the thread's shadow stack is +disabled. + +Exec +---- + +On exec, shadow stack features are disabled by the kernel. At which point, +userspace can choose to re-enable, or lock them. --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1A10FC7EE2E for ; Tue, 13 Jun 2023 00:16:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238926AbjFMAQ3 (ORCPT ); Mon, 12 Jun 2023 20:16:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53358 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238733AbjFMAPj (ORCPT ); Mon, 12 Jun 2023 20:15:39 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 46A6B2686; Mon, 12 Jun 2023 17:13:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615199; x=1718151199; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=68vHo4+oSqhV8kF+lXavlD4FeK2AToHr3tcBPqqVDm8=; b=NDEwsYYHx40Ep9OMIk9W1/iP7dJnke4QePI7CdyeGzxMV5ajNdl62T88 Y6TCFnKkVf8pkOeKQLhie1xRNENYtsGWviupO47phzUnvwVzSs81yICvs XpDhafxbU9NMXvQMM+LbtiuE7QJlr57p2t0RpcuGaa8TVjkOzFXqIw/Of qcsSWAQC5kJUHCyBJan0ySMjJAY7XPsZlz2iyrQTbMw56qiLGKGKoNnW7 3Ua13iIKiCiBaWVyBK+IxMC4kBBA5hivZkLj5DXgXKBXOSX3EuP/baD5p KIVHX+EBQbi4uzaMbav0H4BXGgnSqnURIkoh2KdGaOFmULaf25C/bclcm Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557219" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557219" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671067" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671067" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:28 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 24/42] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Date: Mon, 12 Jun 2023 17:10:50 -0700 Message-Id: <20230613001108.3040476-25-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Shadow stack register state can be managed with XSAVE. The registers can logically be separated into two groups: * Registers controlling user-mode operation * Registers controlling kernel-mode operation The architecture has two new XSAVE state components: one for each group of those groups of registers. This lets an OS manage them separately if it chooses. Future patches for host userspace and KVM guests will only utilize the user-mode registers, so only configure XSAVE to save user-mode registers. This state will add 16 bytes to the xsave buffer size. Future patches will use the user-mode XSAVE area to save guest user-mode CET state. However, VMCS includes new fields for guest CET supervisor states. KVM can use these to save and restore guest supervisor state, so host supervisor XSAVE support is not required. Adding this exacerbates the already unwieldy if statement in check_xstate_against_struct() that handles warning about unimplemented xfeatures. So refactor these check's by having XCHECK_SZ() set a bool when it actually check's the xfeature. This ends up exceeding 80 chars, but was better on balance than other options explored. Pass the bool as pointer to make it clear that XCHECK_SZ() can change the variable. While configuring user-mode XSAVE, clarify kernel-mode registers are not managed by XSAVE by defining the xfeature in XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT. This serves more of a documentation as code purpose, and functionally, only enables a few safety checks. Both XSAVE state components are supervisor states, even the state controlling user-mode operation. This is a departure from earlier features like protection keys where the PKRU state is a normal user (non-supervisor) state. Having the user state be supervisor-managed ensures there is no direct, unprivileged access to it, making it harder for an attacker to subvert CET. To facilitate this privileged access, define the two user-mode CET MSRs, and the bits defined in those MSRs relevant to future shadow stack enablement patches. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/fpu/types.h | 16 +++++- arch/x86/include/asm/fpu/xstate.h | 6 ++- arch/x86/kernel/fpu/xstate.c | 90 +++++++++++++++---------------- 3 files changed, 61 insertions(+), 51 deletions(-) diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/ty= pes.h index 7f6d858ff47a..eb810074f1e7 100644 --- a/arch/x86/include/asm/fpu/types.h +++ b/arch/x86/include/asm/fpu/types.h @@ -115,8 +115,8 @@ enum xfeature { XFEATURE_PT_UNIMPLEMENTED_SO_FAR, XFEATURE_PKRU, XFEATURE_PASID, - XFEATURE_RSRVD_COMP_11, - XFEATURE_RSRVD_COMP_12, + XFEATURE_CET_USER, + XFEATURE_CET_KERNEL_UNUSED, XFEATURE_RSRVD_COMP_13, XFEATURE_RSRVD_COMP_14, XFEATURE_LBR, @@ -138,6 +138,8 @@ enum xfeature { #define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR) #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU) #define XFEATURE_MASK_PASID (1 << XFEATURE_PASID) +#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER) +#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED) #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR) #define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG) #define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA) @@ -252,6 +254,16 @@ struct pkru_state { u32 pad; } __packed; =20 +/* + * State component 11 is Control-flow Enforcement user states + */ +struct cet_user_state { + /* user control-flow settings */ + u64 user_cet; + /* user shadow stack pointer */ + u64 user_ssp; +}; + /* * State component 15: Architectural LBR configuration state. * The size of Arch LBR state depends on the number of LBRs (lbr_depth). diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/x= state.h index cd3dd170e23a..d4427b88ee12 100644 --- a/arch/x86/include/asm/fpu/xstate.h +++ b/arch/x86/include/asm/fpu/xstate.h @@ -50,7 +50,8 @@ #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA =20 /* All currently supported supervisor features */ -#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID) +#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \ + XFEATURE_MASK_CET_USER) =20 /* * A supervisor state component may not always contain valuable informatio= n, @@ -77,7 +78,8 @@ * Unsupported supervisor features. When a supervisor feature in this mask= is * supported in the future, move it to the supported supervisor feature ma= sk. */ -#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT) +#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \ + XFEATURE_MASK_CET_KERNEL) =20 /* All supervisor states including supported and unsupported states. */ #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED |= \ diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 0bab497c9436..4fa4751912d9 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -39,26 +39,26 @@ */ static const char *xfeature_names[] =3D { - "x87 floating point registers" , - "SSE registers" , - "AVX registers" , - "MPX bounds registers" , - "MPX CSR" , - "AVX-512 opmask" , - "AVX-512 Hi256" , - "AVX-512 ZMM_Hi256" , - "Processor Trace (unused)" , + "x87 floating point registers", + "SSE registers", + "AVX registers", + "MPX bounds registers", + "MPX CSR", + "AVX-512 opmask", + "AVX-512 Hi256", + "AVX-512 ZMM_Hi256", + "Processor Trace (unused)", "Protection Keys User registers", "PASID state", - "unknown xstate feature" , - "unknown xstate feature" , - "unknown xstate feature" , - "unknown xstate feature" , - "unknown xstate feature" , - "unknown xstate feature" , - "AMX Tile config" , - "AMX Tile data" , - "unknown xstate feature" , + "Control-flow User registers", + "Control-flow Kernel registers (unused)", + "unknown xstate feature", + "unknown xstate feature", + "unknown xstate feature", + "unknown xstate feature", + "AMX Tile config", + "AMX Tile data", + "unknown xstate feature", }; =20 static unsigned short xsave_cpuid_features[] __initdata =3D { @@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = =3D { [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] =3D X86_FEATURE_INTEL_PT, [XFEATURE_PKRU] =3D X86_FEATURE_PKU, [XFEATURE_PASID] =3D X86_FEATURE_ENQCMD, + [XFEATURE_CET_USER] =3D X86_FEATURE_SHSTK, [XFEATURE_XTILE_CFG] =3D X86_FEATURE_AMX_TILE, [XFEATURE_XTILE_DATA] =3D X86_FEATURE_AMX_TILE, }; @@ -276,6 +277,7 @@ static void __init print_xstate_features(void) print_xstate_feature(XFEATURE_MASK_Hi16_ZMM); print_xstate_feature(XFEATURE_MASK_PKRU); print_xstate_feature(XFEATURE_MASK_PASID); + print_xstate_feature(XFEATURE_MASK_CET_USER); print_xstate_feature(XFEATURE_MASK_XTILE_CFG); print_xstate_feature(XFEATURE_MASK_XTILE_DATA); } @@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state= *xstate) XFEATURE_MASK_BNDREGS | \ XFEATURE_MASK_BNDCSR | \ XFEATURE_MASK_PASID | \ + XFEATURE_MASK_CET_USER | \ XFEATURE_MASK_XTILE) =20 /* @@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void) } \ } while (0) =20 -#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \ - if ((nr =3D=3D nr_macro) && \ - WARN_ONCE(sz !=3D sizeof(__struct), \ - "%s: struct is %zu bytes, cpu state %d bytes\n", \ - __stringify(nr_macro), sizeof(__struct), sz)) { \ +#define XCHECK_SZ(sz, nr, __struct) ({ \ + if (WARN_ONCE(sz !=3D sizeof(__struct), \ + "[%s]: struct is %zu bytes, cpu state %d bytes\n", \ + xfeature_names[nr], sizeof(__struct), sz)) { \ __xstate_dump_leaves(); \ } \ -} while (0) + true; \ +}) + =20 /** * check_xtile_data_against_struct - Check tile data state size. @@ -527,36 +531,28 @@ static bool __init check_xstate_against_struct(int nr) * Ask the CPU for the size of the state. */ int sz =3D xfeature_size(nr); + /* * Match each CPU state with the corresponding software * structure. */ - XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct); - XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct mpx_bndreg_state); - XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct mpx_bndcsr_state); - XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state); - XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state); - XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state); - XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state); - XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state); - XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg); - - /* The tile data size varies between implementations. */ - if (nr =3D=3D XFEATURE_XTILE_DATA) - check_xtile_data_against_struct(sz); - - /* - * Make *SURE* to add any feature numbers in below if - * there are "holes" in the xsave state component - * numbers. - */ - if ((nr < XFEATURE_YMM) || - (nr >=3D XFEATURE_MAX) || - (nr =3D=3D XFEATURE_PT_UNIMPLEMENTED_SO_FAR) || - ((nr >=3D XFEATURE_RSRVD_COMP_11) && (nr <=3D XFEATURE_RSRVD_COMP_16)= )) { + switch (nr) { + case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct); + case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state= ); + case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state); + case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_st= ate); + case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_upp= ers_state); + case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_st= ate); + case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state); + case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state); + case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg); + case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state); + case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return tru= e; + default: XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr); return false; } + return true; } =20 --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 42A81C7EE43 for ; Tue, 13 Jun 2023 00:16:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239381AbjFMAQc (ORCPT ); Mon, 12 Jun 2023 20:16:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52416 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238914AbjFMAPq (ORCPT ); Mon, 12 Jun 2023 20:15:46 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E03EC269D; Mon, 12 Jun 2023 17:13:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615202; x=1718151202; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=L2W4szzCW+qvvywxGcp6uHsWCFDTi4NqV9MxxSUxhGY=; b=m2hAgSCIvSKQe7x+o4sP0P0iMTZ8eLiENIOoIE8RAIkvuslNZ2qYuvTF +yfZfxnyqMLDiH4LSV3hezfYsQoTKxUx6OVJEpST/sv1yMeeZUSiGbz2L liCpQVq4v6XmO4n91rEpMHUONW3w2v2cBAJlmAzK3VGeUU/a6TL4wB5gb ug+NDKdeVH6SdRMYAyMthq34o7fifKHJDOZaF/lN1h3yBiDbNYIu+g4mV MoBhjFnDP3SimoTpcO8FCtbBqqcFiK4hR+I4jDd38dlxZKTPERcvx7Zz2 J6wII4D/38ew+5WTtadegZ8G7XCiuYOj7sByjMp2SlLrqP80BdLwTT8K+ g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557243" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557243" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671072" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671072" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:28 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 25/42] x86/fpu: Add helper for modifying xstate Date: Mon, 12 Jun 2023 17:10:51 -0700 Message-Id: <20230613001108.3040476-26-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Just like user xfeatures, supervisor xfeatures can be active in the registers or present in the task FPU buffer. If the registers are active, the registers can be modified directly. If the registers are not active, the modification must be performed on the task FPU buffer. When the state is not active, the kernel could perform modifications directly to the buffer. But in order for it to do that, it needs to know where in the buffer the specific state it wants to modify is located. Doing this is not robust against optimizations that compact the FPU buffer, as each access would require computing where in the buffer it is. The easiest way to modify supervisor xfeature data is to force restore the registers and write directly to the MSRs. Often times this is just fine anyway as the registers need to be restored before returning to userspace. Do this for now, leaving buffer writing optimizations for the future. Add a new function fpregs_lock_and_load() that can simultaneously call fpregs_lock() and do this restore. Also perform some extra sanity checks in this function since this will be used in non-fpu focused code. Suggested-by: Thomas Gleixner Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/fpu/api.h | 9 +++++++++ arch/x86/kernel/fpu/core.c | 18 ++++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 503a577814b2..aadc6893dcaa 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -82,6 +82,15 @@ static inline void fpregs_unlock(void) preempt_enable(); } =20 +/* + * FPU state gets lazily restored before returning to userspace. So when i= n the + * kernel, the valid FPU state may be kept in the buffer. This function wi= ll force + * restore all the fpu state to the registers early if needed, and lock th= em from + * being automatically saved/restored. Then FPU state can be modified safe= ly in the + * registers, before unlocking with fpregs_unlock(). + */ +void fpregs_lock_and_load(void); + #ifdef CONFIG_X86_DEBUG_FPU extern void fpregs_assert_state_consistent(void); #else diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index caf33486dc5e..f851558b673f 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -753,6 +753,24 @@ void switch_fpu_return(void) } EXPORT_SYMBOL_GPL(switch_fpu_return); =20 +void fpregs_lock_and_load(void) +{ + /* + * fpregs_lock() only disables preemption (mostly). So modifying state + * in an interrupt could screw up some in progress fpregs operation. + * Warn about it. + */ + WARN_ON_ONCE(!irq_fpu_usable()); + WARN_ON_ONCE(current->flags & PF_KTHREAD); + + fpregs_lock(); + + fpregs_assert_state_consistent(); + + if (test_thread_flag(TIF_NEED_FPU_LOAD)) + fpregs_restore_userregs(); +} + #ifdef CONFIG_X86_DEBUG_FPU /* * If current FPU state according to its tracking (loaded FPU context on t= his --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D467AC88CBA for ; Tue, 13 Jun 2023 00:17:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239162AbjFMARA (ORCPT ); Mon, 12 Jun 2023 20:17:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52268 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238840AbjFMAQQ (ORCPT ); Mon, 12 Jun 2023 20:16:16 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 037C0173F; Mon, 12 Jun 2023 17:13:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615209; x=1718151209; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=FpsnqDmskVtuVWeAVhUxFxlMXuwPJqlBrZTKKFCDs5Y=; b=R6xK0VOk/CsKnPanl4rMbdEeh3v8uOxWqtL/Ssrl1fxL6P8qC444KXCo VO64BplBrjLIBbY2uer5dHGOecGDsRrFpOW+bT+yJSO2vxXhNgguXfq1Z SIezcfqlI2nGtaNj4d3gQr/GmtxzNrMUi9wtkiuwneTP2OiTRMSdqnEQa EY5u6mpQqNAZeUoO5OqG4x/BRfsRI2CMSqROi+7oaz/xZJggAXi1A5SOJ laxjpkUHP0jsdKehMuVS27OzGthsbrEJwxmqLZL5hsPBQDytMkOXCXRAY qWGhnEJ2DIBg5mQxKOHNv7x0/jzWZeyJt4t8MR5as9TW4av/hNLpi3K0m w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557269" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557269" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:30 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671078" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671078" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:29 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 26/42] x86: Introduce userspace API for shadow stack Date: Mon, 12 Jun 2023 17:10:52 -0700 Message-Id: <20230613001108.3040476-27-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Add three new arch_prctl() handles: - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified feature. Returns 0 on success or a negative value on error. - ARCH_SHSTK_LOCK prevents future disabling or enabling of the specified feature. Returns 0 on success or a negative value on error. The features are handled per-thread and inherited over fork(2)/clone(2), but reset on exec(). Co-developed-by: Kirill A. Shutemov Signed-off-by: Kirill A. Shutemov Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/processor.h | 6 +++++ arch/x86/include/asm/shstk.h | 21 +++++++++++++++ arch/x86/include/uapi/asm/prctl.h | 6 +++++ arch/x86/kernel/Makefile | 2 ++ arch/x86/kernel/process_64.c | 6 +++++ arch/x86/kernel/shstk.c | 44 +++++++++++++++++++++++++++++++ 6 files changed, 85 insertions(+) create mode 100644 arch/x86/include/asm/shstk.h create mode 100644 arch/x86/kernel/shstk.c diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/proces= sor.h index a1e4fa58b357..407d5551b6a7 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -28,6 +28,7 @@ struct vm86; #include #include #include +#include =20 #include #include @@ -475,6 +476,11 @@ struct thread_struct { */ u32 pkru; =20 +#ifdef CONFIG_X86_USER_SHADOW_STACK + unsigned long features; + unsigned long features_locked; +#endif + /* Floating point and extended processor state */ struct fpu fpu; /* diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h new file mode 100644 index 000000000000..ec753809f074 --- /dev/null +++ b/arch/x86/include/asm/shstk.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_SHSTK_H +#define _ASM_X86_SHSTK_H + +#ifndef __ASSEMBLY__ +#include + +struct task_struct; + +#ifdef CONFIG_X86_USER_SHADOW_STACK +long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res); +void reset_thread_features(void); +#else +static inline long shstk_prctl(struct task_struct *task, int option, + unsigned long arg2) { return -EINVAL; } +static inline void reset_thread_features(void) {} +#endif /* CONFIG_X86_USER_SHADOW_STACK */ + +#endif /* __ASSEMBLY__ */ + +#endif /* _ASM_X86_SHSTK_H */ diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/= prctl.h index e8d7ebbca1a4..1cd44ecc9ce0 100644 --- a/arch/x86/include/uapi/asm/prctl.h +++ b/arch/x86/include/uapi/asm/prctl.h @@ -23,9 +23,15 @@ #define ARCH_MAP_VDSO_32 0x2002 #define ARCH_MAP_VDSO_64 0x2003 =20 +/* Don't use 0x3001-0x3004 because of old glibcs */ + #define ARCH_GET_UNTAG_MASK 0x4001 #define ARCH_ENABLE_TAGGED_ADDR 0x4002 #define ARCH_GET_MAX_TAG_BITS 0x4003 #define ARCH_FORCE_TAGGED_SVA 0x4004 =20 +#define ARCH_SHSTK_ENABLE 0x5001 +#define ARCH_SHSTK_DISABLE 0x5002 +#define ARCH_SHSTK_LOCK 0x5003 + #endif /* _ASM_X86_PRCTL_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index abee0564b750..6b6bf47652ee 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -147,6 +147,8 @@ obj-$(CONFIG_CALL_THUNKS) +=3D callthunks.o =20 obj-$(CONFIG_X86_CET) +=3D cet.o =20 +obj-$(CONFIG_X86_USER_SHADOW_STACK) +=3D shstk.o + ### # 64 bit specific files ifeq ($(CONFIG_X86_64),y) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 3d181c16a2f6..0f89aa0186d1 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -515,6 +515,8 @@ start_thread_common(struct pt_regs *regs, unsigned long= new_ip, load_gs_index(__USER_DS); } =20 + reset_thread_features(); + loadsegment(fs, 0); loadsegment(es, _ds); loadsegment(ds, _ds); @@ -894,6 +896,10 @@ long do_arch_prctl_64(struct task_struct *task, int op= tion, unsigned long arg2) else return put_user(LAM_U57_BITS, (unsigned long __user *)arg2); #endif + case ARCH_SHSTK_ENABLE: + case ARCH_SHSTK_DISABLE: + case ARCH_SHSTK_LOCK: + return shstk_prctl(task, option, arg2); default: ret =3D -EINVAL; break; diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c new file mode 100644 index 000000000000..41ed6552e0a5 --- /dev/null +++ b/arch/x86/kernel/shstk.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * shstk.c - Intel shadow stack support + * + * Copyright (c) 2021, Intel Corporation. + * Yu-cheng Yu + */ + +#include +#include +#include + +void reset_thread_features(void) +{ + current->thread.features =3D 0; + current->thread.features_locked =3D 0; +} + +long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res) +{ + if (option =3D=3D ARCH_SHSTK_LOCK) { + task->thread.features_locked |=3D features; + return 0; + } + + /* Don't allow via ptrace */ + if (task !=3D current) + return -EINVAL; + + /* Do not allow to change locked features */ + if (features & task->thread.features_locked) + return -EPERM; + + /* Only support enabling/disabling one feature at a time. */ + if (hweight_long(features) > 1) + return -EINVAL; + + if (option =3D=3D ARCH_SHSTK_DISABLE) { + return -EINVAL; + } + + /* Handle ARCH_SHSTK_ENABLE */ + return -EINVAL; +} --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64952C88CB8 for ; Tue, 13 Jun 2023 00:17:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239437AbjFMARD (ORCPT ); Mon, 12 Jun 2023 20:17:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52288 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239282AbjFMAQR (ORCPT ); Mon, 12 Jun 2023 20:16:17 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C7C9F2957; Mon, 12 Jun 2023 17:13:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615210; x=1718151210; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=WchKfqQNnYFXXNMxX3TqxAMzB4S9WkAXS/aRm3qIr7o=; b=Al0xAI+PUFsEVN9pTNbLZHllXr+EHXB/PJSOsmYJLEvdTGYC3TbIlpAW 2hsIMbn0kst0HAZL4+s3OS1bnAkD08OMjMzrFTbBc6dkIcETtFXA9Hkqc 2Zsu7jSgXvRuV1tBiUuv0bFUv2tfniDxMQjajOmnIy5fWHvEwctHXRF7g xcpBMZ9TUByTRHVWZtn3iMZX4qKTSrbUXMHRZXNvDmNPNHhvMUjWI/ri4 6QRLgNI0BpLF2v93dkuQbSctgOzmpYPtLDa+1WlB6OTOQWCtT8ylfHIqh hPygRV21tSvBgQUHpEJLxgaF5oshbpcwUqxoaA/h8jHvFsIXvU9RANpoL A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557292" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557292" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:31 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671087" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671087" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:30 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 27/42] x86/shstk: Add user control-protection fault handler Date: Mon, 12 Jun 2023 17:10:53 -0700 Message-Id: <20230613001108.3040476-28-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" A control-protection fault is triggered when a control-flow transfer attempt violates Shadow Stack or Indirect Branch Tracking constraints. For example, the return address for a RET instruction differs from the copy on the shadow stack. There already exists a control-protection fault handler for handling kernel IBT faults. Refactor this fault handler into separate user and kernel handlers, like the page fault handler. Add a control-protection handler for usermode. To avoid ifdeffery, put them both in a new file cet.c, which is compiled in the case of either of the two CET features supported in the kernel: kernel IBT or user mode shadow stack. Move some static inline functions from traps.c into a header so they can be used in cet.c. Opportunistically fix a comment in the kernel IBT part of the fault handler that is on the end of the line instead of preceding it. Keep the same behavior for the kernel side of the fault handler, except for converting a BUG to a WARN in the case of a #CP happening when the feature is missing. This unifies the behavior with the new shadow stack code, and also prevents the kernel from crashing under this situation which is potentially recoverable. The control-protection fault handler works in a similar way as the general protection fault handler. It provides the si_code SEGV_CPERR to the signal handler. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/arm/kernel/signal.c | 2 +- arch/arm64/kernel/signal.c | 2 +- arch/arm64/kernel/signal32.c | 2 +- arch/sparc/kernel/signal32.c | 2 +- arch/sparc/kernel/signal_64.c | 2 +- arch/x86/include/asm/disabled-features.h | 8 +- arch/x86/include/asm/idtentry.h | 2 +- arch/x86/include/asm/traps.h | 12 +++ arch/x86/kernel/cet.c | 94 +++++++++++++++++++++--- arch/x86/kernel/idt.c | 2 +- arch/x86/kernel/signal_32.c | 2 +- arch/x86/kernel/signal_64.c | 2 +- arch/x86/kernel/traps.c | 12 --- arch/x86/xen/enlighten_pv.c | 2 +- arch/x86/xen/xen-asm.S | 2 +- include/uapi/asm-generic/siginfo.h | 3 +- 16 files changed, 117 insertions(+), 34 deletions(-) diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index e07f359254c3..9a3c9de5ac5e 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 2cfc810d0a5b..06d31731c8ed 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -1343,7 +1343,7 @@ void __init minsigstksz_setup(void) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c index 4700f8522d27..bbd542704730 100644 --- a/arch/arm64/kernel/signal32.c +++ b/arch/arm64/kernel/signal32.c @@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c index dad38960d1a8..82da8a2d769d 100644 --- a/arch/sparc/kernel/signal32.c +++ b/arch/sparc/kernel/signal32.c @@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_oss= ptr, unsigned long sp) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c index 570e43e6fda5..b4e410976e0d 100644 --- a/arch/sparc/kernel/signal_64.c +++ b/arch/sparc/kernel/signal_64.c @@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned lo= ng orig_i0, unsigned long */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/as= m/disabled-features.h index b9c7eae2e70f..702d93fdd10e 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -111,6 +111,12 @@ #define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31)) #endif =20 +#ifdef CONFIG_X86_KERNEL_IBT +#define DISABLE_IBT 0 +#else +#define DISABLE_IBT (1 << (X86_FEATURE_IBT & 31)) +#endif + /* * Make sure to add features to the correct mask */ @@ -134,7 +140,7 @@ #define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UM= IP| \ DISABLE_ENQCMD) #define DISABLED_MASK17 0 -#define DISABLED_MASK18 0 +#define DISABLED_MASK18 (DISABLE_IBT) #define DISABLED_MASK19 0 #define DISABLED_MASK20 0 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS !=3D 21) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentr= y.h index b241af4ce9b4..61e0e6301f09 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -614,7 +614,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF, xenpv_exc_d= ouble_fault); #endif =20 /* #CP */ -#ifdef CONFIG_X86_KERNEL_IBT +#ifdef CONFIG_X86_CET DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection); #endif =20 diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 47ecfff2c83d..75e0dabf0c45 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *re= gs, struct stack_info *info); #endif =20 +static inline void cond_local_irq_enable(struct pt_regs *regs) +{ + if (regs->flags & X86_EFLAGS_IF) + local_irq_enable(); +} + +static inline void cond_local_irq_disable(struct pt_regs *regs) +{ + if (regs->flags & X86_EFLAGS_IF) + local_irq_disable(); +} + #endif /* _ASM_X86_TRAPS_H */ diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c index 7ad22b705b64..cc10d8be9d74 100644 --- a/arch/x86/kernel/cet.c +++ b/arch/x86/kernel/cet.c @@ -4,10 +4,6 @@ #include #include =20 -static __ro_after_init bool ibt_fatal =3D true; - -extern void ibt_selftest_ip(void); /* code label defined in asm below */ - enum cp_error_code { CP_EC =3D (1 << 15) - 1, =20 @@ -20,15 +16,80 @@ enum cp_error_code { CP_ENCL =3D 1 << 15, }; =20 -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection) +static const char cp_err[][10] =3D { + [0] =3D "unknown", + [1] =3D "near ret", + [2] =3D "far/iret", + [3] =3D "endbranch", + [4] =3D "rstorssp", + [5] =3D "setssbsy", +}; + +static const char *cp_err_string(unsigned long error_code) { - if (!cpu_feature_enabled(X86_FEATURE_IBT)) { - pr_err("Unexpected #CP\n"); - BUG(); + unsigned int cpec =3D error_code & CP_EC; + + if (cpec >=3D ARRAY_SIZE(cp_err)) + cpec =3D 0; + return cp_err[cpec]; +} + +static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_cod= e) +{ + WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n", + user_mode(regs) ? "user mode" : "kernel mode", + cp_err_string(error_code)); +} + +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + +static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_cod= e) +{ + struct task_struct *tsk; + unsigned long ssp; + + /* + * An exception was just taken from userspace. Since interrupts are disab= led + * here, no scheduling should have messed with the registers yet and they + * will be whatever is live in userspace. So read the SSP before enabling + * interrupts so locking the fpregs to do it later is not required. + */ + rdmsrl(MSR_IA32_PL3_SSP, ssp); + + cond_local_irq_enable(regs); + + tsk =3D current; + tsk->thread.error_code =3D error_code; + tsk->thread.trap_nr =3D X86_TRAP_CP; + + /* Ratelimit to prevent log spamming. */ + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) && + __ratelimit(&cpf_rate)) { + pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%= s", + tsk->comm, task_pid_nr(tsk), + regs->ip, regs->sp, ssp, error_code, + cp_err_string(error_code), + error_code & CP_ENCL ? " in enclave" : ""); + print_vma_addr(KERN_CONT " in ", regs->ip); + pr_cont("\n"); } =20 - if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) !=3D CP_ENDBR)) + force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0); + cond_local_irq_disable(regs); +} + +static __ro_after_init bool ibt_fatal =3D true; + +/* code label defined in asm below */ +extern void ibt_selftest_ip(void); + +static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_c= ode) +{ + if ((error_code & CP_EC) !=3D CP_ENDBR) { + do_unexpected_cp(regs, error_code); return; + } =20 if (unlikely(regs->ip =3D=3D (unsigned long)&ibt_selftest_ip)) { regs->ax =3D 0; @@ -74,3 +135,18 @@ static int __init ibt_setup(char *str) } =20 __setup("ibt=3D", ibt_setup); + +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection) +{ + if (user_mode(regs)) { + if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + do_user_cp_fault(regs, error_code); + else + do_unexpected_cp(regs, error_code); + } else { + if (cpu_feature_enabled(X86_FEATURE_IBT)) + do_kernel_cp_fault(regs, error_code); + else + do_unexpected_cp(regs, error_code); + } +} diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index a58c6bc1cd68..5074b8420359 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] =3D= { ISTG(X86_TRAP_MC, asm_exc_machine_check, IST_INDEX_MCE), #endif =20 -#ifdef CONFIG_X86_KERNEL_IBT +#ifdef CONFIG_X86_CET INTG(X86_TRAP_CP, asm_exc_control_protection), #endif =20 diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c index 9027fc088f97..c12624bc82a3 100644 --- a/arch/x86/kernel/signal_32.c +++ b/arch/x86/kernel/signal_32.c @@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt= _regs *regs) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index 13a1e6083837..0e808c72bf7e 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, stru= ct k_sigaction *oact) */ static_assert(NSIGILL =3D=3D 11); static_assert(NSIGFPE =3D=3D 15); -static_assert(NSIGSEGV =3D=3D 9); +static_assert(NSIGSEGV =3D=3D 10); static_assert(NSIGBUS =3D=3D 5); static_assert(NSIGTRAP =3D=3D 6); static_assert(NSIGCHLD =3D=3D 6); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 6f666dfa97de..f358350624b2 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -77,18 +77,6 @@ =20 DECLARE_BITMAP(system_vectors, NR_VECTORS); =20 -static inline void cond_local_irq_enable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); -} - -static inline void cond_local_irq_disable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_disable(); -} - __always_inline int is_valid_bugaddr(unsigned long addr) { if (addr < TASK_SIZE_MAX) diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index 093b78c8bbec..5a034a994682 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] =3D { TRAP_ENTRY(exc_coprocessor_error, false ), TRAP_ENTRY(exc_alignment_check, false ), TRAP_ENTRY(exc_simd_coprocessor_error, false ), -#ifdef CONFIG_X86_KERNEL_IBT +#ifdef CONFIG_X86_CET TRAP_ENTRY(exc_control_protection, false ), #endif }; diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S index 08f1ceb9eb81..9e5e68008785 100644 --- a/arch/x86/xen/xen-asm.S +++ b/arch/x86/xen/xen-asm.S @@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault xen_pv_trap asm_exc_spurious_interrupt_bug xen_pv_trap asm_exc_coprocessor_error xen_pv_trap asm_exc_alignment_check -#ifdef CONFIG_X86_KERNEL_IBT +#ifdef CONFIG_X86_CET xen_pv_trap asm_exc_control_protection #endif #ifdef CONFIG_X86_MCE diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/= siginfo.h index ffbe4cec9f32..0f52d0ac47c5 100644 --- a/include/uapi/asm-generic/siginfo.h +++ b/include/uapi/asm-generic/siginfo.h @@ -242,7 +242,8 @@ typedef struct siginfo { #define SEGV_ADIPERR 7 /* Precise MCD exception */ #define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */ #define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */ -#define NSIGSEGV 9 +#define SEGV_CPERR 10 /* Control protection fault */ +#define NSIGSEGV 10 =20 /* * SIGBUS si_codes --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37B7EC7EE43 for ; Tue, 13 Jun 2023 00:17:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239446AbjFMARH (ORCPT ); Mon, 12 Jun 2023 20:17:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51902 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239064AbjFMAQS (ORCPT ); Mon, 12 Jun 2023 20:16:18 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 47D632959; Mon, 12 Jun 2023 17:13:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615211; x=1718151211; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=BugAl8HBv2gDjezo8BApRaTr6a0cYuyYBztBA+16s0k=; b=mpCLsY8TSF9hf/sLmIsPKiD8huZmdyRhzYOMig4HbAnedB0CabA+ST70 118AvCIZ518tWZ0k41NXJNYbSyKSHL0BW0Kp7Q6+4kqfSq3xHljhxN0vI 0KR0FbY4cnEG6UHbdalvPFGOIS56tsKhQyJU7ApiHRmtQi0XA/ZujqDif JgAkGZ5srqLg9DgNslzYF6PlvmpknF8RPFb0UCDS2NYUJsJ9+bH9pxHkp Z0D97iVpaqg50+0SHdMoXGApqBV66wuC1jPD7sx4XCO+e+gCyY+QqfmoA eRCQoHy/GeFoZI2ZME67HY21Zgcs0gsni4lUvoWlHNk2whTvzqq1Be7Lm g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557315" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557315" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:32 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671093" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671093" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:31 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support Date: Mon, 12 Jun 2023 17:10:54 -0700 Message-Id: <20230613001108.3040476-29-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Introduce basic shadow stack enabling/disabling/allocation routines. A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag and has a fixed size of min(RLIMIT_STACK, 4GB). Keep the task's shadow stack address and size in thread_struct. This will be copied when cloning new threads, but needs to be cleared during exec, so add a function to do this. 32 bit shadow stack is not expected to have many users and it will complicate the signal implementation. So do not support IA32 emulation or x32. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/processor.h | 2 + arch/x86/include/asm/shstk.h | 7 ++ arch/x86/include/uapi/asm/prctl.h | 3 + arch/x86/kernel/shstk.c | 145 ++++++++++++++++++++++++++++++ 4 files changed, 157 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/proces= sor.h index 407d5551b6a7..2a5ec5750ba7 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -479,6 +479,8 @@ struct thread_struct { #ifdef CONFIG_X86_USER_SHADOW_STACK unsigned long features; unsigned long features_locked; + + struct thread_shstk shstk; #endif =20 /* Floating point and extended processor state */ diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h index ec753809f074..2b1f7c9b9995 100644 --- a/arch/x86/include/asm/shstk.h +++ b/arch/x86/include/asm/shstk.h @@ -8,12 +8,19 @@ struct task_struct; =20 #ifdef CONFIG_X86_USER_SHADOW_STACK +struct thread_shstk { + u64 base; + u64 size; +}; + long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res); void reset_thread_features(void); +void shstk_free(struct task_struct *p); #else static inline long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) { return -EINVAL; } static inline void reset_thread_features(void) {} +static inline void shstk_free(struct task_struct *p) {} #endif /* CONFIG_X86_USER_SHADOW_STACK */ =20 #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/= prctl.h index 1cd44ecc9ce0..6a8e0e1bff4a 100644 --- a/arch/x86/include/uapi/asm/prctl.h +++ b/arch/x86/include/uapi/asm/prctl.h @@ -34,4 +34,7 @@ #define ARCH_SHSTK_DISABLE 0x5002 #define ARCH_SHSTK_LOCK 0x5003 =20 +/* ARCH_SHSTK_ features bits */ +#define ARCH_SHSTK_SHSTK (1ULL << 0) + #endif /* _ASM_X86_PRCTL_H */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 41ed6552e0a5..3cb85224d856 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -8,14 +8,159 @@ =20 #include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #include =20 +static bool features_enabled(unsigned long features) +{ + return current->thread.features & features; +} + +static void features_set(unsigned long features) +{ + current->thread.features |=3D features; +} + +static void features_clr(unsigned long features) +{ + current->thread.features &=3D ~features; +} + +static unsigned long alloc_shstk(unsigned long size) +{ + int flags =3D MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; + struct mm_struct *mm =3D current->mm; + unsigned long addr, unused; + + mmap_write_lock(mm); + addr =3D do_mmap(NULL, addr, size, PROT_READ, flags, + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); + + mmap_write_unlock(mm); + + return addr; +} + +static unsigned long adjust_shstk_size(unsigned long size) +{ + if (size) + return PAGE_ALIGN(size); + + return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G)); +} + +static void unmap_shadow_stack(u64 base, u64 size) +{ + while (1) { + int r; + + r =3D vm_munmap(base, size); + + /* + * vm_munmap() returns -EINTR when mmap_lock is held by + * something else, and that lock should not be held for a + * long time. Retry it for the case. + */ + if (r =3D=3D -EINTR) { + cond_resched(); + continue; + } + + /* + * For all other types of vm_munmap() failure, either the + * system is out of memory or there is bug. + */ + WARN_ON_ONCE(r); + break; + } +} + +static int shstk_setup(void) +{ + struct thread_shstk *shstk =3D ¤t->thread.shstk; + unsigned long addr, size; + + /* Already enabled */ + if (features_enabled(ARCH_SHSTK_SHSTK)) + return 0; + + /* Also not supported for 32 bit and x32 */ + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall()) + return -EOPNOTSUPP; + + size =3D adjust_shstk_size(0); + addr =3D alloc_shstk(size); + if (IS_ERR_VALUE(addr)) + return PTR_ERR((void *)addr); + + fpregs_lock_and_load(); + wrmsrl(MSR_IA32_PL3_SSP, addr + size); + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN); + fpregs_unlock(); + + shstk->base =3D addr; + shstk->size =3D size; + features_set(ARCH_SHSTK_SHSTK); + + return 0; +} + void reset_thread_features(void) { + memset(¤t->thread.shstk, 0, sizeof(struct thread_shstk)); current->thread.features =3D 0; current->thread.features_locked =3D 0; } =20 +void shstk_free(struct task_struct *tsk) +{ + struct thread_shstk *shstk =3D &tsk->thread.shstk; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || + !features_enabled(ARCH_SHSTK_SHSTK)) + return; + + if (!tsk->mm) + return; + + unmap_shadow_stack(shstk->base, shstk->size); +} + +static int shstk_disable(void) +{ + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -EOPNOTSUPP; + + /* Already disabled? */ + if (!features_enabled(ARCH_SHSTK_SHSTK)) + return 0; + + fpregs_lock_and_load(); + /* Disable WRSS too when disabling shadow stack */ + wrmsrl(MSR_IA32_U_CET, 0); + wrmsrl(MSR_IA32_PL3_SSP, 0); + fpregs_unlock(); + + shstk_free(current); + features_clr(ARCH_SHSTK_SHSTK); + + return 0; +} + long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res) { if (option =3D=3D ARCH_SHSTK_LOCK) { --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3B48FC7EE43 for ; Tue, 13 Jun 2023 00:17:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239459AbjFMARL (ORCPT ); Mon, 12 Jun 2023 20:17:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52382 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238500AbjFMAQT (ORCPT ); Mon, 12 Jun 2023 20:16:19 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 27C93295E; Mon, 12 Jun 2023 17:13:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615213; x=1718151213; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ZRAh96yHcl8Psj9OI1FtC3p64r7joGZ816QWoOgXk8M=; b=jR39Ku9U1UHMY40GLHwezd7NoEtNGQqQiNryPFKAXQRq0vrx7mD7gK1X rra6XZzU63Au2ucRB52lmNSTOfxHMzAH8TmzfkBCVYk7Fakw1m+J05JHR vkFrOj8CRiXkQdyynaqxOqiIfPT+W7/9BT6Gvyi5/SoYAFQIZdKuT7Adi WyaEvdSVY7Lwr7oTNNVrDqMQBtdY7eFShvcsA6rlMaah3f1bRzkVLt6T3 3rjRdtRy1Ighg2JD+rava7IUhJ5I2yCAeOpMBCRhT16lnaaM86Ns90dRz Kw8TBiAHUzzLJzV/vFco1k6Obm2vEoz1utsSwhW+Y7nxebZ9ssqcoklOe Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557342" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557342" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:33 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671099" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671099" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:32 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 29/42] x86/shstk: Handle thread shadow stack Date: Mon, 12 Jun 2023 17:10:55 -0700 Message-Id: <20230613001108.3040476-30-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" When a process is duplicated, but the child shares the address space with the parent, there is potential for the threads sharing a single stack to cause conflicts for each other. In the normal non-CET case this is handled in two ways. With regular CLONE_VM a new stack is provided by userspace such that the parent and child have different stacks. For vfork, the parent is suspended until the child exits. So as long as the child doesn't return from the vfork()/CLONE_VFORK calling function and sticks to a limited set of operations, the parent and child can share the same stack. For shadow stack, these scenarios present similar sharing problems. For the CLONE_VM case, the child and the parent must have separate shadow stacks. Instead of changing clone to take a shadow stack, have the kernel just allocate one and switch to it. Use stack_size passed from clone3() syscall for thread shadow stack size. A compat-mode thread shadow stack size is further reduced to 1/4. This allows more threads to run in a 32-bit address space. The clone() does not pass stack_size, which was added to clone3(). In that case, use RLIMIT_STACK size and cap to 4 GB. For shadow stack enabled vfork(), the parent and child can share the same shadow stack, like they can share a normal stack. Since the parent is suspended until the child terminates, the child will not interfere with the parent while executing as long as it doesn't return from the vfork() and overwrite up the shadow stack. The child can safely overwrite down the shadow stack, as the parent can just overwrite this later. So CET does not add any additional limitations for vfork(). Free the shadow stack on thread exit by doing it in mm_release(). Skip this when exiting a vfork() child since the stack is shared in the parent. During this operation, the shadow stack pointer of the new thread needs to be updated to point to the newly allocated shadow stack. Since the ability to do this is confined to the FPU subsystem, change fpu_clone() to take the new shadow stack pointer, and update it internally inside the FPU subsystem. This part was suggested by Thomas Gleixner. Suggested-by: Thomas Gleixner Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/fpu/sched.h | 3 ++- arch/x86/include/asm/mmu_context.h | 2 ++ arch/x86/include/asm/shstk.h | 5 ++++ arch/x86/kernel/fpu/core.c | 36 +++++++++++++++++++++++++- arch/x86/kernel/process.c | 21 ++++++++++++++- arch/x86/kernel/shstk.c | 41 ++++++++++++++++++++++++++++-- 6 files changed, 103 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sc= hed.h index c2d6cd78ed0c..3c2903bbb456 100644 --- a/arch/x86/include/asm/fpu/sched.h +++ b/arch/x86/include/asm/fpu/sched.h @@ -11,7 +11,8 @@ =20 extern void save_fpregs_to_fpstate(struct fpu *fpu); extern void fpu__drop(struct fpu *fpu); -extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, = bool minimal); +extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, = bool minimal, + unsigned long shstk_addr); extern void fpu_flush_thread(void); =20 /* diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_= context.h index 1d29dc791f5a..416901d406f8 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -186,6 +186,8 @@ do { \ #else #define deactivate_mm(tsk, mm) \ do { \ + if (!tsk->vfork_done) \ + shstk_free(tsk); \ load_gs_index(0); \ loadsegment(fs, 0); \ } while (0) diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h index 2b1f7c9b9995..d4a5c7b10cb5 100644 --- a/arch/x86/include/asm/shstk.h +++ b/arch/x86/include/asm/shstk.h @@ -15,11 +15,16 @@ struct thread_shstk { =20 long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res); void reset_thread_features(void); +unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned lon= g clone_flags, + unsigned long stack_size); void shstk_free(struct task_struct *p); #else static inline long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) { return -EINVAL; } static inline void reset_thread_features(void) {} +static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p, + unsigned long clone_flags, + unsigned long stack_size) { return 0; } static inline void shstk_free(struct task_struct *p) {} #endif /* CONFIG_X86_USER_SHADOW_STACK */ =20 diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index f851558b673f..aa4856b236b8 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -552,8 +552,36 @@ static inline void fpu_inherit_perms(struct fpu *dst_f= pu) } } =20 +/* A passed ssp of zero will not cause any update */ +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp) +{ +#ifdef CONFIG_X86_USER_SHADOW_STACK + struct cet_user_state *xstate; + + /* If ssp update is not needed. */ + if (!ssp) + return 0; + + xstate =3D get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave, + XFEATURE_CET_USER); + + /* + * If there is a non-zero ssp, then 'dst' must be configured with a shadow + * stack and the fpu state should be up to date since it was just copied + * from the parent in fpu_clone(). So there must be a valid non-init CET + * state location in the buffer. + */ + if (WARN_ON_ONCE(!xstate)) + return 1; + + xstate->user_ssp =3D (u64)ssp; +#endif + return 0; +} + /* Clone current's FPU state on fork */ -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool min= imal) +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool min= imal, + unsigned long ssp) { struct fpu *src_fpu =3D ¤t->thread.fpu; struct fpu *dst_fpu =3D &dst->thread.fpu; @@ -613,6 +641,12 @@ int fpu_clone(struct task_struct *dst, unsigned long c= lone_flags, bool minimal) if (use_xsave()) dst_fpu->fpstate->regs.xsave.header.xfeatures &=3D ~XFEATURE_MASK_PASID; =20 + /* + * Update shadow stack pointer, in case it changed during clone. + */ + if (update_fpu_shstk(dst, ssp)) + return 1; + trace_x86_fpu_copy_src(src_fpu); trace_x86_fpu_copy_dst(dst_fpu); =20 diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index dac41a0072ea..3ab62ac98c2c 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -50,6 +50,7 @@ #include #include #include +#include =20 #include "process.h" =20 @@ -121,6 +122,7 @@ void exit_thread(struct task_struct *tsk) =20 free_vm86(t); =20 + shstk_free(tsk); fpu__drop(fpu); } =20 @@ -142,6 +144,7 @@ int copy_thread(struct task_struct *p, const struct ker= nel_clone_args *args) struct inactive_task_frame *frame; struct fork_frame *fork_frame; struct pt_regs *childregs; + unsigned long new_ssp; int ret =3D 0; =20 childregs =3D task_pt_regs(p); @@ -179,7 +182,16 @@ int copy_thread(struct task_struct *p, const struct ke= rnel_clone_args *args) frame->flags =3D X86_EFLAGS_FIXED; #endif =20 - fpu_clone(p, clone_flags, args->fn); + /* + * Allocate a new shadow stack for thread if needed. If shadow stack, + * is disabled, new_ssp will remain 0, and fpu_clone() will know not to + * update it. + */ + new_ssp =3D shstk_alloc_thread_stack(p, clone_flags, args->stack_size); + if (IS_ERR_VALUE(new_ssp)) + return PTR_ERR((void *)new_ssp); + + fpu_clone(p, clone_flags, args->fn, new_ssp); =20 /* Kernel thread ? */ if (unlikely(p->flags & PF_KTHREAD)) { @@ -225,6 +237,13 @@ int copy_thread(struct task_struct *p, const struct ke= rnel_clone_args *args) if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP))) io_bitmap_share(p); =20 + /* + * If copy_thread() if failing, don't leak the shadow stack possibly + * allocated in shstk_alloc_thread_stack() above. + */ + if (ret) + shstk_free(p); + return ret; } =20 diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 3cb85224d856..bd9cdc3a7338 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size) unsigned long addr, unused; =20 mmap_write_lock(mm); - addr =3D do_mmap(NULL, addr, size, PROT_READ, flags, + addr =3D do_mmap(NULL, 0, size, PROT_READ, flags, VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); =20 mmap_write_unlock(mm); @@ -126,6 +126,37 @@ void reset_thread_features(void) current->thread.features_locked =3D 0; } =20 +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned l= ong clone_flags, + unsigned long stack_size) +{ + struct thread_shstk *shstk =3D &tsk->thread.shstk; + unsigned long addr, size; + + /* + * If shadow stack is not enabled on the new thread, skip any + * switch to a new shadow stack. + */ + if (!features_enabled(ARCH_SHSTK_SHSTK)) + return 0; + + /* + * For CLONE_VM, except vfork, the child needs a separate shadow + * stack. + */ + if ((clone_flags & (CLONE_VFORK | CLONE_VM)) !=3D CLONE_VM) + return 0; + + size =3D adjust_shstk_size(stack_size); + addr =3D alloc_shstk(size); + if (IS_ERR_VALUE(addr)) + return addr; + + shstk->base =3D addr; + shstk->size =3D size; + + return addr + size; +} + void shstk_free(struct task_struct *tsk) { struct thread_shstk *shstk =3D &tsk->thread.shstk; @@ -134,7 +165,13 @@ void shstk_free(struct task_struct *tsk) !features_enabled(ARCH_SHSTK_SHSTK)) return; =20 - if (!tsk->mm) + /* + * When fork() with CLONE_VM fails, the child (tsk) already has a + * shadow stack allocated, and exit_thread() calls this function to + * free it. In this case the parent (current) and the child share + * the same mm struct. + */ + if (!tsk->mm || tsk->mm !=3D current->mm) return; =20 unmap_shadow_stack(shstk->base, shstk->size); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F297EC88CB5 for ; Tue, 13 Jun 2023 00:17:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239482AbjFMARk (ORCPT ); Mon, 12 Jun 2023 20:17:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53522 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238922AbjFMAQd (ORCPT ); Mon, 12 Jun 2023 20:16:33 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D349E2D52; Mon, 12 Jun 2023 17:13:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615222; x=1718151222; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=JRlODCr89ZW4tpE3v7Z5UBl8Adk88XgPugy9MazljtY=; b=TuwYIIuiLLu17rlsf1KPe1S2z0CKrbLXdnrtqChFKDBsp/Jd6h0FN435 78+KNK9LwB2oLRIPK6rzzk/VJ3SkGESBJqTm3Y4Q60Ifh7csCdFtQWFce +TvsWQa40rZmXA3RqsW7dSjVmC4DnwZ7ECZodTSDNvxkTpbaUfzlAFCHA 0YTDe4yO4ZUhptejG7xjA9Q0IG0wLruYH8UPivkGoKn6HxL3xeVBbzEXv HgLEw+gSHVVdmeOyXBeiH42mk7BVytPZHobTCnD4zV/9mQ1XkHngKxtHp NkUil2uSY335dxRlhPrR7IFCOSZOZSXK+TlvusbnSvK4Z7UV81dhKNQRJ w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557364" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557364" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:34 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671105" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671105" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:33 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 30/42] x86/shstk: Introduce routines modifying shstk Date: Mon, 12 Jun 2023 17:10:56 -0700 Message-Id: <20230613001108.3040476-31-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Shadow stacks are normally written to via CALL/RET or specific CET instructions like RSTORSSP/SAVEPREVSSP. However, sometimes the kernel will need to write to the shadow stack directly using the ring-0 only WRUSS instruction. A shadow stack restore token marks a restore point of the shadow stack, and the address in a token must point directly above the token, which is within the same shadow stack. This is distinctively different from other pointers on the shadow stack, since those pointers point to executable code area. Introduce token setup and verify routines. Also introduce WRUSS, which is a kernel-mode instruction but writes directly to user shadow stack. In future patches that enable shadow stack to work with signals, the kernel will need something to denote the point in the stack where sigreturn may be called. This will prevent attackers calling sigreturn at arbitrary places in the stack, in order to help prevent SROP attacks. To do this, something that can only be written by the kernel needs to be placed on the shadow stack. This can be accomplished by setting bit 63 in the frame written to the shadow stack. Userspace return addresses can't have this bit set as it is in the kernel range. It also can't be a valid restore token. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/special_insns.h | 13 +++++ arch/x86/kernel/shstk.c | 75 ++++++++++++++++++++++++++++ 2 files changed, 88 insertions(+) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/sp= ecial_insns.h index de48d1389936..d6cd9344f6c7 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p) : [pax] "a" (p)); } =20 +#ifdef CONFIG_X86_USER_SHADOW_STACK +static inline int write_user_shstk_64(u64 __user *addr, u64 val) +{ + asm_volatile_goto("1: wrussq %[val], (%[addr])\n" + _ASM_EXTABLE(1b, %l[fail]) + :: [addr] "r" (addr), [val] "r" (val) + :: fail); + return 0; +fail: + return -EFAULT; +} +#endif /* CONFIG_X86_USER_SHADOW_STACK */ + #define nop() asm volatile ("nop") =20 static inline void serialize(void) diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index bd9cdc3a7338..e22928c63ffc 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -25,6 +25,8 @@ #include #include =20 +#define SS_FRAME_SIZE 8 + static bool features_enabled(unsigned long features) { return current->thread.features & features; @@ -40,6 +42,35 @@ static void features_clr(unsigned long features) current->thread.features &=3D ~features; } =20 +/* + * Create a restore token on the shadow stack. A token is always 8-byte + * and aligned to 8. + */ +static int create_rstor_token(unsigned long ssp, unsigned long *token_addr) +{ + unsigned long addr; + + /* Token must be aligned */ + if (!IS_ALIGNED(ssp, 8)) + return -EINVAL; + + addr =3D ssp - SS_FRAME_SIZE; + + /* + * SSP is aligned, so reserved bits and mode bit are a zero, just mark + * the token 64-bit. + */ + ssp |=3D BIT(0); + + if (write_user_shstk_64((u64 __user *)addr, (u64)ssp)) + return -EFAULT; + + if (token_addr) + *token_addr =3D addr; + + return 0; +} + static unsigned long alloc_shstk(unsigned long size) { int flags =3D MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; @@ -157,6 +188,50 @@ unsigned long shstk_alloc_thread_stack(struct task_str= uct *tsk, unsigned long cl return addr + size; } =20 +static unsigned long get_user_shstk_addr(void) +{ + unsigned long long ssp; + + fpregs_lock_and_load(); + + rdmsrl(MSR_IA32_PL3_SSP, ssp); + + fpregs_unlock(); + + return ssp; +} + +#define SHSTK_DATA_BIT BIT(63) + +static int put_shstk_data(u64 __user *addr, u64 data) +{ + if (WARN_ON_ONCE(data & SHSTK_DATA_BIT)) + return -EINVAL; + + /* + * Mark the high bit so that the sigframe can't be processed as a + * return address. + */ + if (write_user_shstk_64(addr, data | SHSTK_DATA_BIT)) + return -EFAULT; + return 0; +} + +static int get_shstk_data(unsigned long *data, unsigned long __user *addr) +{ + unsigned long ldata; + + if (unlikely(get_user(ldata, addr))) + return -EFAULT; + + if (!(ldata & SHSTK_DATA_BIT)) + return -EINVAL; + + *data =3D ldata & ~SHSTK_DATA_BIT; + + return 0; +} + void shstk_free(struct task_struct *tsk) { struct thread_shstk *shstk =3D &tsk->thread.shstk; --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29EB4C7EE2E for ; Tue, 13 Jun 2023 00:17:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239490AbjFMARn (ORCPT ); Mon, 12 Jun 2023 20:17:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238929AbjFMAQs (ORCPT ); Mon, 12 Jun 2023 20:16:48 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8DDA430D2; Mon, 12 Jun 2023 17:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615226; x=1718151226; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5PTalCu64urUNwCF3tCp/oZ88l0SJTltpYFQdwbjc9Y=; b=OwZ6peWipklmmu+7RE3G7V95NHEZpPAQWJ/VVDHmSyr4KC2S2pdTviQV wNjkfEqFJeDUaetYMci9QsJySWLJJ62L20Fca96YxFrRZDDzoHDq5DszD trylP+Q1npg9BzUfXA0jAE1CyRtZ453tGhTi+hreSGebmOSiANpbiQ94A V3knDybfMstsNkqNS0BziyN8I+pvtwzeZMlmdcVi/XlS35j8jR/Wmuxc7 2nTUUBEdt2EyK7wlHZcexq8ZNIuoLerP9mzKWwuo1PiwTjAcTrrfFWRmG k2PEHwQA5ddv5JDyYN6aV25YNKyKLhit82TT9W+IDTtwD+clDE8F2PM29 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557387" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557387" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:35 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671109" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671109" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:34 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 31/42] x86/shstk: Handle signals for shadow stack Date: Mon, 12 Jun 2023 17:10:57 -0700 Message-Id: <20230613001108.3040476-32-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" When a signal is handled, the context is pushed to the stack before handling it. For shadow stacks, since the shadow stack only tracks return addresses, there isn't any state that needs to be pushed. However, there are still a few things that need to be done. These things are visible to userspace and which will be kernel ABI for shadow stacks. One is to make sure the restorer address is written to shadow stack, since the signal handler (if not changing ucontext) returns to the restorer, and the restorer calls sigreturn. So add the restorer on the shadow stack before handling the signal, so there is not a conflict when the signal handler returns to the restorer. The other thing to do is to place some type of checkable token on the thread's shadow stack before handling the signal and check it during sigreturn. This is an extra layer of protection to hamper attackers calling sigreturn manually as in SROP-like attacks. For this token the shadow stack data format defined earlier can be used. Have the data pushed be the previous SSP. In the future the sigreturn might want to return back to a different stack. Storing the SSP (instead of a restore offset or something) allows for future functionality that may want to restore to a different stack. So, when handling a signal push - the SSP pointing in the shadow stack data format - the restorer address below the restore token. In sigreturn, verify SSP is stored in the data format and pop the shadow stack. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/asm/shstk.h | 5 ++ arch/x86/kernel/shstk.c | 95 ++++++++++++++++++++++++++++++++++++ arch/x86/kernel/signal.c | 1 + arch/x86/kernel/signal_64.c | 6 +++ 4 files changed, 107 insertions(+) diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h index d4a5c7b10cb5..ecb23a8ca47d 100644 --- a/arch/x86/include/asm/shstk.h +++ b/arch/x86/include/asm/shstk.h @@ -6,6 +6,7 @@ #include =20 struct task_struct; +struct ksignal; =20 #ifdef CONFIG_X86_USER_SHADOW_STACK struct thread_shstk { @@ -18,6 +19,8 @@ void reset_thread_features(void); unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned lon= g clone_flags, unsigned long stack_size); void shstk_free(struct task_struct *p); +int setup_signal_shadow_stack(struct ksignal *ksig); +int restore_signal_shadow_stack(void); #else static inline long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) { return -EINVAL; } @@ -26,6 +29,8 @@ static inline unsigned long shstk_alloc_thread_stack(stru= ct task_struct *p, unsigned long clone_flags, unsigned long stack_size) { return 0; } static inline void shstk_free(struct task_struct *p) {} +static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return= 0; } +static inline int restore_signal_shadow_stack(void) { return 0; } #endif /* CONFIG_X86_USER_SHADOW_STACK */ =20 #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index e22928c63ffc..f02e8ea4f1b5 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -232,6 +232,101 @@ static int get_shstk_data(unsigned long *data, unsign= ed long __user *addr) return 0; } =20 +static int shstk_push_sigframe(unsigned long *ssp) +{ + unsigned long target_ssp =3D *ssp; + + /* Token must be aligned */ + if (!IS_ALIGNED(target_ssp, 8)) + return -EINVAL; + + *ssp -=3D SS_FRAME_SIZE; + if (put_shstk_data((void *__user)*ssp, target_ssp)) + return -EFAULT; + + return 0; +} + +static int shstk_pop_sigframe(unsigned long *ssp) +{ + unsigned long token_addr; + int err; + + err =3D get_shstk_data(&token_addr, (unsigned long __user *)*ssp); + if (unlikely(err)) + return err; + + /* Restore SSP aligned? */ + if (unlikely(!IS_ALIGNED(token_addr, 8))) + return -EINVAL; + + /* SSP in userspace? */ + if (unlikely(token_addr >=3D TASK_SIZE_MAX)) + return -EINVAL; + + *ssp =3D token_addr; + + return 0; +} + +int setup_signal_shadow_stack(struct ksignal *ksig) +{ + void __user *restorer =3D ksig->ka.sa.sa_restorer; + unsigned long ssp; + int err; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || + !features_enabled(ARCH_SHSTK_SHSTK)) + return 0; + + if (!restorer) + return -EINVAL; + + ssp =3D get_user_shstk_addr(); + if (unlikely(!ssp)) + return -EINVAL; + + err =3D shstk_push_sigframe(&ssp); + if (unlikely(err)) + return err; + + /* Push restorer address */ + ssp -=3D SS_FRAME_SIZE; + err =3D write_user_shstk_64((u64 __user *)ssp, (u64)restorer); + if (unlikely(err)) + return -EFAULT; + + fpregs_lock_and_load(); + wrmsrl(MSR_IA32_PL3_SSP, ssp); + fpregs_unlock(); + + return 0; +} + +int restore_signal_shadow_stack(void) +{ + unsigned long ssp; + int err; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || + !features_enabled(ARCH_SHSTK_SHSTK)) + return 0; + + ssp =3D get_user_shstk_addr(); + if (unlikely(!ssp)) + return -EINVAL; + + err =3D shstk_pop_sigframe(&ssp); + if (unlikely(err)) + return err; + + fpregs_lock_and_load(); + wrmsrl(MSR_IA32_PL3_SSP, ssp); + fpregs_unlock(); + + return 0; +} + void shstk_free(struct task_struct *tsk) { struct thread_shstk *shstk =3D &tsk->thread.shstk; diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 004cb30b7419..356253e85ce9 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 static inline int is_ia32_compat_frame(struct ksignal *ksig) { diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index 0e808c72bf7e..cacf2ede6217 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -175,6 +175,9 @@ int x64_setup_rt_frame(struct ksignal *ksig, struct pt_= regs *regs) frame =3D get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp); uc_flags =3D frame_uc_flags(regs); =20 + if (setup_signal_shadow_stack(ksig)) + return -EFAULT; + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; =20 @@ -260,6 +263,9 @@ SYSCALL_DEFINE0(rt_sigreturn) if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags)) goto badframe; =20 + if (restore_signal_shadow_stack()) + goto badframe; + if (restore_altstack(&frame->uc.uc_stack)) goto badframe; =20 --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E5E2FC88CB8 for ; Tue, 13 Jun 2023 00:18:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238469AbjFMASD (ORCPT ); Mon, 12 Jun 2023 20:18:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53208 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238930AbjFMAQv (ORCPT ); Mon, 12 Jun 2023 20:16:51 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8C84830E6; Mon, 12 Jun 2023 17:13:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615228; x=1718151228; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qj0G4UwqkNNGQlR4Z3X67pikfZ5WOGk4HokEGvo8sro=; b=JAetgpkgtOB33AnSsLsZ6l+U4GY9rrlYZYi4fpJkFOIa7Kr9KAjCNHBY SV/PxNHe8T509ZqE4kY6TLxDxCPSjnm+/PiboySFlE+BGU14sWnH+iyJ3 1kSvKqdatfafqbNOrVnumH79hDvXzDG/7IDRCytLvnR4L7L5iwPK/FrX7 tZ29VfZcOZwPBQTF4eYNNOCLV6AahxgWbIlwDwj7j89ccJFJm0wl9t9O1 lE8wFzjrGRhmeM3UfPl9kY8ZuzEd9opmzaCuybQJjdryPargw0Q7Srwiq puDrwQ0mc0+eCy5ThXSDfw8yLTgpJSz9TiulNRpfUpM9ng1hTGrunyRm0 Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557411" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557411" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:36 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671113" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671113" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:35 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com Subject: [PATCH v9 32/42] x86/shstk: Check that SSP is aligned on sigreturn Date: Mon, 12 Jun 2023 17:10:58 -0700 Message-Id: <20230613001108.3040476-33-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The shadow stack signal frame is read by the kernel on sigreturn. It relies on shadow stack memory protections to prevent forgeries of this signal frame (which included the pre-signal SSP). It also relies on the shadow stack signal frame to have bit 63 set. Since this bit would not be set via typical shadow stack operations, so the kernel can assume it was a value it placed there. However, in order to support 32 bit shadow stack, the INCSSPD instruction can increment the shadow stack by 4 bytes. In this case SSP might be pointing to a region spanning two 8 byte shadow stack frames. It could confuse the checks described above. Since the kernel only supports shadow stack in 64 bit, just check that the SSP is 8 byte aligned in the sigreturn path. Signed-off-by: Rick Edgecombe --- v9: - New patch --- arch/x86/kernel/shstk.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index f02e8ea4f1b5..a8705f7d966c 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -252,6 +252,9 @@ static int shstk_pop_sigframe(unsigned long *ssp) unsigned long token_addr; int err; =20 + if (!IS_ALIGNED(*ssp, 8)) + return -EINVAL; + err =3D get_shstk_data(&token_addr, (unsigned long __user *)*ssp); if (unlikely(err)) return err; --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 26BC0C7EE43 for ; Tue, 13 Jun 2023 00:18:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239252AbjFMASK (ORCPT ); Mon, 12 Jun 2023 20:18:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51914 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239019AbjFMAQz (ORCPT ); Mon, 12 Jun 2023 20:16:55 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6C39530F1; Mon, 12 Jun 2023 17:13:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615230; x=1718151230; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=T7gc3JiJMbmruUv2dTeFaPrm+ED2TBUgFQFH5yJf4mA=; b=VtwaUGGb52YlAWh8RBwn4COZxQR9V4cvBL0nuOeTgM0SrP74fEC6EW3Y NMK+5Qm3Ie0kdtQz8rzORUZQEPQXjzP03L6kIiTbi4f+wpmn/eKJV2K2e HGEXyqIURRLcS9CkhV4hm/GPqZP1HmTxQtBtuGfV4eVKPx+PWRefSxjHp 7doUuumQApGo+zlwyBonyBy84tDUw94TtCKjaqAcl16Yrqws6NcsEy9zs +9yeMQNhUNF/KTb8sZwIiI5gZSVKcqGH/dnWNt4xIXR3Dn/aok2hkOBQ0 JbbSrEAv+3B234bj4Q6pelx8q2oHc4lVfEmdCQmNmeCJFvUY9rf/Y5PtB Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557436" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557436" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:37 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671121" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671121" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:36 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com Subject: [PATCH v9 33/42] x86/shstk: Check that signal frame is shadow stack mem Date: Mon, 12 Jun 2023 17:10:59 -0700 Message-Id: <20230613001108.3040476-34-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The shadow stack signal frame is read by the kernel on sigreturn. It relies on shadow stack memory protections to prevent forgeries of this signal frame (which included the pre-signal SSP). This behavior helps userspace protect itself. However, using the INCSSP instruction userspace can adjust the SSP to 8 bytes beyond the end of a shadow stack. INCSSP performs shadow stack reads to make sure it doesn=E2=80=99t increment off o= f the shadow stack, but on the end position it actually reads 8 bytes below the new SSP. For the shadow stack HW operations, this situation (INCSSP off the end of a shadow stack by 8 bytes) would be fine. If the a RET is executed, the push to the shadow stack would fail to write to the shadow stack. If a CALL is executed, the SSP will be incremented back onto the stack and the return address will be written successfully to the very end. That is expected behavior around shadow stack underflow. However, the kernel doesn=E2=80=99t have a way to read shadow stack memory = using shadow stack accesses. WRUSS can write to shadow stack memory with a shadow stack access which ensures the access is to shadow stack memory. But unfortunately for this case, there is no equivalent instruction for shadow stack reads. So when reading the shadow stack signal frames, the kernel currently assumes the SSP is pointing to the shadow stack and uses a normal read. The SSP pointing to shadow stack memory will be true in most cases, but as described above, in can be untrue by 8 bytes. So lookup the VMA of the shadow stack sigframe being read to verify it is shadow stack. Since the SSP can only be beyond the shadow stack by 8 bytes, and shadow stack memory is page aligned, this check only needs to be done when this type of relative position to a page boundary is encountered. So skip the extra work otherwise. Signed-off-by: Rick Edgecombe --- v9: - New patch --- arch/x86/kernel/shstk.c | 31 +++++++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index a8705f7d966c..50733a510446 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -249,15 +249,38 @@ static int shstk_push_sigframe(unsigned long *ssp) =20 static int shstk_pop_sigframe(unsigned long *ssp) { + struct vm_area_struct *vma; unsigned long token_addr; - int err; + bool need_to_check_vma; + int err =3D 1; =20 + /* + * It is possible for the SSP to be off the end of a shadow stack by 4 + * or 8 bytes. If the shadow stack is at the start of a page or 4 bytes + * before it, it might be this case, so check that the address being + * read is actually shadow stack. + */ if (!IS_ALIGNED(*ssp, 8)) return -EINVAL; =20 + need_to_check_vma =3D PAGE_ALIGN(*ssp) =3D=3D *ssp; + + if (need_to_check_vma) + mmap_read_lock_killable(current->mm); + err =3D get_shstk_data(&token_addr, (unsigned long __user *)*ssp); if (unlikely(err)) - return err; + goto out_err; + + if (need_to_check_vma) { + vma =3D find_vma(current->mm, *ssp); + if (!vma || !(vma->vm_flags & VM_SHADOW_STACK)) { + err =3D -EFAULT; + goto out_err; + } + + mmap_read_unlock(current->mm); + } =20 /* Restore SSP aligned? */ if (unlikely(!IS_ALIGNED(token_addr, 8))) @@ -270,6 +293,10 @@ static int shstk_pop_sigframe(unsigned long *ssp) *ssp =3D token_addr; =20 return 0; +out_err: + if (need_to_check_vma) + mmap_read_unlock(current->mm); + return err; } =20 int setup_signal_shadow_stack(struct ksignal *ksig) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C0B06C7EE2E for ; Tue, 13 Jun 2023 00:18:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239510AbjFMASO (ORCPT ); Mon, 12 Jun 2023 20:18:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53310 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239151AbjFMAQ4 (ORCPT ); Mon, 12 Jun 2023 20:16:56 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9F33930F8; Mon, 12 Jun 2023 17:13:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615231; x=1718151231; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=q/smTC4JyzbSbIdcW3lB/5FNt1IRicAxT9isRFlZelE=; b=TynvUVM7+REZHysTsxCjKulPBu1USkfx2Y5M1Ir7sMcLaUVgutwIEnP5 ff06+l+NsRzlOpUUZ9SFAMRV+2g7WXw2Nwi/LQ6kP53BJc1+G0gjwl8VT RxEWcFV34KtKFgsFfrb6YywDrM2JEYjam91XEE69zPwTAHr3Dp60fqrKU Kvm79yoLzx2cLNuqQw9RlL771Y/oxOT17emjX2cL9SHCFzJl1seg0OBp7 9LP0eCN9nYdx85I/rGAxLUCrQW2K2bk20f1GPvCC6PaNgFLKEeLTEmZKK JPZLHdbE+iIXZtiAc94fQB3W+UvM9HEcKnGkks2JO0O/Qxk35OTDJBqNe g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557464" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557464" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:37 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671125" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671125" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:36 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 34/42] x86/shstk: Introduce map_shadow_stack syscall Date: Mon, 12 Jun 2023 17:11:00 -0700 Message-Id: <20230613001108.3040476-35-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace allocating and pivoting to userspace managed stacks. Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be setup with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But, the security design of shadow stacks is that they should not be written to except in limited circumstances. This presents a problem for userspace, as to how userspace can provision this special data, without allowing for the shadow stack to be generally writable. Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other threads could write to the shadow stack during the writable window. The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory can be mapped as shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself can write the restore token. First, a new madvise() flag was explored, which could operate on the PROT_SHADOW_STACK memory. This had a couple of downsides: 1. Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK. 2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to. 3. It stood out from the rest of the madvise flags, as more of direct action than a hint at future desired behavior. So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and setup new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to setup its own shadow stacks using the WRSS instruction. Towards this provide a flag so that stacks can be optionally setup securely for the common case of ucontext without enabling WRSS. Or potentially have the kernel set up the shadow stack in some new way. The following example demonstrates how to create a new shadow stack with map_shadow_stack: void *shstk =3D map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/x86/include/uapi/asm/mman.h | 3 ++ arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++---- include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 2 +- kernel/sys_ni.c | 1 + 6 files changed, 58 insertions(+), 9 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscal= ls/syscall_64.tbl index c84d12608cd2..f65c671ce3b1 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 64 map_shadow_stack sys_map_shadow_stack =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/m= man.h index 5a0256e73f1e..8148bdddbd2c 100644 --- a/arch/x86/include/uapi/asm/mman.h +++ b/arch/x86/include/uapi/asm/mman.h @@ -13,6 +13,9 @@ ((key) & 0x8 ? VM_PKEY_BIT3 : 0)) #endif =20 +/* Flags for map_shadow_stack(2) */ +#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in th= e shadow stack */ + #include =20 #endif /* _ASM_X86_MMAN_H */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 50733a510446..04c37b33a625 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsign= ed long *token_addr) return 0; } =20 -static unsigned long alloc_shstk(unsigned long size) +static unsigned long alloc_shstk(unsigned long addr, unsigned long size, + unsigned long token_offset, bool set_res_tok) { int flags =3D MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; struct mm_struct *mm =3D current->mm; - unsigned long addr, unused; + unsigned long mapped_addr, unused; + + if (addr) + flags |=3D MAP_FIXED_NOREPLACE; =20 mmap_write_lock(mm); - addr =3D do_mmap(NULL, 0, size, PROT_READ, flags, - VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); - + mapped_addr =3D do_mmap(NULL, addr, size, PROT_READ, flags, + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); mmap_write_unlock(mm); =20 - return addr; + if (!set_res_tok || IS_ERR_VALUE(mapped_addr)) + goto out; + + if (create_rstor_token(mapped_addr + token_offset, NULL)) { + vm_munmap(mapped_addr, size); + return -EINVAL; + } + +out: + return mapped_addr; } =20 static unsigned long adjust_shstk_size(unsigned long size) @@ -134,7 +147,7 @@ static int shstk_setup(void) return -EOPNOTSUPP; =20 size =3D adjust_shstk_size(0); - addr =3D alloc_shstk(size); + addr =3D alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return PTR_ERR((void *)addr); =20 @@ -178,7 +191,7 @@ unsigned long shstk_alloc_thread_stack(struct task_stru= ct *tsk, unsigned long cl return 0; =20 size =3D adjust_shstk_size(stack_size); - addr =3D alloc_shstk(size); + addr =3D alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return addr; =20 @@ -398,6 +411,36 @@ static int shstk_disable(void) return 0; } =20 +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size= , unsigned int, flags) +{ + bool set_tok =3D flags & SHADOW_STACK_SET_TOKEN; + unsigned long aligned_size; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -EOPNOTSUPP; + + if (flags & ~SHADOW_STACK_SET_TOKEN) + return -EINVAL; + + /* If there isn't space for a token */ + if (set_tok && size < 8) + return -ENOSPC; + + if (addr && addr < SZ_4G) + return -ERANGE; + + /* + * An overflow would result in attempting to write the restore token + * to the wrong location. Not catastrophic, but just return the right + * error code and block it. + */ + aligned_size =3D PAGE_ALIGN(size); + if (aligned_size < size) + return -EOVERFLOW; + + return alloc_shstk(addr, aligned_size, size, set_tok); +} + long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res) { if (option =3D=3D ARCH_SHSTK_LOCK) { diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 33a0ee3bcb2e..392dc11e3556 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned = long len, unsigned long home_node, unsigned long flags); +asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long siz= e, unsigned int flags); =20 /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 45fa180cc56a..b12940ec5926 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) =20 #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 =20 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..cb9aebd34646 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -381,6 +381,7 @@ COND_SYSCALL(vm86old); COND_SYSCALL(modify_ldt); COND_SYSCALL(vm86); COND_SYSCALL(kexec_file_load); +COND_SYSCALL(map_shadow_stack); =20 /* s390 */ COND_SYSCALL(s390_pci_mmio_read); --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 699D2C88CB7 for ; Tue, 13 Jun 2023 00:19:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239544AbjFMAS4 (ORCPT ); Mon, 12 Jun 2023 20:18:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53358 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239081AbjFMARe (ORCPT ); Mon, 12 Jun 2023 20:17:34 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B91AF0; Mon, 12 Jun 2023 17:14:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615247; x=1718151247; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Src9+kIB+Tfk1ReeJh4vEiYrNG1G01r+lIh28P15+jA=; b=QgMsEJA5QLErJ9NSGDjEfYUTb4UJdi6QHUJiFIV9ZPaamI/8fZoskvLg EGQkV6D6Ons2i/3Cbxa7uKZlv9LPdIZjFiOdUs/8zMth95uT2TEQ40Cdm jGvsuKB04wUHNK/CQKlLUE/gICny38QWl8ekxu+jSYNRshJLPzpHyfvz6 AWC6oSsbGWDydV0BrjfOcSOPafkmiVh1nB/6uBcV8M3S9KEmkDuTJ9pV+ bz59jqqvK1l8jYtqIIjXCtknhx9mckcllIReBjBUaPEKa/JkGZPCIQ2+c kIw91sYppsJeoQu6M2r+25HotvQFjSKE2un8UpoVJrgitrljEUU3SCWME w==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557487" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557487" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:38 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671131" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671131" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:37 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 35/42] x86/shstk: Support WRSS for userspace Date: Mon, 12 Jun 2023 17:11:01 -0700 Message-Id: <20230613001108.3040476-36-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" For the current shadow stack implementation, shadow stacks contents can't easily be provisioned with arbitrary data. This property helps apps protect themselves better, but also restricts any potential apps that may want to do exotic things at the expense of a little security. The x86 shadow stack feature introduces a new instruction, WRSS, which can be enabled to write directly to shadow stack memory from userspace. Allow it to get enabled via the prctl interface. Only enable the userspace WRSS instruction, which allows writes to userspace shadow stacks from userspace. Do not allow it to be enabled independently of shadow stack, as HW does not support using WRSS when shadow stack is disabled. >From a fault handler perspective, WRSS will behave very similar to WRUSS, which is treated like a user access from a #PF err code perspective. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/include/uapi/asm/prctl.h | 1 + arch/x86/kernel/shstk.c | 43 ++++++++++++++++++++++++++++++- 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/= prctl.h index 6a8e0e1bff4a..eedfde3b63be 100644 --- a/arch/x86/include/uapi/asm/prctl.h +++ b/arch/x86/include/uapi/asm/prctl.h @@ -36,5 +36,6 @@ =20 /* ARCH_SHSTK_ features bits */ #define ARCH_SHSTK_SHSTK (1ULL << 0) +#define ARCH_SHSTK_WRSS (1ULL << 1) =20 #endif /* _ASM_X86_PRCTL_H */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 04c37b33a625..ea0bf113f9cf 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -390,6 +390,47 @@ void shstk_free(struct task_struct *tsk) unmap_shadow_stack(shstk->base, shstk->size); } =20 +static int wrss_control(bool enable) +{ + u64 msrval; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -EOPNOTSUPP; + + /* + * Only enable WRSS if shadow stack is enabled. If shadow stack is not + * enabled, WRSS will already be disabled, so don't bother clearing it + * when disabling. + */ + if (!features_enabled(ARCH_SHSTK_SHSTK)) + return -EPERM; + + /* Already enabled/disabled? */ + if (features_enabled(ARCH_SHSTK_WRSS) =3D=3D enable) + return 0; + + fpregs_lock_and_load(); + rdmsrl(MSR_IA32_U_CET, msrval); + + if (enable) { + features_set(ARCH_SHSTK_WRSS); + msrval |=3D CET_WRSS_EN; + } else { + features_clr(ARCH_SHSTK_WRSS); + if (!(msrval & CET_WRSS_EN)) + goto unlock; + + msrval &=3D ~CET_WRSS_EN; + } + + wrmsrl(MSR_IA32_U_CET, msrval); + +unlock: + fpregs_unlock(); + + return 0; +} + static int shstk_disable(void) { if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) @@ -406,7 +447,7 @@ static int shstk_disable(void) fpregs_unlock(); =20 shstk_free(current); - features_clr(ARCH_SHSTK_SHSTK); + features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS); =20 return 0; } --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 78906C7EE2E for ; Tue, 13 Jun 2023 00:19:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239571AbjFMATK (ORCPT ); Mon, 12 Jun 2023 20:19:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52418 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239183AbjFMARe (ORCPT ); Mon, 12 Jun 2023 20:17:34 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C8447198D; Mon, 12 Jun 2023 17:14:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615249; x=1718151249; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=x/pco2Xet/ihWAT7hysHs+5knwsBlg8L8hAHOHmdlgo=; b=MN4bLAP1UKupjLheS/bfz31PM8cXj7TLiVHKl7HIS1Hhdi369CtYqpgL MaIMhOAe9E3OjGCL9txRFET4Kkmf47SZXzdUZgM9IAV9l24NR8pU8y6x5 k7/oVLF5Cr0w1GiG+x94Oz8nq0Ys3qf0EBd4GrQnxjcOBu4X+wsO+jxeq 9qE0ySXyO0gSLmh0SCRE+26hnqRvTQhc0P6eW0G8fK2gyPrU9dctGHOlo Ed5Nwji1Y7Yi3b8uFq5vnP1rm2dRelLZSCk0K86qrwSgivl/NW1xNujNu khoRkBWvRXq7wbOKsUc+n8iwE7YdMGo9mRtMTIGDgI72GZhT3M9QZt9w2 Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557511" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557511" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:39 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671140" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671140" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:38 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 36/42] x86: Expose thread features in /proc/$PID/status Date: Mon, 12 Jun 2023 17:11:02 -0700 Message-Id: <20230613001108.3040476-37-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Applications and loaders can have logic to decide whether to enable shadow stack. They usually don't report whether shadow stack has been enabled or not, so there is no way to verify whether an application actually is protected by shadow stack. Add two lines in /proc/$PID/status to report enabled and locked features. Since, this involves referring to arch specific defines in asm/prctl.h, implement an arch breakout to emit the feature lines. [Switched to CET, added to commit log] Co-developed-by: Kirill A. Shutemov Signed-off-by: Kirill A. Shutemov Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++ fs/proc/array.c | 6 ++++++ include/linux/proc_fs.h | 2 ++ 3 files changed, 31 insertions(+) diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c index 099b6f0d96bd..31c0e68f6227 100644 --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -4,6 +4,8 @@ #include #include #include +#include +#include =20 #include "cpu.h" =20 @@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op =3D { .stop =3D c_stop, .show =3D show_cpuinfo, }; + +#ifdef CONFIG_X86_USER_SHADOW_STACK +static void dump_x86_features(struct seq_file *m, unsigned long features) +{ + if (features & ARCH_SHSTK_SHSTK) + seq_puts(m, "shstk "); + if (features & ARCH_SHSTK_WRSS) + seq_puts(m, "wrss "); +} + +void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct = *task) +{ + seq_puts(m, "x86_Thread_features:\t"); + dump_x86_features(m, task->thread.features); + seq_putc(m, '\n'); + + seq_puts(m, "x86_Thread_features_locked:\t"); + dump_x86_features(m, task->thread.features_locked); + seq_putc(m, '\n'); +} +#endif /* CONFIG_X86_USER_SHADOW_STACK */ diff --git a/fs/proc/array.c b/fs/proc/array.c index d35bbf35a874..2c2efbe685d8 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -431,6 +431,11 @@ static inline void task_untag_mask(struct seq_file *m,= struct mm_struct *mm) seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm)); } =20 +__weak void arch_proc_pid_thread_features(struct seq_file *m, + struct task_struct *task) +{ +} + int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { @@ -455,6 +460,7 @@ int proc_pid_status(struct seq_file *m, struct pid_name= space *ns, task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); + arch_proc_pid_thread_features(m, task); return 0; } =20 diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 0260f5ea98fe..80ff8e533cbd 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid= _namespace *ns, struct pid *pid, struct task_struct *task); #endif /* CONFIG_PROC_PID_ARCH_STATUS */ =20 +void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct = *task); + #else /* CONFIG_PROC_FS */ =20 static inline void proc_root_init(void) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5CA24C7EE2E for ; Tue, 13 Jun 2023 00:19:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239190AbjFMATG (ORCPT ); Mon, 12 Jun 2023 20:19:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52424 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239327AbjFMARg (ORCPT ); Mon, 12 Jun 2023 20:17:36 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5F87F199E; Mon, 12 Jun 2023 17:14:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615253; x=1718151253; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=xI4pbs3PiOhFQEbuDUq+4GewOhti2Nle1TQJwXiYqGs=; b=hu8F+oRkUA9SufUWd3i53q2VvFFsKCZw7iZl7c/I4VeIQoJKiyTlfduS F2YiwOgJvJOgmJpBVZ6VLsDMRjLJXhitb4CkdAZzs+ebTSgG7Rg/41+It Xx9GdUMWm1jfvOrHJLnSRhh8YZzN3+Bu9PmFs10hWoS0f/DKx/4qx9gWk WOgO+2Yduwx7KtzTwK0NnZ7qkRu9g7VKQjqPoIWmdgTTqSN043dfzDtRJ CB41KWAYViTKvcGK7TzR3I+cW+dGhxa+mfhRMvlSP/AeB3pru90PrCA2S sC6AxQDYieCiP7oi7/fKGBhKGbo1g+c9K5Q38YglvS/myT1D0j6Sb21V/ A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557540" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557540" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:40 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671152" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671152" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:39 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 37/42] x86/shstk: Wire in shadow stack interface Date: Mon, 12 Jun 2023 17:11:03 -0700 Message-Id: <20230613001108.3040476-38-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The kernel now has the main shadow stack functionality to support applications. Wire in the WRSS and shadow stack enable/disable functions into the existing shadow stack API skeleton. Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/kernel/shstk.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index ea0bf113f9cf..d723cdc93474 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -502,9 +502,17 @@ long shstk_prctl(struct task_struct *task, int option,= unsigned long features) return -EINVAL; =20 if (option =3D=3D ARCH_SHSTK_DISABLE) { + if (features & ARCH_SHSTK_WRSS) + return wrss_control(false); + if (features & ARCH_SHSTK_SHSTK) + return shstk_disable(); return -EINVAL; } =20 /* Handle ARCH_SHSTK_ENABLE */ + if (features & ARCH_SHSTK_SHSTK) + return shstk_setup(); + if (features & ARCH_SHSTK_WRSS) + return wrss_control(true); return -EINVAL; } --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8105C88CBB for ; Tue, 13 Jun 2023 00:19:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239608AbjFMATR (ORCPT ); Mon, 12 Jun 2023 20:19:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52884 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238376AbjFMARi (ORCPT ); Mon, 12 Jun 2023 20:17:38 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7BB2E3A98; Mon, 12 Jun 2023 17:14:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615255; x=1718151255; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Umo485lkaXi+aij5vBJ9ejxN8cdryMJacLabCk9dyEI=; b=MWPkJUdmtyjntK5lDrs/JDccnFDWtHqzqfZbFawreZbEia0neiUKvyIs L/Xkukjx4tFeYXalOUMN3smyv9Hko+wyJIj7AYHQc6bJtsg1yyIYI/2jD OuWmyqJbmkq6gLSvG2I4N0YBQ4szoPKEwEgT671iJ5M/5yO0y1C7pJ6Vp 9x2Fk+U+4IXfi8jETSAdXN5CMfJhfKzM2UapONY5T7z3ziGbbB+Tru8G/ ZRQ0HBefpxT/K+Su1gSkvM1uNwanzh+d9IlvQtaROm5K/ooGMfjLXBnUg ABrllXCJao/a3M5WjBMvVFo3z/TZjPbZhFsFhyB+b5NWWY23xpfKgIwWl g==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557564" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557564" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671162" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671162" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:40 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 38/42] x86/cpufeatures: Enable CET CR4 bit for shadow stack Date: Mon, 12 Jun 2023 17:11:04 -0700 Message-Id: <20230613001108.3040476-39-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Setting CR4.CET is a prerequisite for utilizing any CET features, most of which also require setting MSRs. Kernel IBT already enables the CET CR4 bit when it detects IBT HW support and is configured with kernel IBT. However, future patches that enable userspace shadow stack support will need the bit set as well. So change the logic to enable it in either case. Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 80710a68ef7d..3ea06b0b4570 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -601,27 +601,43 @@ __noendbr void ibt_restore(u64 save) =20 static __always_inline void setup_cet(struct cpuinfo_x86 *c) { - u64 msr =3D CET_ENDBR_EN; + bool user_shstk, kernel_ibt; =20 - if (!HAS_KERNEL_IBT || - !cpu_feature_enabled(X86_FEATURE_IBT)) + if (!IS_ENABLED(CONFIG_X86_CET)) return; =20 - wrmsrl(MSR_IA32_S_CET, msr); + kernel_ibt =3D HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT); + user_shstk =3D cpu_feature_enabled(X86_FEATURE_SHSTK) && + IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK); + + if (!kernel_ibt && !user_shstk) + return; + + if (user_shstk) + set_cpu_cap(c, X86_FEATURE_USER_SHSTK); + + if (kernel_ibt) + wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN); + else + wrmsrl(MSR_IA32_S_CET, 0); + cr4_set_bits(X86_CR4_CET); =20 - if (!ibt_selftest()) { + if (kernel_ibt && !ibt_selftest()) { pr_err("IBT selftest: Failed!\n"); wrmsrl(MSR_IA32_S_CET, 0); setup_clear_cpu_cap(X86_FEATURE_IBT); - return; } } =20 __noendbr void cet_disable(void) { - if (cpu_feature_enabled(X86_FEATURE_IBT)) - wrmsrl(MSR_IA32_S_CET, 0); + if (!(cpu_feature_enabled(X86_FEATURE_IBT) || + cpu_feature_enabled(X86_FEATURE_SHSTK))) + return; + + wrmsrl(MSR_IA32_S_CET, 0); + wrmsrl(MSR_IA32_U_CET, 0); } =20 /* @@ -1483,6 +1499,9 @@ static void __init cpu_parse_early_param(void) if (cmdline_find_option_bool(boot_command_line, "noxsaves")) setup_clear_cpu_cap(X86_FEATURE_XSAVES); =20 + if (cmdline_find_option_bool(boot_command_line, "nousershstk")) + setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK); + arglen =3D cmdline_find_option(boot_command_line, "clearcpuid", arg, size= of(arg)); if (arglen <=3D 0) return; --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2D87CC88CB2 for ; Tue, 13 Jun 2023 00:19:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239621AbjFMATU (ORCPT ); Mon, 12 Jun 2023 20:19:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56358 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238592AbjFMARw (ORCPT ); Mon, 12 Jun 2023 20:17:52 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6CA783AA5; Mon, 12 Jun 2023 17:14:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615258; x=1718151258; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=BODs5YgrIYiUG9wa8tbCiXPQczkVVrNo7gO7I9Fm7W8=; b=CBsEWX+sWwTDuKBnaqL9IjQlVpnalydt+qBdk18c2u1/DuppBWs7dRI4 kPA+kSLM5BetH/RprxPBKAf06anLwIz8REUH6WexByMmJMF9aRcU7aRsE We7pR6MoH6W/kkZs7n+RQjHNPD1R1WYjqNmNeIzlvYF9HDKkpx10jDi7c js4MH6CLuRHqh1kZ7r9CQwpbZLi888C5Ne27VwN05q+rfPJqZrmHNmD90 /BiEuICecypWdp4bepLx/OkQ+oBv/u7mTDi4exk+p/ujqAfO8fyS3k94T 3LGmSfYXI8cbRW/j0f8fO6CJtwDv4pVPFmM86pRwUuskjIiqTZ5x1LpYV A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557588" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557588" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:42 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671173" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671173" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:41 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 39/42] selftests/x86: Add shadow stack test Date: Mon, 12 Jun 2023 17:11:05 -0700 Message-Id: <20230613001108.3040476-40-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Add a simple selftest for exercising some shadow stack behavior: - map_shadow_stack syscall and pivot - Faulting in shadow stack memory - Handling shadow stack violations - GUP of shadow stack memory - mprotect() of shadow stack memory - Userfaultfd on shadow stack memory - 32 bit segmentation - Guard gap test - Ptrace test Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Fix race in userfaultfd test - Add guard gap test - Add ptrace test --- tools/testing/selftests/x86/Makefile | 2 +- .../testing/selftests/x86/test_shadow_stack.c | 884 ++++++++++++++++++ 2 files changed, 885 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests= /x86/Makefile index 598135d3162b..7e8c937627dd 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY :=3D entry_from_vm86 test_syscall_vd= so unwind_vdso \ test_FCMOV test_FCOMI test_FISTTP \ vdso_restorer TARGETS_C_64BIT_ONLY :=3D fsgsbase sysret_rip syscall_numbering \ - corrupt_xstate_header amx lam + corrupt_xstate_header amx lam test_shadow_stack # Some selftests require 32bit support enabled also on 64bit systems TARGETS_C_32BIT_NEEDED :=3D ldt_gdt ptrace_syscall =20 diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testin= g/selftests/x86/test_shadow_stack.c new file mode 100644 index 000000000000..5ab788dc1ce6 --- /dev/null +++ b/tools/testing/selftests/x86/test_shadow_stack.c @@ -0,0 +1,884 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program test's basic kernel shadow stack support. It enables shadow + * stack manual via the arch_prctl(), instead of relying on glibc. It's + * Makefile doesn't compile with shadow stack support, so it doesn't rely = on + * any particular glibc. As a result it can't do any operations that requi= re + * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just + * stick to the basics and hope the compiler doesn't do anything strange. + */ + +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * Define the ABI defines if needed, so people can run the tests + * without building the headers. + */ +#ifndef __NR_map_shadow_stack +#define __NR_map_shadow_stack 451 + +#define SHADOW_STACK_SET_TOKEN (1ULL << 0) + +#define ARCH_SHSTK_ENABLE 0x5001 +#define ARCH_SHSTK_DISABLE 0x5002 +#define ARCH_SHSTK_LOCK 0x5003 +#define ARCH_SHSTK_UNLOCK 0x5004 +#define ARCH_SHSTK_STATUS 0x5005 + +#define ARCH_SHSTK_SHSTK (1ULL << 0) +#define ARCH_SHSTK_WRSS (1ULL << 1) + +#define NT_X86_SHSTK 0x204 +#endif + +#define SS_SIZE 0x200000 +#define PAGE_SIZE 0x1000 + +#if (__GNUC__ < 8) || (__GNUC__ =3D=3D 8 && __GNUC_MINOR__ < 5) +int main(int argc, char *argv[]) +{ + printf("[SKIP]\tCompiler does not support CET.\n"); + return 0; +} +#else +void write_shstk(unsigned long *addr, unsigned long val) +{ + asm volatile("wrssq %[val], (%[addr])\n" + : "=3Dm" (addr) + : [addr] "r" (addr), [val] "r" (val)); +} + +static inline unsigned long __attribute__((always_inline)) get_ssp(void) +{ + unsigned long ret =3D 0; + + asm volatile("xor %0, %0; rdsspq %0" : "=3Dr" (ret)); + return ret; +} + +/* + * For use in inline enablement of shadow stack. + * + * The program can't return from the point where shadow stack gets enabled + * because there will be no address on the shadow stack. So it can't use + * syscall() for enablement, since it is a function. + * + * Based on code from nolibc.h. Keep a copy here because this can't pull i= n all + * of nolibc.h. + */ +#define ARCH_PRCTL(arg1, arg2) \ +({ \ + long _ret; \ + register long _num asm("eax") =3D __NR_arch_prctl; \ + register long _arg1 asm("rdi") =3D (long)(arg1); \ + register long _arg2 asm("rsi") =3D (long)(arg2); \ + \ + asm volatile ( \ + "syscall\n" \ + : "=3Da"(_ret) \ + : "r"(_arg1), "r"(_arg2), \ + "0"(_num) \ + : "rcx", "r11", "memory", "cc" \ + ); \ + _ret; \ +}) + +void *create_shstk(void *addr) +{ + return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK= _SET_TOKEN); +} + +void *create_normal_mem(void *addr) +{ + return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); +} + +void free_shstk(void *shstk) +{ + munmap(shstk, SS_SIZE); +} + +int reset_shstk(void *shstk) +{ + return madvise(shstk, SS_SIZE, MADV_DONTNEED); +} + +void try_shstk(unsigned long new_ssp) +{ + unsigned long ssp; + + printf("[INFO]\tnew_ssp =3D %lx, *new_ssp =3D %lx\n", + new_ssp, *((unsigned long *)new_ssp)); + + ssp =3D get_ssp(); + printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp); + + asm volatile("rstorssp (%0)\n":: "r" (new_ssp)); + asm volatile("saveprevssp"); + printf("[INFO]\tssp is now %lx\n", get_ssp()); + + /* Switch back to original shadow stack */ + ssp -=3D 8; + asm volatile("rstorssp (%0)\n":: "r" (ssp)); + asm volatile("saveprevssp"); +} + +int test_shstk_pivot(void) +{ + void *shstk =3D create_shstk(0); + + if (shstk =3D=3D MAP_FAILED) { + printf("[FAIL]\tError creating shadow stack: %d\n", errno); + return 1; + } + try_shstk((unsigned long)shstk + SS_SIZE - 8); + free_shstk(shstk); + + printf("[OK]\tShadow stack pivot\n"); + return 0; +} + +int test_shstk_faults(void) +{ + unsigned long *shstk =3D create_shstk(0); + + /* Read shadow stack, test if it's zero to not get read optimized out */ + if (*shstk !=3D 0) + goto err; + + /* Wrss memory that was already read. */ + write_shstk(shstk, 1); + if (*shstk !=3D 1) + goto err; + + /* Page out memory, so we can wrss it again. */ + if (reset_shstk((void *)shstk)) + goto err; + + write_shstk(shstk, 1); + if (*shstk !=3D 1) + goto err; + + printf("[OK]\tShadow stack faults\n"); + return 0; + +err: + return 1; +} + +unsigned long saved_ssp; +unsigned long saved_ssp_val; +volatile bool segv_triggered; + +void __attribute__((noinline)) violate_ss(void) +{ + saved_ssp =3D get_ssp(); + saved_ssp_val =3D *(unsigned long *)saved_ssp; + + /* Corrupt shadow stack */ + printf("[INFO]\tCorrupting shadow stack\n"); + write_shstk((void *)saved_ssp, 0); +} + +void segv_handler(int signum, siginfo_t *si, void *uc) +{ + printf("[INFO]\tGenerated shadow stack violation successfully\n"); + + segv_triggered =3D true; + + /* Fix shadow stack */ + write_shstk((void *)saved_ssp, saved_ssp_val); +} + +int test_shstk_violation(void) +{ + struct sigaction sa =3D {}; + + sa.sa_sigaction =3D segv_handler; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + segv_triggered =3D false; + + /* Make sure segv_triggered is set before violate_ss() */ + asm volatile("" : : : "memory"); + + violate_ss(); + + signal(SIGSEGV, SIG_DFL); + + printf("[OK]\tShadow stack violation test\n"); + + return !segv_triggered; +} + +/* Gup test state */ +#define MAGIC_VAL 0x12345678 +bool is_shstk_access; +void *shstk_ptr; +int fd; + +void reset_test_shstk(void *addr) +{ + if (shstk_ptr) + free_shstk(shstk_ptr); + shstk_ptr =3D create_shstk(addr); +} + +void test_access_fix_handler(int signum, siginfo_t *si, void *uc) +{ + printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : = "normal write"); + + segv_triggered =3D true; + + /* Fix shadow stack */ + if (is_shstk_access) { + reset_test_shstk(shstk_ptr); + return; + } + + free_shstk(shstk_ptr); + create_normal_mem(shstk_ptr); +} + +bool test_shstk_access(void *ptr) +{ + is_shstk_access =3D true; + segv_triggered =3D false; + write_shstk(ptr, MAGIC_VAL); + + asm volatile("" : : : "memory"); + + return segv_triggered; +} + +bool test_write_access(void *ptr) +{ + is_shstk_access =3D false; + segv_triggered =3D false; + *(unsigned long *)ptr =3D MAGIC_VAL; + + asm volatile("" : : : "memory"); + + return segv_triggered; +} + +bool gup_write(void *ptr) +{ + unsigned long val; + + lseek(fd, (unsigned long)ptr, SEEK_SET); + if (write(fd, &val, sizeof(val)) < 0) + return 1; + + return 0; +} + +bool gup_read(void *ptr) +{ + unsigned long val; + + lseek(fd, (unsigned long)ptr, SEEK_SET); + if (read(fd, &val, sizeof(val)) < 0) + return 1; + + return 0; +} + +int test_gup(void) +{ + struct sigaction sa =3D {}; + int status; + pid_t pid; + + sa.sa_sigaction =3D test_access_fix_handler; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + segv_triggered =3D false; + + fd =3D open("/proc/self/mem", O_RDWR); + if (fd =3D=3D -1) + return 1; + + reset_test_shstk(0); + if (gup_read(shstk_ptr)) + return 1; + if (test_shstk_access(shstk_ptr)) + return 1; + printf("[INFO]\tGup read -> shstk access success\n"); + + reset_test_shstk(0); + if (gup_write(shstk_ptr)) + return 1; + if (test_shstk_access(shstk_ptr)) + return 1; + printf("[INFO]\tGup write -> shstk access success\n"); + + reset_test_shstk(0); + if (gup_read(shstk_ptr)) + return 1; + if (!test_write_access(shstk_ptr)) + return 1; + printf("[INFO]\tGup read -> write access success\n"); + + reset_test_shstk(0); + if (gup_write(shstk_ptr)) + return 1; + if (!test_write_access(shstk_ptr)) + return 1; + printf("[INFO]\tGup write -> write access success\n"); + + close(fd); + + /* COW/gup test */ + reset_test_shstk(0); + pid =3D fork(); + if (!pid) { + fd =3D open("/proc/self/mem", O_RDWR); + if (fd =3D=3D -1) + exit(1); + + if (gup_write(shstk_ptr)) { + close(fd); + exit(1); + } + close(fd); + exit(0); + } + waitpid(pid, &status, 0); + if (WEXITSTATUS(status)) { + printf("[FAIL]\tWrite in child failed\n"); + return 1; + } + if (*(unsigned long *)shstk_ptr =3D=3D MAGIC_VAL) { + printf("[FAIL]\tWrite in child wrote through to shared memory\n"); + return 1; + } + + printf("[INFO]\tCow gup write -> write access success\n"); + + free_shstk(shstk_ptr); + + signal(SIGSEGV, SIG_DFL); + + printf("[OK]\tShadow gup test\n"); + + return 0; +} + +int test_mprotect(void) +{ + struct sigaction sa =3D {}; + + sa.sa_sigaction =3D test_access_fix_handler; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + segv_triggered =3D false; + + /* mprotect a shadow stack as read only */ + reset_test_shstk(0); + if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) { + printf("[FAIL]\tmprotect(PROT_READ) failed\n"); + return 1; + } + + /* try to wrss it and fail */ + if (!test_shstk_access(shstk_ptr)) { + printf("[FAIL]\tShadow stack access to read-only memory succeeded\n"); + return 1; + } + + /* + * The shadow stack was reset above to resolve the fault, make the new one + * read-only. + */ + if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) { + printf("[FAIL]\tmprotect(PROT_READ) failed\n"); + return 1; + } + + /* then back to writable */ + if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) { + printf("[FAIL]\tmprotect(PROT_WRITE) failed\n"); + return 1; + } + + /* then wrss to it and succeed */ + if (test_shstk_access(shstk_ptr)) { + printf("[FAIL]\tShadow stack access to mprotect() writable memory failed= \n"); + return 1; + } + + free_shstk(shstk_ptr); + + signal(SIGSEGV, SIG_DFL); + + printf("[OK]\tmprotect() test\n"); + + return 0; +} + +char zero[4096]; + +static void *uffd_thread(void *arg) +{ + struct uffdio_copy req; + int uffd =3D *(int *)arg; + struct uffd_msg msg; + int ret; + + while (1) { + ret =3D read(uffd, &msg, sizeof(msg)); + if (ret > 0) + break; + else if (errno =3D=3D EAGAIN) + continue; + return (void *)1; + } + + req.dst =3D msg.arg.pagefault.address; + req.src =3D (__u64)zero; + req.len =3D 4096; + req.mode =3D 0; + + if (ioctl(uffd, UFFDIO_COPY, &req)) + return (void *)1; + + return (void *)0; +} + +int test_userfaultfd(void) +{ + struct uffdio_register uffdio_register; + struct uffdio_api uffdio_api; + struct sigaction sa =3D {}; + pthread_t thread; + void *res; + int uffd; + + sa.sa_sigaction =3D test_access_fix_handler; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + uffd =3D syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + printf("[SKIP]\tUserfaultfd unavailable.\n"); + return 0; + } + + reset_test_shstk(0); + + uffdio_api.api =3D UFFD_API; + uffdio_api.features =3D 0; + if (ioctl(uffd, UFFDIO_API, &uffdio_api)) + goto err; + + uffdio_register.range.start =3D (__u64)shstk_ptr; + uffdio_register.range.len =3D 4096; + uffdio_register.mode =3D UFFDIO_REGISTER_MODE_MISSING; + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + goto err; + + if (pthread_create(&thread, NULL, &uffd_thread, &uffd)) + goto err; + + reset_shstk(shstk_ptr); + test_shstk_access(shstk_ptr); + + if (pthread_join(thread, &res)) + goto err; + + if (test_shstk_access(shstk_ptr)) + goto err; + + free_shstk(shstk_ptr); + + signal(SIGSEGV, SIG_DFL); + + if (!res) + printf("[OK]\tUserfaultfd test\n"); + return !!res; +err: + free_shstk(shstk_ptr); + close(uffd); + signal(SIGSEGV, SIG_DFL); + return 1; +} + +/* Simple linked list for keeping track of mappings in test_guard_gap() */ +struct node { + struct node *next; + void *mapping; +}; + +/* + * This tests whether mmap will place other mappings in a shadow stack's g= uard + * gap. The steps are: + * 1. Finds an empty place by mapping and unmapping something. + * 2. Map a shadow stack in the middle of the known empty area. + * 3. Map a bunch of PAGE_SIZE mappings. These will use the search down + * direction, filling any gaps until it encounters the shadow stack's + * guard gap. + * 4. When a mapping lands below the shadow stack from step 2, then all + * of the above gaps are filled. The search down algorithm will have + * looked at the shadow stack gaps. + * 5. See if it landed in the gap. + */ +int test_guard_gap(void) +{ + void *free_area, *shstk, *test_map =3D (void *)0xFFFFFFFFFFFFFFFF; + struct node *head =3D NULL, *cur; + + free_area =3D mmap(0, SS_SIZE * 3, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(free_area, SS_SIZE * 3); + + shstk =3D create_shstk(free_area + SS_SIZE); + if (shstk =3D=3D MAP_FAILED) + return 1; + + while (test_map > shstk) { + test_map =3D mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (test_map =3D=3D MAP_FAILED) + return 1; + cur =3D malloc(sizeof(*cur)); + cur->mapping =3D test_map; + + cur->next =3D head; + head =3D cur; + } + + while (head) { + cur =3D head; + head =3D cur->next; + munmap(cur->mapping, PAGE_SIZE); + free(cur); + } + + free_shstk(shstk); + + if (shstk - test_map - PAGE_SIZE !=3D PAGE_SIZE) + return 1; + + printf("[OK]\tGuard gap test\n"); + + return 0; +} + +/* + * Too complicated to pull it out of the 32 bit header, but also get the + * 64 bit one needed above. Just define a copy here. + */ +#define __NR_compat_sigaction 67 + +/* + * Call 32 bit signal handler to get 32 bit signals ABI. Make sure + * to push the registers that will get clobbered. + */ +int sigaction32(int signum, const struct sigaction *restrict act, + struct sigaction *restrict oldact) +{ + register long syscall_reg asm("eax") =3D __NR_compat_sigaction; + register long signum_reg asm("ebx") =3D signum; + register long act_reg asm("ecx") =3D (long)act; + register long oldact_reg asm("edx") =3D (long)oldact; + int ret =3D 0; + + asm volatile ("int $0x80;" + : "=3Da"(ret), "=3Dm"(oldact) + : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg), + "r"(oldact_reg) + : "r8", "r9", "r10", "r11" + ); + + return ret; +} + +sigjmp_buf jmp_buffer; + +void segv_gp_handler(int signum, siginfo_t *si, void *uc) +{ + segv_triggered =3D true; + + /* + * To work with old glibc, this can't rely on siglongjmp working with + * shadow stack enabled, so disable shadow stack before siglongjmp(). + */ + ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK); + siglongjmp(jmp_buffer, -1); +} + +/* + * Transition to 32 bit mode and check that a #GP triggers a segfault. + */ +int test_32bit(void) +{ + struct sigaction sa =3D {}; + struct sigaction *sa32; + + /* Create sigaction in 32 bit address range */ + sa32 =3D mmap(0, 4096, PROT_READ | PROT_WRITE, + MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); + sa32->sa_flags =3D SA_SIGINFO; + + sa.sa_sigaction =3D segv_gp_handler; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + + segv_triggered =3D false; + + /* Make sure segv_triggered is set before triggering the #GP */ + asm volatile("" : : : "memory"); + + /* + * Set handler to somewhere in 32 bit address space + */ + sa32->sa_handler =3D (void *)sa32; + if (sigaction32(SIGUSR1, sa32, NULL)) + return 1; + + if (!sigsetjmp(jmp_buffer, 1)) + raise(SIGUSR1); + + if (segv_triggered) + printf("[OK]\t32 bit test\n"); + + return !segv_triggered; +} + +void segv_handler_ptrace(int signum, siginfo_t *si, void *uc) +{ + /* The SSP adjustment caused a segfault. */ + exit(0); +} + +int test_ptrace(void) +{ + unsigned long saved_ssp, ssp =3D 0; + struct sigaction sa=3D {}; + struct iovec iov; + int status; + int pid; + + iov.iov_base =3D &ssp; + iov.iov_len =3D sizeof(ssp); + + pid =3D fork(); + if (!pid) { + ssp =3D get_ssp(); + + sa.sa_sigaction =3D segv_handler_ptrace; + sa.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + ptrace(PTRACE_TRACEME, NULL, NULL, NULL); + /* + * The parent will tweak the SSP and return from this function + * will #CP. + */ + raise(SIGTRAP); + + exit(1); + } + + while (waitpid(pid, &status, 0) !=3D -1 && WSTOPSIG(status) !=3D SIGTRAP); + + if (ptrace(PTRACE_GETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tFailed to PTRACE_GETREGS\n"); + goto out_kill; + } + + if (!ssp) { + printf("[INFO]\tPtrace child SSP was 0\n"); + goto out_kill; + } + + saved_ssp =3D ssp; + + iov.iov_len =3D 0; + if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tToo small size accepted via PTRACE_SETREGS\n"); + goto out_kill; + } + + iov.iov_len =3D sizeof(ssp) + 1; + if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tToo large size accepted via PTRACE_SETREGS\n"); + goto out_kill; + } + + ssp +=3D 1; + if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tUnaligned SSP written via PTRACE_SETREGS\n"); + goto out_kill; + } + + ssp =3D 0xFFFFFFFFFFFF0000; + if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tKernel range SSP written via PTRACE_SETREGS\n"); + goto out_kill; + } + + /* + * Tweak the SSP so the child with #CP when it resumes and returns + * from raise() + */ + ssp =3D saved_ssp + 8; + iov.iov_len =3D sizeof(ssp); + if (ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) { + printf("[INFO]\tFailed to PTRACE_SETREGS\n"); + goto out_kill; + } + + if (ptrace(PTRACE_DETACH, pid, NULL, NULL)) { + printf("[INFO]\tFailed to PTRACE_DETACH\n"); + goto out_kill; + } + + waitpid(pid, &status, 0); + if (WEXITSTATUS(status)) + return 1; + + printf("[OK]\tPtrace test\n"); + return 0; + +out_kill: + kill(pid, SIGKILL); + return 1; +} + +int main(int argc, char *argv[]) +{ + int ret =3D 0; + + if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) { + printf("[SKIP]\tCould not enable Shadow stack\n"); + return 1; + } + + if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) { + ret =3D 1; + printf("[FAIL]\tDisabling shadow stack failed\n"); + } + + if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) { + printf("[SKIP]\tCould not re-enable Shadow stack\n"); + return 1; + } + + if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) { + printf("[SKIP]\tCould not enable WRSS\n"); + ret =3D 1; + goto out; + } + + /* Should have succeeded if here, but this is a test, so double check. */ + if (!get_ssp()) { + printf("[FAIL]\tShadow stack disabled\n"); + return 1; + } + + if (test_shstk_pivot()) { + ret =3D 1; + printf("[FAIL]\tShadow stack pivot\n"); + goto out; + } + + if (test_shstk_faults()) { + ret =3D 1; + printf("[FAIL]\tShadow stack fault test\n"); + goto out; + } + + if (test_shstk_violation()) { + ret =3D 1; + printf("[FAIL]\tShadow stack violation test\n"); + goto out; + } + + if (test_gup()) { + ret =3D 1; + printf("[FAIL]\tShadow shadow stack gup\n"); + goto out; + } + + if (test_mprotect()) { + ret =3D 1; + printf("[FAIL]\tShadow shadow mprotect test\n"); + goto out; + } + + if (test_userfaultfd()) { + ret =3D 1; + printf("[FAIL]\tUserfaultfd test\n"); + goto out; + } + + if (test_guard_gap()) { + ret =3D 1; + printf("[FAIL]\tGuard gap test\n"); + goto out; + } + + if (test_ptrace()) { + ret =3D 1; + printf("[FAIL]\tptrace test\n"); + } + + if (test_32bit()) { + ret =3D 1; + printf("[FAIL]\t32 bit test\n"); + goto out; + } + + return ret; + +out: + /* + * Disable shadow stack before the function returns, or there will be a + * shadow stack violation. + */ + if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) { + ret =3D 1; + printf("[FAIL]\tDisabling shadow stack failed\n"); + } + + return ret; +} +#endif --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4B97C7EE43 for ; Tue, 13 Jun 2023 00:19:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239630AbjFMATX (ORCPT ); Mon, 12 Jun 2023 20:19:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56462 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239116AbjFMAR5 (ORCPT ); Mon, 12 Jun 2023 20:17:57 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7AB319BF; Mon, 12 Jun 2023 17:14:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615261; x=1718151261; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YlC9UAklwmTYEYA/IeAnmisSABN50M2A88ovR11LHkA=; b=JM13llmtFLVqxpUU/TalEJO7CymFGuC8WreiN5kmRlThp6PBegZsDYoM NtaYi1vmDEJfgdYPCW+DJJzuFprnxjxPxX4yXQoewOVyUSNJKjFBQzS+U o3MClS5pvD+uBJYfAUVtZcnaJgvbpRuKVxk0faqWbqT16vbNfU2SJUdYg dDR56quH1gYbl1nO9saYAy0F3Gmv9j2HKntdjY7EVuvnLPV9xOOD8hniw PLg8WwDjxfRWk1OA4OJk7uirl5upAnu6NHv2wh2VerHZMhJ3Odt9mhOzi evljsAEjMJL4uafiLMEIVdujI7iTPKhHdkFZHDnmQbVDLp12FAO/I3QD2 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557610" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557610" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:43 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671179" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671179" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:42 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu , Pengfei Xu Subject: [PATCH v9 40/42] x86: Add PTRACE interface for shadow stack Date: Mon, 12 Jun 2023 17:11:06 -0700 Message-Id: <20230613001108.3040476-41-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Some applications (like GDB) would like to tweak shadow stack state via ptrace. This allows for existing functionality to continue to work for seized shadow stack applications. Provide a regset interface for manipulating the shadow stack pointer (SSP). There is already ptrace functionality for accessing xstate, but this does not include supervisor xfeatures. So there is not a completely clear place for where to put the shadow stack state. Adding it to the user xfeatures regset would complicate that code, as it currently shares logic with signals which should not have supervisor features. Don't add a general supervisor xfeature regset like the user one, because it is better to maintain flexibility for other supervisor xfeatures to define their own interface. For example, an xfeature may decide not to expose all of it's state to userspace, as is actually the case for shadow stack ptrace functionality. A lot of enum values remain to be used, so just put it in dedicated shadow stack regset. The only downside to not having a generic supervisor xfeature regset, is that apps need to be enlightened of any new supervisor xfeature exposed this way (i.e. they can't try to have generic save/restore logic). But maybe that is a good thing, because they have to think through each new xfeature instead of encountering issues when a new supervisor xfeature was added. By adding a shadow stack regset, it also has the effect of including the shadow stack state in a core dump, which could be useful for debugging. The shadow stack specific xstate includes the SSP, and the shadow stack and WRSS enablement status. Enabling shadow stack or WRSS in the kernel involves more than just flipping the bit. The kernel is made aware that it has to do extra things when cloning or handling signals. That logic is triggered off of separate feature enablement state kept in the task struct. So the flipping on HW shadow stack enforcement without notifying the kernel to change its behavior would severely limit what an application could do without crashing, and the results would depend on kernel internal implementation details. There is also no known use for controlling this state via ptrace today. So only expose the SSP, which is something that userspace already has indirect control over. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- v9: - Squash "Enforce only whole copies for ssp_set()" fix that previously was in tip. --- arch/x86/include/asm/fpu/regset.h | 7 +-- arch/x86/kernel/fpu/regset.c | 81 +++++++++++++++++++++++++++++++ arch/x86/kernel/ptrace.c | 12 +++++ include/uapi/linux/elf.h | 2 + 4 files changed, 99 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/r= egset.h index 4f928d6a367b..697b77e96025 100644 --- a/arch/x86/include/asm/fpu/regset.h +++ b/arch/x86/include/asm/fpu/regset.h @@ -7,11 +7,12 @@ =20 #include =20 -extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_a= ctive; +extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_a= ctive, + ssp_active; extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get, - xstateregs_get; + xstateregs_get, ssp_get; extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set, - xstateregs_set; + xstateregs_set, ssp_set; =20 /* * xstateregs_active =3D=3D regset_fpregs_active. Please refer to the comm= ent diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c index 6d056b68f4ed..6bc1eb2a21bd 100644 --- a/arch/x86/kernel/fpu/regset.c +++ b/arch/x86/kernel/fpu/regset.c @@ -8,6 +8,7 @@ #include #include #include +#include =20 #include "context.h" #include "internal.h" @@ -174,6 +175,86 @@ int xstateregs_set(struct task_struct *target, const s= truct user_regset *regset, return ret; } =20 +#ifdef CONFIG_X86_USER_SHADOW_STACK +int ssp_active(struct task_struct *target, const struct user_regset *regse= t) +{ + if (target->thread.features & ARCH_SHSTK_SHSTK) + return regset->n; + + return 0; +} + +int ssp_get(struct task_struct *target, const struct user_regset *regset, + struct membuf to) +{ + struct fpu *fpu =3D &target->thread.fpu; + struct cet_user_state *cetregs; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -ENODEV; + + sync_fpstate(fpu); + cetregs =3D get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER); + if (WARN_ON(!cetregs)) { + /* + * This shouldn't ever be NULL because shadow stack was + * verified to be enabled above. This means + * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so + * XFEATURE_CET_USER should not be in the init state. + */ + return -ENODEV; + } + + return membuf_write(&to, (unsigned long *)&cetregs->user_ssp, + sizeof(cetregs->user_ssp)); +} + +int ssp_set(struct task_struct *target, const struct user_regset *regset, + unsigned int pos, unsigned int count, + const void *kbuf, const void __user *ubuf) +{ + struct fpu *fpu =3D &target->thread.fpu; + struct xregs_state *xsave =3D &fpu->fpstate->regs.xsave; + struct cet_user_state *cetregs; + unsigned long user_ssp; + int r; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || + !ssp_active(target, regset)) + return -ENODEV; + + if (pos !=3D 0 || count !=3D sizeof(user_ssp)) + return -EINVAL; + + r =3D user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1); + if (r) + return r; + + /* + * Some kernel instructions (IRET, etc) can cause exceptions in the case + * of disallowed CET register values. Just prevent invalid values. + */ + if (user_ssp >=3D TASK_SIZE_MAX || !IS_ALIGNED(user_ssp, 8)) + return -EINVAL; + + fpu_force_restore(fpu); + + cetregs =3D get_xsave_addr(xsave, XFEATURE_CET_USER); + if (WARN_ON(!cetregs)) { + /* + * This shouldn't ever be NULL because shadow stack was + * verified to be enabled above. This means + * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so + * XFEATURE_CET_USER should not be in the init state. + */ + return -ENODEV; + } + + cetregs->user_ssp =3D user_ssp; + return 0; +} +#endif /* CONFIG_X86_USER_SHADOW_STACK */ + #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION =20 /* diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index dfaa270a7cc9..095f04bdabdc 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -58,6 +58,7 @@ enum x86_regset_64 { REGSET64_FP, REGSET64_IOPERM, REGSET64_XSTATE, + REGSET64_SSP, }; =20 #define REGSET_GENERAL \ @@ -1267,6 +1268,17 @@ static struct user_regset x86_64_regsets[] __ro_afte= r_init =3D { .active =3D ioperm_active, .regset_get =3D ioperm_get }, +#ifdef CONFIG_X86_USER_SHADOW_STACK + [REGSET64_SSP] =3D { + .core_note_type =3D NT_X86_SHSTK, + .n =3D 1, + .size =3D sizeof(u64), + .align =3D sizeof(u64), + .active =3D ssp_active, + .regset_get =3D ssp_get, + .set =3D ssp_set + }, +#endif }; =20 static const struct user_regset_view user_x86_64_view =3D { diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index ac3da855fb19..fa1ceeae2596 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -406,6 +406,8 @@ typedef struct elf64_shdr { #define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */ #define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=3Ddeny) */ #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ +/* Old binutils treats 0x203 as a CET state */ +#define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A2CABC88CB2 for ; Tue, 13 Jun 2023 00:20:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239021AbjFMAUp (ORCPT ); Mon, 12 Jun 2023 20:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53332 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239426AbjFMAS3 (ORCPT ); Mon, 12 Jun 2023 20:18:29 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2D3A53C15; Mon, 12 Jun 2023 17:14:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615272; x=1718151272; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=hFHVnW9lxN9HUD3QIC9fY4x2IPOgeGJUGkzddQmtaZw=; b=MqwCNVj5T/NU7ZoDUQmmzV1sTTUzvs7OWVUAWXwB+/0uMoVL6Xyjoj/8 u/vgj+yaLVKVIBWmIOeXda8u7r+GW1RDCCrouC0JN8b95OK7/xKlzQz6l gpWjxA/uSlMNLmJf42V4ehtdtwZRFLapnariIo3Nm/Velx9WLADXonBsh WfIN4reP9wjmitmF/RUCzkFceAZJd0i5YjBBnaDZLZMe22W3AB9P2tyUl mGbF5cnFuVUF1c0SVwP2LWySl6yiMM+Iu5CFpxFpxZKGoaUyrVXxDKMsS 0+cMsfuA1Bh9wnPTbGkCmlpJGh0HhJvS9SSIZt3+CJcdIqdwpYczATJop A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557635" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557635" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:44 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671183" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671183" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:43 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Mike Rapoport , Pengfei Xu Subject: [PATCH v9 41/42] x86/shstk: Add ARCH_SHSTK_UNLOCK Date: Mon, 12 Jun 2023 17:11:07 -0700 Message-Id: <20230613001108.3040476-42-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Mike Rapoport Userspace loaders may lock features before a CRIU restore operation has the chance to set them to whatever state is required by the process being restored. Allow a way for CRIU to unlock features. Add it as an arch_prctl() like the other shadow stack operations, but restrict it being called by the ptrace arch_pctl() interface. [Merged into recent API changes, added commit log and docs] Signed-off-by: Mike Rapoport Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- Documentation/arch/x86/shstk.rst | 4 ++++ arch/x86/include/uapi/asm/prctl.h | 1 + arch/x86/kernel/process_64.c | 1 + arch/x86/kernel/shstk.c | 9 +++++++-- 4 files changed, 13 insertions(+), 2 deletions(-) diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shst= k.rst index f09afa504ec0..f3553cc8c758 100644 --- a/Documentation/arch/x86/shstk.rst +++ b/Documentation/arch/x86/shstk.rst @@ -75,6 +75,10 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features) are ignored. The mask is ORed with the existing value. So any feature = bits set here cannot be enabled or disabled afterwards. =20 +arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features) + Unlock features. 'features' is a mask of all features to unlock. All + bits set are processed, unset bits are ignored. Only works via ptrace. + The return values are as follows. On success, return 0. On error, errno can be:: =20 diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/= prctl.h index eedfde3b63be..3189c4a96468 100644 --- a/arch/x86/include/uapi/asm/prctl.h +++ b/arch/x86/include/uapi/asm/prctl.h @@ -33,6 +33,7 @@ #define ARCH_SHSTK_ENABLE 0x5001 #define ARCH_SHSTK_DISABLE 0x5002 #define ARCH_SHSTK_LOCK 0x5003 +#define ARCH_SHSTK_UNLOCK 0x5004 =20 /* ARCH_SHSTK_ features bits */ #define ARCH_SHSTK_SHSTK (1ULL << 0) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 0f89aa0186d1..e6db21c470aa 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -899,6 +899,7 @@ long do_arch_prctl_64(struct task_struct *task, int opt= ion, unsigned long arg2) case ARCH_SHSTK_ENABLE: case ARCH_SHSTK_DISABLE: case ARCH_SHSTK_LOCK: + case ARCH_SHSTK_UNLOCK: return shstk_prctl(task, option, arg2); default: ret =3D -EINVAL; diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index d723cdc93474..d43b7a9c57ce 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -489,9 +489,14 @@ long shstk_prctl(struct task_struct *task, int option,= unsigned long features) return 0; } =20 - /* Don't allow via ptrace */ - if (task !=3D current) + /* Only allow via ptrace */ + if (task !=3D current) { + if (option =3D=3D ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_REST= ORE)) { + task->thread.features_locked &=3D ~features; + return 0; + } return -EINVAL; + } =20 /* Do not allow to change locked features */ if (features & task->thread.features_locked) --=20 2.34.1 From nobody Sat May 18 09:01:20 2024 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3E86CC7EE43 for ; Tue, 13 Jun 2023 00:20:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239183AbjFMAUr (ORCPT ); Mon, 12 Jun 2023 20:20:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239447AbjFMASa (ORCPT ); Mon, 12 Jun 2023 20:18:30 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1AC003C1B; Mon, 12 Jun 2023 17:14:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686615273; x=1718151273; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MIQ1pWvE/RoBE8pKawMdEBQM6Q57ao+KDpsn7QGopL8=; b=QQ9X1yvEzmGzuyIHxFih972HlZMFv/ydI62a38WGtKVXsRctmlFgC+kM x3p9Z6E/pTCwhwVTuvamy4j4G5YxtLsz00mec9bkGVisvezERjAE41ny4 LnBYVirsMrB0DN3IDKRm22i45rN9qc/Y2JUBNq/5hNTr6sNIeHfk9tykd Keztd6faDV/GqJ9qKUFLDM1Mw/KdMEmTpX+m6yqxoYYq8r6jceUZSSkzN Ym0hnGOxRLue5gmD9IRkNGRu6SecQmNMWoc70nxt8a748hJK03HgC4Elz kjamzrkpqYdlaas38sZCXKCHBKh1A7IJh7U3cIQXp/cvip05wXKS5fGPE A==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="361557660" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="361557660" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:44 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="835671189" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="835671189" Received: from almeisch-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.amr.corp.intel.com) ([10.209.42.242]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 17:12:43 -0700 From: Rick Edgecombe To: x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , Weijiang Yang , "Kirill A . Shutemov" , John Allen , kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com, szabolcs.nagy@arm.com, torvalds@linux-foundation.org, broonie@kernel.org Cc: rick.p.edgecombe@intel.com, Pengfei Xu Subject: [PATCH v9 42/42] x86/shstk: Add ARCH_SHSTK_STATUS Date: Mon, 12 Jun 2023 17:11:08 -0700 Message-Id: <20230613001108.3040476-43-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> References: <20230613001108.3040476-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" CRIU and GDB need to get the current shadow stack and WRSS enablement status. This information is already available via /proc/pid/status, but this is inconvenient for CRIU because it involves parsing the text output in an area of the code where this is difficult. Provide a status arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a userspace address, and make the new arch_prctl simply copy the features out to userspace. Suggested-by: Mike Rapoport Signed-off-by: Rick Edgecombe Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook --- Documentation/arch/x86/shstk.rst | 6 ++++++ arch/x86/include/asm/shstk.h | 2 +- arch/x86/include/uapi/asm/prctl.h | 1 + arch/x86/kernel/process_64.c | 1 + arch/x86/kernel/shstk.c | 8 +++++++- 5 files changed, 16 insertions(+), 2 deletions(-) diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shst= k.rst index f3553cc8c758..60260e809baf 100644 --- a/Documentation/arch/x86/shstk.rst +++ b/Documentation/arch/x86/shstk.rst @@ -79,6 +79,11 @@ arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features) Unlock features. 'features' is a mask of all features to unlock. All bits set are processed, unset bits are ignored. Only works via ptrace. =20 +arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr) + Copy the currently enabled features to the address passed in addr. The + features are described using the bits passed into the others in + 'features'. + The return values are as follows. On success, return 0. On error, errno can be:: =20 @@ -86,6 +91,7 @@ be:: -ENOTSUPP if the feature is not supported by the hardware or kernel. -EINVAL arguments (non existing feature, etc) + -EFAULT if could not copy information back to userspace =20 The feature's bits supported are:: =20 diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h index ecb23a8ca47d..42fee8959df7 100644 --- a/arch/x86/include/asm/shstk.h +++ b/arch/x86/include/asm/shstk.h @@ -14,7 +14,7 @@ struct thread_shstk { u64 size; }; =20 -long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res); +long shstk_prctl(struct task_struct *task, int option, unsigned long arg2); void reset_thread_features(void); unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned lon= g clone_flags, unsigned long stack_size); diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/= prctl.h index 3189c4a96468..384e2cc6ac19 100644 --- a/arch/x86/include/uapi/asm/prctl.h +++ b/arch/x86/include/uapi/asm/prctl.h @@ -34,6 +34,7 @@ #define ARCH_SHSTK_DISABLE 0x5002 #define ARCH_SHSTK_LOCK 0x5003 #define ARCH_SHSTK_UNLOCK 0x5004 +#define ARCH_SHSTK_STATUS 0x5005 =20 /* ARCH_SHSTK_ features bits */ #define ARCH_SHSTK_SHSTK (1ULL << 0) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index e6db21c470aa..33b268747bb7 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -900,6 +900,7 @@ long do_arch_prctl_64(struct task_struct *task, int opt= ion, unsigned long arg2) case ARCH_SHSTK_DISABLE: case ARCH_SHSTK_LOCK: case ARCH_SHSTK_UNLOCK: + case ARCH_SHSTK_STATUS: return shstk_prctl(task, option, arg2); default: ret =3D -EINVAL; diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index d43b7a9c57ce..b26810c7cd1c 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -482,8 +482,14 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr,= unsigned long, size, unsi return alloc_shstk(addr, aligned_size, size, set_tok); } =20 -long shstk_prctl(struct task_struct *task, int option, unsigned long featu= res) +long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) { + unsigned long features =3D arg2; + + if (option =3D=3D ARCH_SHSTK_STATUS) { + return put_user(task->thread.features, (unsigned long __user *)arg2); + } + if (option =3D=3D ARCH_SHSTK_LOCK) { task->thread.features_locked |=3D features; return 0; --=20 2.34.1