For native, the choice of PTE is fine. There's real memory backing the
non-present PTE. However, for XenPV, Xen complains:
(XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
To explain, some background on XenPV pagetables:
Xen PV guests control their own pagetables; they choose the new PTE
value, and use hypercalls to make changes so Xen can audit for safety.
In addition to a regular reference count, Xen also maintains a type
reference count. e.g. SegDesc (referenced by vGDT/vLDT),
Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
lower pagetable level). This is to prevent e.g. a page for which the
guest has a writable mapping being inserted into the pagetables.
For non-present mappings, all other bits become software accessible, and
typically contain metadata rather than a real frame address. There is nothing
that a reference count could sensibly be tied to. As such, even if Xen
could recognise the address as currently safe, nothing would prevent that
frame from changing owner to another VM in the future.
When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by activating
shadow paging. This is normally only used for the live phase of
migration, and comes with a reasonable overhead.
KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
about the value for non-present mappings. Use a fully inverted PTE, to
avoid hitting the slow path when running under Xen.
While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, halve the number of PVOps callouts, and
skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases too.
Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Alexander Potapenko <glider@google.com>
CC: Marco Elver <elver@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: kasan-dev@googlegroups.com
CC: linux-kernel@vger.kernel.org
v1:
* First public posting. This went to security@ first just in case, and
then I got distracted with other things ahead of public posting.
---
arch/x86/include/asm/kfence.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index ff5c7134a37a..acf9ffa1a171 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -42,10 +42,34 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
+ pteval_t val;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
+ val = pte_val(*pte);
+
+ /*
+ * protect requires making the page not-present. If the PTE is
+ * already in the right state, there's nothing to do.
+ */
+ if (protect != !!(val & _PAGE_PRESENT))
+ return true;
+
+ /*
+ * Otherwise, invert the entire PTE. This avoids writing out an
+ * L1TF-vulnerable PTE (not present, without the high address bits
+ * set).
+ */
+ set_pte(pte, __pte(~val));
+
+ /*
+ * If the page was protected (non-present) and we're making it
+ * present, there is no need to flush the TLB at all.
+ */
+ if (!protect)
+ return true;
+
/*
* We need to avoid IPIs, as we may get KFENCE allocations or faults
* with interrupts disabled. Therefore, the below is best-effort, and
@@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
* lazy fault handling takes care of faults after the page is PRESENT.
*/
- if (protect)
- set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
- else
- set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
-
/*
* Flush this CPU's TLB, assuming whoever did the allocation/free is
* likely to continue running on this CPU.
base-commit: 7f98ab9da046865d57c102fd3ca9669a29845f67
--
2.39.5
On Tue, 6 Jan 2026 18:04:26 +0000 Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> For native, the choice of PTE is fine. There's real memory backing the
> non-present PTE. However, for XenPV, Xen complains:
>
> (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
>
> To explain, some background on XenPV pagetables:
>
> Xen PV guests control their own pagetables; they choose the new PTE
> value, and use hypercalls to make changes so Xen can audit for safety.
>
> In addition to a regular reference count, Xen also maintains a type
> reference count. e.g. SegDesc (referenced by vGDT/vLDT),
> Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
> lower pagetable level). This is to prevent e.g. a page for which the
> guest has a writable mapping being inserted into the pagetables.
>
> For non-present mappings, all other bits become software accessible, and
> typically contain metadata rather than a real frame address. There is nothing
> that a reference count could sensibly be tied to. As such, even if Xen
> could recognise the address as currently safe, nothing would prevent that
> frame from changing owner to another VM in the future.
>
> When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by activating
> shadow paging. This is normally only used for the live phase of
> migration, and comes with a reasonable overhead.
>
> KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
> about the value for non-present mappings. Use a fully inverted PTE, to
> avoid hitting the slow path when running under Xen.
>
> While adjusting the logic, take the opportunity to skip all actions if the
> PTE is already in the right state, halve the number of PVOps callouts, and
> skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases too.
>
> Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Seems that I sent 1dc0da6e9ec0 upstream so thanks, I'll grab this. If
an x86 person chooses to handle it then I'll drop the mm.git version.
I'll add a cc:stable to the mm.git copy, just to be sure.
> Tested-by: Marco Elver <elver@google.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
That "^---$" tells tooling "changelog stops here".
> CC: Alexander Potapenko <glider@google.com>
> CC: Marco Elver <elver@google.com>
> CC: Dmitry Vyukov <dvyukov@google.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: x86@kernel.org
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Jann Horn <jannh@google.com>
> CC: kasan-dev@googlegroups.com
> CC: linux-kernel@vger.kernel.org
>
> v1:
> * First public posting. This went to security@ first just in case, and
> then I got distracted with other things ahead of public posting.
> ---
That "^---$" would be better placed above the versioning info.
>
> ...
>
Hi All,
I am reporting a boot regression in v6.19-rc7 on an x86_32
environment. The kernel hangs immediately after "Booting the kernel"
and does not produce any early console output.
A git bisect identified the following commit as the first bad commit:
b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Environment and Config:
- Guest Arch: x86_32 (one of my test VMs)
- Memory Config: # CONFIG_X86_PAE is not set
- KFENCE Config: CONFIG_KFENCE=y
- Host/Hypervisor: x86_64 host running KVM
The system fails to boot at a very early stage. I have confirmed that
reverting commit b505f1944535 on top of v6.19-rc7 completely resolves
the issue, and the kernel boots normally.
Could you please verify if this change is compatible with x86_32
(non-PAE) configurations?
I am happy to provide my full .config or test any potential fixes.
Best regards,
Ryusuke Konishi
On Tue, 27 Jan 2026 04:07:04 +0900 Ryusuke Konishi <konishi.ryusuke@gmail.com> wrote:
> Hi All,
>
> I am reporting a boot regression in v6.19-rc7 on an x86_32
> environment. The kernel hangs immediately after "Booting the kernel"
> and does not produce any early console output.
>
> A git bisect identified the following commit as the first bad commit:
> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Thanks. b505f1944535 had cc:stable so let's add some cc's to alert
-stable maintainers.
I see that b505f1944535 prevented a Xen warning, but did it have any
other runtime effects? If not, a prompt revert may be the way to
proceed for now.
> Environment and Config:
> - Guest Arch: x86_32 (one of my test VMs)
> - Memory Config: # CONFIG_X86_PAE is not set
> - KFENCE Config: CONFIG_KFENCE=y
> - Host/Hypervisor: x86_64 host running KVM
>
> The system fails to boot at a very early stage. I have confirmed that
> reverting commit b505f1944535 on top of v6.19-rc7 completely resolves
> the issue, and the kernel boots normally.
>
> Could you please verify if this change is compatible with x86_32
> (non-PAE) configurations?
> I am happy to provide my full .config or test any potential fixes.
>
> Best regards,
> Ryusuke Konishi
On 1/26/26 12:24, Andrew Morton wrote:
> I see that b505f1944535 prevented a Xen warning, but did it have any
> other runtime effects? If not, a prompt revert may be the way to
> proceed for now.

Yeah, that's fine.

At the same time ... KFENCE folks: I wonder if you've been testing on
highmem and/or 32-bit x86 builds, or if there's much value to keeping
KFENCE maintained there.
On Tue, Jan 27, 2026 at 04:07:04AM +0900, Ryusuke Konishi wrote:
> Hi All,
>
> I am reporting a boot regression in v6.19-rc7 on an x86_32
> environment. The kernel hangs immediately after "Booting the kernel"
> and does not produce any early console output.
>
> A git bisect identified the following commit as the first bad commit:
> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
I can confirm the same - my 32-bit laptop experiences the same. The guest
splat looks like this:
[ 0.173437] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.175172] ------------[ cut here ]------------
[ 0.176066] kernel BUG at arch/x86/mm/physaddr.c:70!
[ 0.177037] Oops: invalid opcode: 0000 [#1] SMP
[ 0.177914] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc7+ #1 PREEMPT(full)
[ 0.179509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.181363] EIP: __phys_addr+0x78/0x90
[ 0.182089] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
[ 0.185723] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
[ 0.186972] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
[ 0.188182] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
[ 0.189503] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
[ 0.191045] Call Trace:
[ 0.191518] kfence_init+0x3a/0x94
[ 0.192177] start_kernel+0x4ea/0x62c
[ 0.192894] i386_start_kernel+0x65/0x68
[ 0.193653] startup_32_smp+0x151/0x154
[ 0.194397] Modules linked in:
[ 0.194987] ---[ end trace 0000000000000000 ]---
[ 0.195879] EIP: __phys_addr+0x78/0x90
[ 0.196610] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
[ 0.200231] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
[ 0.201452] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
[ 0.202693] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
[ 0.204011] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
[ 0.205235] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.206897] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 1/26/26 11:54, Borislav Petkov wrote:
> [ 0.173437] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.175172] ------------[ cut here ]------------
> [ 0.176066] kernel BUG at arch/x86/mm/physaddr.c:70!

Take a look at kfence_init_pool_early(). It's riddled with __pa(), which
calls down to __phys_addr() => slow_virt_to_phys().

The plain !present PTE is fine, but the inverted one trips up
slow_virt_to_phys(), I bet. slow_virt_to_phys() only gets called when
highmem is enabled (not when the memory is highmem), which is why this is
blowing up on 32-bit only.

The easiest hack/fix would be to just turn off kfence on 32-bit. I guess
the better fix would be to make kfence do its __pa() before it mucks with
the PTEs. The other option would be to either comprehend or ignore those
inverted PTEs. Ugh.
On 26/01/2026 7:54 pm, Borislav Petkov wrote:
> On Tue, Jan 27, 2026 at 04:07:04AM +0900, Ryusuke Konishi wrote:
>> Hi All,
>>
>> I am reporting a boot regression in v6.19-rc7 on an x86_32
>> environment. The kernel hangs immediately after "Booting the kernel"
>> and does not produce any early console output.
>>
>> A git bisect identified the following commit as the first bad commit:
>> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
> I can confirm the same - my 32-bit laptop experiences the same. The guest
> splat looks like this:
>
> [ 0.173437] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.175172] ------------[ cut here ]------------
> [ 0.176066] kernel BUG at arch/x86/mm/physaddr.c:70!
> [ 0.177037] Oops: invalid opcode: 0000 [#1] SMP
> [ 0.177914] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc7+ #1 PREEMPT(full)
> [ 0.179509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 0.181363] EIP: __phys_addr+0x78/0x90
> [ 0.182089] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
> [ 0.185723] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
> [ 0.186972] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
> [ 0.188182] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
> [ 0.189503] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
> [ 0.191045] Call Trace:
> [ 0.191518] kfence_init+0x3a/0x94
> [ 0.192177] start_kernel+0x4ea/0x62c
> [ 0.192894] i386_start_kernel+0x65/0x68
> [ 0.193653] startup_32_smp+0x151/0x154
> [ 0.194397] Modules linked in:
> [ 0.194987] ---[ end trace 0000000000000000 ]---
> [ 0.195879] EIP: __phys_addr+0x78/0x90
> [ 0.196610] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
> [ 0.200231] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
> [ 0.201452] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
> [ 0.202693] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
> [ 0.204011] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
> [ 0.205235] Kernel panic - not syncing: Attempted to kill the idle task!
> [ 0.206897] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Ok, we're hitting a BUG, not a TLB flushing problem. That's:
BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
so it's obviously to do with the inverted pte. pgtable-2level.h has
/* No inverted PFNs on 2 level page tables */
and that was definitely an oversight on my behalf. Sorry.
Does this help?
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index acf9ffa1a171..310e0193d731 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
- pteval_t val;
+ pteval_t val, new;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
@@ -61,7 +61,8 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
* L1TF-vulnerable PTE (not present, without the high address bits
* set).
*/
- set_pte(pte, __pte(~val));
+ new = val ^ _PAGE_PRESENT;
+ set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
/*
* If the page was protected (non-present) and we're making it
Only compile tested. flip_protnone_guard() seems to be the helper, which
is a nop on 2-level paging.
~Andrew
On Tue, Jan 27, 2026 at 5:22 AM Andrew Cooper wrote:
>
> On 26/01/2026 7:54 pm, Borislav Petkov wrote:
> > On Tue, Jan 27, 2026 at 04:07:04AM +0900, Ryusuke Konishi wrote:
> >> Hi All,
> >>
> >> I am reporting a boot regression in v6.19-rc7 on an x86_32
> >> environment. The kernel hangs immediately after "Booting the kernel"
> >> and does not produce any early console output.
> >>
> >> A git bisect identified the following commit as the first bad commit:
> >> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
> > I can confirm the same - my 32-bit laptop experiences the same. The guest
> > splat looks like this:
> >
> > [ 0.173437] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> > [ 0.175172] ------------[ cut here ]------------
> > [ 0.176066] kernel BUG at arch/x86/mm/physaddr.c:70!
> > [ 0.177037] Oops: invalid opcode: 0000 [#1] SMP
> > [ 0.177914] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc7+ #1 PREEMPT(full)
> > [ 0.179509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [ 0.181363] EIP: __phys_addr+0x78/0x90
> > [ 0.182089] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
> > [ 0.185723] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
> > [ 0.186972] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
> > [ 0.188182] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
> > [ 0.189503] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
> > [ 0.191045] Call Trace:
> > [ 0.191518] kfence_init+0x3a/0x94
> > [ 0.192177] start_kernel+0x4ea/0x62c
> > [ 0.192894] i386_start_kernel+0x65/0x68
> > [ 0.193653] startup_32_smp+0x151/0x154
> > [ 0.194397] Modules linked in:
> > [ 0.194987] ---[ end trace 0000000000000000 ]---
> > [ 0.195879] EIP: __phys_addr+0x78/0x90
> > [ 0.196610] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
> > [ 0.200231] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
> > [ 0.201452] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
> > [ 0.202693] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
> > [ 0.204011] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
> > [ 0.205235] Kernel panic - not syncing: Attempted to kill the idle task!
> > [ 0.206897] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>
> Ok, we're hitting a BUG, not a TLB flushing problem. That's:
>
> BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
>
> so it's obviously to do with the inverted pte. pgtable-2level.h has
>
> /* No inverted PFNs on 2 level page tables */
>
> and that was definitely an oversight on my behalf. Sorry.
>
> Does this help?
>
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> index acf9ffa1a171..310e0193d731 100644
> --- a/arch/x86/include/asm/kfence.h
> +++ b/arch/x86/include/asm/kfence.h
> @@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> {
> unsigned int level;
> pte_t *pte = lookup_address(addr, &level);
> - pteval_t val;
> + pteval_t val, new;
>
> if (WARN_ON(!pte || level != PG_LEVEL_4K))
> return false;
> @@ -61,7 +61,8 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> * L1TF-vulnerable PTE (not present, without the high address bits
> * set).
> */
> - set_pte(pte, __pte(~val));
> + new = val ^ _PAGE_PRESENT;
> + set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
>
> /*
> * If the page was protected (non-present) and we're making it
>
>
>
> Only compile tested. flip_protnone_guard() seems to be the helper, which
> is a nop on 2-level paging.
>
> ~Andrew
Yes, after applying this, it started booting.
Leaving aside the discussion of the fix, I'll just share the test
result for now.
Regards,
Ryusuke Konishi
On 26/01/2026 8:41 pm, Ryusuke Konishi wrote:
> On Tue, Jan 27, 2026 at 5:22 AM Andrew Cooper wrote:
>> On 26/01/2026 7:54 pm, Borislav Petkov wrote:
>>> On Tue, Jan 27, 2026 at 04:07:04AM +0900, Ryusuke Konishi wrote:
>>>> Hi All,
>>>>
>>>> I am reporting a boot regression in v6.19-rc7 on an x86_32
>>>> environment. The kernel hangs immediately after "Booting the kernel"
>>>> and does not produce any early console output.
>>>>
>>>> A git bisect identified the following commit as the first bad commit:
>>>> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
>>> I can confirm the same - my 32-bit laptop experiences the same. The guest
>>> splat looks like this:
>>>
>>> [ 0.173437] rcu: srcu_init: Setting srcu_struct sizes based on contention.
>>> [ 0.175172] ------------[ cut here ]------------
>>> [ 0.176066] kernel BUG at arch/x86/mm/physaddr.c:70!
>>> [ 0.177037] Oops: invalid opcode: 0000 [#1] SMP
>>> [ 0.177914] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc7+ #1 PREEMPT(full)
>>> [ 0.179509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>>> [ 0.181363] EIP: __phys_addr+0x78/0x90
>>> [ 0.182089] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
>>> [ 0.185723] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
>>> [ 0.186972] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
>>> [ 0.188182] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
>>> [ 0.189503] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
>>> [ 0.191045] Call Trace:
>>> [ 0.191518] kfence_init+0x3a/0x94
>>> [ 0.192177] start_kernel+0x4ea/0x62c
>>> [ 0.192894] i386_start_kernel+0x65/0x68
>>> [ 0.193653] startup_32_smp+0x151/0x154
>>> [ 0.194397] Modules linked in:
>>> [ 0.194987] ---[ end trace 0000000000000000 ]---
>>> [ 0.195879] EIP: __phys_addr+0x78/0x90
>>> [ 0.196610] Code: 89 c8 5b 5d c3 2e 8d 74 26 00 0f 0b 8d b6 00 00 00 00 89 45 f8 e8 08 a4 1d 00 84 c0 8b 55 f8 74 b0 0f 0b 8d b4 26 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 0f 0b 66 90 8d 74 26 00 2e 8d b4 26 00 00
>>> [ 0.200231] EAX: ce383000 EBX: 00031c7c ECX: 31c7c000 EDX: 034ec000
>>> [ 0.201452] ESI: c1ed3eec EDI: f21fd101 EBP: c2055f78 ESP: c2055f70
>>> [ 0.202693] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210086
>>> [ 0.204011] CR0: 80050033 CR2: ffd98000 CR3: 029cf000 CR4: 00000090
>>> [ 0.205235] Kernel panic - not syncing: Attempted to kill the idle task!
>>> [ 0.206897] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>> Ok, we're hitting a BUG, not a TLB flushing problem. That's:
>>
>> BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
>>
>> so it's obviously to do with the inverted pte. pgtable-2level.h has
>>
>> /* No inverted PFNs on 2 level page tables */
>>
>> and that was definitely an oversight on my behalf. Sorry.
>>
>> Does this help?
>>
>> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
>> index acf9ffa1a171..310e0193d731 100644
>> --- a/arch/x86/include/asm/kfence.h
>> +++ b/arch/x86/include/asm/kfence.h
>> @@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
>> {
>> unsigned int level;
>> pte_t *pte = lookup_address(addr, &level);
>> - pteval_t val;
>> + pteval_t val, new;
>>
>> if (WARN_ON(!pte || level != PG_LEVEL_4K))
>> return false;
>> @@ -61,7 +61,8 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
>> * L1TF-vulnerable PTE (not present, without the high address bits
>> * set).
>> */
>> - set_pte(pte, __pte(~val));
>> + new = val ^ _PAGE_PRESENT;
>> + set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
>>
>> /*
>> * If the page was protected (non-present) and we're making it
>>
>>
>>
>> Only compile tested. flip_protnone_guard() seems to be the helper, which
>> is a nop on 2-level paging.
>>
>> ~Andrew
> Yes, after applying this, it started booting.
> Leaving aside the discussion of the fix, I'll just share the test
> result for now.
Thanks. I'll put together a proper patch.
~Andrew
On 26/01/2026 7:07 pm, Ryusuke Konishi wrote:
> Hi All,
>
> I am reporting a boot regression in v6.19-rc7 on an x86_32
> environment. The kernel hangs immediately after "Booting the kernel"
> and does not produce any early console output.
>
> A git bisect identified the following commit as the first bad commit:
> b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
>
> Environment and Config:
> - Guest Arch: x86_32 (one of my test VMs)
> - Memory Config: # CONFIG_X86_PAE is not set
> - KFENCE Config: CONFIG_KFENCE=y
> - Host/Hypervisor: x86_64 host running KVM
>
> The system fails to boot at a very early stage. I have confirmed that
> reverting commit b505f1944535 on top of v6.19-rc7 completely resolves
> the issue, and the kernel boots normally.
>
> Could you please verify if this change is compatible with x86_32
> (non-PAE) configurations?
> I am happy to provide my full .config or test any potential fixes.
Hmm. To start with, does this fix the crash?
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index acf9ffa1a171..2fe454722e54 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -67,8 +67,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
* If the page was protected (non-present) and we're making it
* present, there is no need to flush the TLB at all.
*/
- if (!protect)
- return true;
/*
* We need to avoid IPIs, as we may get KFENCE allocations or faults
Re-reading, I can't spot anything obvious.
Architecturally, x86 explicitly does not need a TLB flush when turning a
non-present mapping present, and it's strictly 4k leaf mappings we're
handling here.
I wonder if something else is missing a flush, and was being covered by
this.
~Andrew
On Tue, Jan 27, 2026 at 4:39 AM Andrew Cooper wrote:
>
> On 26/01/2026 7:07 pm, Ryusuke Konishi wrote:
> > Hi All,
> >
> > I am reporting a boot regression in v6.19-rc7 on an x86_32
> > environment. The kernel hangs immediately after "Booting the kernel"
> > and does not produce any early console output.
> >
> > A git bisect identified the following commit as the first bad commit:
> > b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
> >
> > Environment and Config:
> > - Guest Arch: x86_32 (one of my test VMs)
> > - Memory Config: # CONFIG_X86_PAE is not set
> > - KFENCE Config: CONFIG_KFENCE=y
> > - Host/Hypervisor: x86_64 host running KVM
> >
> > The system fails to boot at a very early stage. I have confirmed that
> > reverting commit b505f1944535 on top of v6.19-rc7 completely resolves
> > the issue, and the kernel boots normally.
> >
> > Could you please verify if this change is compatible with x86_32
> > (non-PAE) configurations?
> > I am happy to provide my full .config or test any potential fixes.
>
> Hmm. To start with, does this fix the crash?
>
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> index acf9ffa1a171..2fe454722e54 100644
> --- a/arch/x86/include/asm/kfence.h
> +++ b/arch/x86/include/asm/kfence.h
> @@ -67,8 +67,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> * If the page was protected (non-present) and we're making it
> * present, there is no need to flush the TLB at all.
> */
> - if (!protect)
> - return true;
>
> /*
> * We need to avoid IPIs, as we may get KFENCE allocations or faults
>
>
>
> Re-reading, I can't spot anything obvious.
>
> Architecturally, x86 explicitly does not need a TLB flush when turning a
> non-present mapping present, and it's strictly 4k leaf mappings we're
> handling here.
>
> I wonder if something else is missing a flush, and was being covered by
> this.
>
> ~Andrew
I tested this change, but unfortunately the boot hang still occurs.
Regards,
Ryusuke Konishi
The original patch inverted the PTE unconditionally to avoid
L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
paging.
Adjust the logic to use the flip_protnone_guard() helper, which is a nop on
2-level paging but inverts the address bits in all other paging modes.
This doesn't matter for the Xen aspect of the original change. Linux no
longer supports running 32bit PV under Xen, and Xen doesn't support running
any 32bit PV guests without using PAE paging.
Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ryusuke Konishi <konishi.ryusuke@gmail.com>
CC: Alexander Potapenko <glider@google.com>
CC: Marco Elver <elver@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: kasan-dev@googlegroups.com
CC: linux-kernel@vger.kernel.org
---
v2:
* Fix a spelling mistake in the comment.
---
arch/x86/include/asm/kfence.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index acf9ffa1a171..dfd5c74ba41a 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
- pteval_t val;
+ pteval_t val, new;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
@@ -57,11 +57,12 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
return true;
/*
- * Otherwise, invert the entire PTE. This avoids writing out an
+ * Otherwise, flip the Present bit, taking care to avoid writing an
* L1TF-vulnerable PTE (not present, without the high address bits
* set).
*/
- set_pte(pte, __pte(~val));
+ new = val ^ _PAGE_PRESENT;
+ set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
/*
* If the page was protected (non-present) and we're making it
base-commit: fcb70a56f4d81450114034b2c61f48ce7444a0e2
--
2.39.5
On Mon, 26 Jan 2026 21:10:46 +0000 Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> The original patch inverted the PTE unconditionally to avoid
> L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
> paging.
>
> Adjust the logic to use the flip_protnone_guard() helper, which is a nop on
> 2-level paging but inverts the address bits in all other paging modes.
>
> This doesn't matter for the Xen aspect of the original change. Linux no
> longer supports running 32bit PV under Xen, and Xen doesn't support running
> any 32bit PV guests without using PAE paging.

Great, thanks. I'll add

Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>

and, importantly,

Cc: <stable@vger.kernel.org>

to help everything get threaded together correctly.

I'll queue this as a 6.19-rcX hotfix.
On Mon, Jan 26, 2026 at 01:24:50PM -0800, Andrew Morton wrote:
> Great thanks. I'll add
>
> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
>
> and, importantly,
>
> Cc: <stable@vger.kernel.org>
>
> to help everything get threaded together correctly.
>
>
> I'll queue this as a 6.19-rcX hotfix.
You can add also
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Works on a real hw too.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 26/01/2026 9:56 pm, Borislav Petkov wrote:
> On Mon, Jan 26, 2026 at 01:24:50PM -0800, Andrew Morton wrote:
>> Great thanks. I'll add
>>
>> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
>>
>> and, importantly,
>>
>> Cc: <stable@vger.kernel.org>
>>
>> to help everything get threaded together correctly.
>>
>>
>> I'll queue this as a 6.19-rcX hotfix.
> You can add also
>
> Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
>
> Works on a real hw too.

Thanks, and sorry for the breakage.

~Andrew
On Mon, Jan 26, 2026 at 10:01:56PM +0000, Andrew Cooper wrote:
> Thanks, and sorry for the breakage.
Bah, no one cares about 32-bit. :-P
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The original patch inverted the PTE unconditionally to avoid
L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
paging.
Adjust the logic to use the flip_protnone_guard() helper, which is a nop on
2-level paging but inverts the address bits in all other paging modes.
This doesn't matter for the Xen aspect of the original change. Linux no
longer supports running 32bit PV under Xen, and Xen doesn't support running
any 32bit PV guests without using PAE paging.
Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ryusuke Konishi <konishi.ryusuke@gmail.com>
CC: Alexander Potapenko <glider@google.com>
CC: Marco Elver <elver@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: kasan-dev@googlegroups.com
CC: linux-kernel@vger.kernel.org
---
arch/x86/include/asm/kfence.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index acf9ffa1a171..40cf6a5d781d 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
- pteval_t val;
+ pteval_t val, new;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
@@ -57,11 +57,12 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
return true;
/*
- * Otherwise, invert the entire PTE. This avoids writing out an
- * L1TF-vulnerable PTE (not present, without the high address bits
+ * Otherwise, flip the Present bit, taking care to avoid writing an
+ * L1TF-vulenrable PTE (not present, without the high address bits
* set).
*/
- set_pte(pte, __pte(~val));
+ new = val ^ _PAGE_PRESENT;
+ set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
/*
* If the page was protected (non-present) and we're making it
base-commit: fcb70a56f4d81450114034b2c61f48ce7444a0e2
--
2.39.5
On 26/01/2026 9:06 pm, Andrew Cooper wrote:
> The original patch inverted the PTE unconditionally to avoid
> L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
> paging.
>
> Adjust the logic to use the flip_protnone_guard() helper, which is a nop on
> 2-level paging but inverts the address bits in all other paging modes.
>
> This doesn't matter for the Xen aspect of the original change. Linux no
> longer supports running 32bit PV under Xen, and Xen doesn't support running
> any 32bit PV guests without using PAE paging.
>
> Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
> Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
> Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Ryusuke Konishi <konishi.ryusuke@gmail.com>
> CC: Alexander Potapenko <glider@google.com>
> CC: Marco Elver <elver@google.com>
> CC: Dmitry Vyukov <dvyukov@google.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: x86@kernel.org
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Jann Horn <jannh@google.com>
> CC: kasan-dev@googlegroups.com
> CC: linux-kernel@vger.kernel.org
> ---
> arch/x86/include/asm/kfence.h | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> index acf9ffa1a171..40cf6a5d781d 100644
> --- a/arch/x86/include/asm/kfence.h
> +++ b/arch/x86/include/asm/kfence.h
> @@ -42,7 +42,7 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> {
> unsigned int level;
> pte_t *pte = lookup_address(addr, &level);
> - pteval_t val;
> + pteval_t val, new;
>
> if (WARN_ON(!pte || level != PG_LEVEL_4K))
> return false;
> @@ -57,11 +57,12 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> return true;
>
> /*
> - * Otherwise, invert the entire PTE. This avoids writing out an
> - * L1TF-vulnerable PTE (not present, without the high address bits
> + * Otherwise, flip the Present bit, taking care to avoid writing an
> + * L1TF-vulenrable PTE (not present, without the high address bits
> * set).
> */
> - set_pte(pte, __pte(~val));
> + new = val ^ _PAGE_PRESENT;
> + set_pte(pte, __pte(flip_protnone_guard(val, new, PTE_PFN_MASK)));
>
> /*
> * If the page was protected (non-present) and we're making it
>
> base-commit: fcb70a56f4d81450114034b2c61f48ce7444a0e2
And I apparently can't spell. I'll do a v2 immediately, seeing as this
is somewhat urgent.
~Andrew
On Tue, Jan 6, 2026 at 7:04 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> For native, the choice of PTE is fine. There's real memory backing the
> non-present PTE. However, for XenPV, Xen complains:
>
> (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
>
> To explain, some background on XenPV pagetables:
>
> Xen PV guests control their own pagetables; they choose the new PTE
> value, and use hypercalls to make changes so Xen can audit for safety.
>
> In addition to a regular reference count, Xen also maintains a type
> reference count. e.g. SegDesc (referenced by vGDT/vLDT),
> Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
> lower pagetable level). This is in order to prevent e.g. a page being
> inserted into the pagetables for which the guest has a writable mapping.
>
> For non-present mappings, all other bits become software accessible, and
> typically contain metadata rather than a real frame address. There is nothing
> that a reference count could sensibly be tied to. As such, even if Xen
> could recognise the address as currently safe, nothing would prevent that
> frame from changing owner to another VM in the future.
>
> When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by activating
> shadow paging. This is normally only used for the live phase of
> migration, and comes with a reasonable overhead.
>
> KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
> about the value for non-present mappings. Use a fully inverted PTE, to
> avoid hitting the slow path when running under Xen.
>
> While adjusting the logic, take the opportunity to skip all actions if the
> PTE is already in the right state, halve the number of PVOps callouts, and skip
> TLB maintenance on a !P -> P transition which benefits non-Xen cases too.
>
> Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
> Tested-by: Marco Elver <elver@google.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
> /*
> * We need to avoid IPIs, as we may get KFENCE allocations or faults
> * with interrupts disabled. Therefore, the below is best-effort, and
> @@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> * lazy fault handling takes care of faults after the page is PRESENT.
> */
Nit: should this comment be moved above set_pte() or merged with
the following comment block?
On 07/01/2026 11:31 am, Alexander Potapenko wrote:
> On Tue, Jan 6, 2026 at 7:04 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> For native, the choice of PTE is fine. There's real memory backing the
>> non-present PTE. However, for XenPV, Xen complains:
>>
>> (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
>>
>> To explain, some background on XenPV pagetables:
>>
>> Xen PV guests control their own pagetables; they choose the new PTE
>> value, and use hypercalls to make changes so Xen can audit for safety.
>>
>> In addition to a regular reference count, Xen also maintains a type
>> reference count. e.g. SegDesc (referenced by vGDT/vLDT),
>> Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
>> lower pagetable level). This is in order to prevent e.g. a page being
>> inserted into the pagetables for which the guest has a writable mapping.
>>
>> For non-present mappings, all other bits become software accessible, and
>> typically contain metadata rather than a real frame address. There is nothing
>> that a reference count could sensibly be tied to. As such, even if Xen
>> could recognise the address as currently safe, nothing would prevent that
>> frame from changing owner to another VM in the future.
>>
>> When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by activating
>> shadow paging. This is normally only used for the live phase of
>> migration, and comes with a reasonable overhead.
>>
>> KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
>> about the value for non-present mappings. Use a fully inverted PTE, to
>> avoid hitting the slow path when running under Xen.
>>
>> While adjusting the logic, take the opportunity to skip all actions if the
>> PTE is already in the right state, halve the number of PVOps callouts, and skip
>> TLB maintenance on a !P -> P transition which benefits non-Xen cases too.
>>
>> Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
>> Tested-by: Marco Elver <elver@google.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Alexander Potapenko <glider@google.com>
Thanks.
>
>> /*
>> * We need to avoid IPIs, as we may get KFENCE allocations or faults
>> * with interrupts disabled. Therefore, the below is best-effort, and
>> @@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
>> * lazy fault handling takes care of faults after the page is PRESENT.
>> */
> Nit: should this comment be moved above set_pte() or merged with
> the following comment block?
Hmm, probably merged as they're both about the TLB maintenance. But the
end result is a far more messy diff:
@@ -42,23 +42,40 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
+ pteval_t val;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
+ val = pte_val(*pte);
+
/*
- * We need to avoid IPIs, as we may get KFENCE allocations or faults
- * with interrupts disabled. Therefore, the below is best-effort, and
- * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
- * lazy fault handling takes care of faults after the page is PRESENT.
+ * protect requires making the page not-present. If the PTE is
+ * already in the right state, there's nothing to do.
+ */
+ if (protect != !!(val & _PAGE_PRESENT))
+ return true;
+
+ /*
+ * Otherwise, invert the entire PTE. This avoids writing out an
+ * L1TF-vulnerable PTE (not present, without the high address bits
+ * set).
*/
+ set_pte(pte, __pte(~val));
- if (protect)
- set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
- else
- set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+ /*
+ * If the page was protected (non-present) and we're making it
+ * present, there is no need to flush the TLB at all.
+ */
+ if (!protect)
+ return true;
/*
+ * We need to avoid IPIs, as we may get KFENCE allocations or faults
+ * with interrupts disabled. Therefore, the below is best-effort, and
+ * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
+ * lazy fault handling takes care of faults after the page is PRESENT.
+ *
* Flush this CPU's TLB, assuming whoever did the allocation/free is
* likely to continue running on this CPU.
*/
I need to resubmit anyway, because I've spotted one silly error in the
commit message.
I could submit two patches, with the second one stated as "to make the
previous patch legible".
Thoughts?
~Andrew
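[Editor's note: the restructured control flow being discussed can be modelled in userspace. This is a sketch of the logic in the earlier patch (skip when the PTE is already in the requested state, one read and one write of the PTE, and TLB maintenance only on the P -> !P transition); `model_protect_page` and the flush counter are stand-ins, not kernel API, and the full-PTE inversion here is the v1 behaviour, before flip_protnone_guard() was introduced.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_PRESENT 0x1ULL

static unsigned int tlb_flushes;        /* counts simulated local TLB flushes */

static bool model_protect_page(uint64_t *pte, bool protect)
{
	uint64_t val = *pte;

	/* Nothing to do if the PTE is already in the requested state. */
	if (protect != !!(val & _PAGE_PRESENT))
		return true;

	/*
	 * Invert the whole PTE: this flips Present, and leaves the high
	 * address bits set while not-present, avoiding an L1TF-vulnerable
	 * value.  One write replaces the old read-modify-write pair.
	 */
	*pte = ~val;

	/* A !P -> P transition cannot have a stale TLB entry to evict. */
	if (!protect)
		return true;

	tlb_flushes++;                  /* local flush only, as in the patch */
	return true;
}
```

Protecting, re-protecting, and unprotecting a page triggers exactly one simulated flush and restores the original PTE value, matching the three optimisations listed in the commit message.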