Add page fault trace points, which are useful to implement RV monitor which
watches page faults.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
---
arch/arm64/mm/fault.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index ef63651099a9..e3f096b0dffd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -44,6 +44,9 @@
#include <asm/tlbflush.h>
#include <asm/traps.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/exceptions.h>
+
struct fault_info {
int (*fn)(unsigned long far, unsigned long esr,
struct pt_regs *regs);
@@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
if (kprobe_page_fault(regs, esr))
return 0;
+ if (user_mode(regs))
+ trace_page_fault_user(addr, regs, esr);
+ else
+ trace_page_fault_kernel(addr, regs, esr);
+
/*
* If we're in an interrupt or have no user context, we must not take
* the fault.
--
2.39.5
On Wed, Apr 30, 2025 at 01:02:32PM +0200, Nam Cao wrote:
> Add page fault trace points, which are useful to implement RV monitor which
> watches page faults.
>
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> ---
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> ---
> arch/arm64/mm/fault.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index ef63651099a9..e3f096b0dffd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -44,6 +44,9 @@
> #include <asm/tlbflush.h>
> #include <asm/traps.h>
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/exceptions.h>
> +
> struct fault_info {
> int (*fn)(unsigned long far, unsigned long esr,
> struct pt_regs *regs);
> @@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> if (kprobe_page_fault(regs, esr))
> return 0;
>
> + if (user_mode(regs))
> + trace_page_fault_user(addr, regs, esr);
> + else
> + trace_page_fault_kernel(addr, regs, esr);
Why is this after kprobe_page_fault()?
It's also a shame that the RV monitor can't hook into perf, as we
already have a sw event for page faults that you could use instead of
adding something new.
Will
On Fri, May 16, 2025 at 03:04:50PM +0100, Will Deacon wrote:
> On Wed, Apr 30, 2025 at 01:02:32PM +0200, Nam Cao wrote:
> > Add page fault trace points, which are useful to implement RV monitor which
> > watches page faults.
> >
> > Signed-off-by: Nam Cao <namcao@linutronix.de>
> > ---
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: linux-arm-kernel@lists.infradead.org
> > ---
> > arch/arm64/mm/fault.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index ef63651099a9..e3f096b0dffd 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -44,6 +44,9 @@
> > #include <asm/tlbflush.h>
> > #include <asm/traps.h>
> >
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/exceptions.h>
> > +
> > struct fault_info {
> > int (*fn)(unsigned long far, unsigned long esr,
> > struct pt_regs *regs);
> > @@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > if (kprobe_page_fault(regs, esr))
> > return 0;
> >
> > + if (user_mode(regs))
> > + trace_page_fault_user(addr, regs, esr);
> > + else
> > + trace_page_fault_kernel(addr, regs, esr);
>
> Why is this after kprobe_page_fault()?
The kprobe_page_fault() gunk is doing something quite different, and is
poorly named. That's trying to fixup the PC (and some other state) to
hide kprobe details from the fault handling logic when an out-of-line
copy of an instruction somehow triggers a fault.
Logically, that *should* happen before the tracepoints, and shouldn't be
moved later. For other reasons it needs to be even earlier in the fault
handling flow, and is currently far too late, but that only ends up
mattering int he presence of other kernel bugs. For now I think it
should stay where it is.
More details below, for the curious and/or deranged.
The kprobe_page_fault() gunk is trying to fix up the case where an
instruction has been kprobed, an out-of-line copy of that instruction is
being stepped, and the out-of-line instruction has triggered a fault.
When that happens, kprobe_page_fault() tries to reset the faulting PC
and DAIF such that it looks like the fault was taken from the original
PC of the probed instruction.
The real logic for that happens in kprobe_fault_handler(), which adjusts
the values in pt_regs, but does not handle the live DAIF value. It also
doesn't handle the PMR when pNMI is in use. Due to this, the fault
handler can run with DAIF bits masked unexpectedly, and a subsequent
exception return *could* go wrong.
Luckily all code with an extable entry has been blacklisted for kprobes
since commit:
888b3c8720e0a403 ("arm64: Treat all entry code as non-kprobe-able")
... so we should only get here if there's another kernel bug that causes
an unmarked dereference of a faulting address, in which case we're
likely to BUG() anyway.
The real fix would be to hoist this out to the arm64 entry code (and
handle similar for other EL1 exceptions), and get rid of all the
__kprobes annotations inthe fault code.
Mark.
On Mon, May 19, 2025 at 05:17:02PM +0100, Mark Rutland wrote:
> On Fri, May 16, 2025 at 03:04:50PM +0100, Will Deacon wrote:
> > On Wed, Apr 30, 2025 at 01:02:32PM +0200, Nam Cao wrote:
> > > Add page fault trace points, which are useful to implement RV monitor which
> > > watches page faults.
> > >
> > > Signed-off-by: Nam Cao <namcao@linutronix.de>
> > > ---
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > ---
> > > arch/arm64/mm/fault.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > index ef63651099a9..e3f096b0dffd 100644
> > > --- a/arch/arm64/mm/fault.c
> > > +++ b/arch/arm64/mm/fault.c
> > > @@ -44,6 +44,9 @@
> > > #include <asm/tlbflush.h>
> > > #include <asm/traps.h>
> > >
> > > +#define CREATE_TRACE_POINTS
> > > +#include <trace/events/exceptions.h>
> > > +
> > > struct fault_info {
> > > int (*fn)(unsigned long far, unsigned long esr,
> > > struct pt_regs *regs);
> > > @@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > > if (kprobe_page_fault(regs, esr))
> > > return 0;
> > >
> > > + if (user_mode(regs))
> > > + trace_page_fault_user(addr, regs, esr);
> > > + else
> > > + trace_page_fault_kernel(addr, regs, esr);
> >
> > Why is this after kprobe_page_fault()?
>
> The kprobe_page_fault() gunk is doing something quite different, and is
> poorly named. That's trying to fixup the PC (and some other state) to
> hide kprobe details from the fault handling logic when an out-of-line
> copy of an instruction somehow triggers a fault.
>
> Logically, that *should* happen before the tracepoints, and shouldn't be
> moved later. For other reasons it needs to be even earlier in the fault
> handling flow, and is currently far too late, but that only ends up
> mattering int he presence of other kernel bugs. For now I think it
> should stay where it is.
I thought these tracepoints were intended to be used by RV, in which
case I'd have thought we'd want as much coverage as possible to reason
about what the kernel is actually doing.
> More details below, for the curious and/or deranged.
>
> The kprobe_page_fault() gunk is trying to fix up the case where an
> instruction has been kprobed, an out-of-line copy of that instruction is
> being stepped, and the out-of-line instruction has triggered a fault.
> When that happens, kprobe_page_fault() tries to reset the faulting PC
> and DAIF such that it looks like the fault was taken from the original
> PC of the probed instruction.
>
> The real logic for that happens in kprobe_fault_handler(), which adjusts
> the values in pt_regs, but does not handle the live DAIF value. It also
> doesn't handle the PMR when pNMI is in use. Due to this, the fault
> handler can run with DAIF bits masked unexpectedly, and a subsequent
> exception return *could* go wrong.
>
> Luckily all code with an extable entry has been blacklisted for kprobes
> since commit:
>
> 888b3c8720e0a403 ("arm64: Treat all entry code as non-kprobe-able")
>
> ... so we should only get here if there's another kernel bug that causes
> an unmarked dereference of a faulting address, in which case we're
> likely to BUG() anyway.
>
> The real fix would be to hoist this out to the arm64 entry code (and
> handle similar for other EL1 exceptions), and get rid of all the
> __kprobes annotations inthe fault code.
This seems to be an argument for removing kprobe_page_fault() entirely,
which is fine, but while it exists it's not obvious to me how it's
supposed to interact with RV.
I suppose the pragmatic thing to do would be to align as closely as
possible with x86, but any documentation/guidance/tests to help us
maintain that would be really helpful. Otherwise, this feels like we're
going to have a repeat of the syscall entry mess where the interaction
with ptrace, audit, seccomp etc was perpetually broken in user-visible
ways.
Will
On May 16, 2025 10:04:50 AM EDT, Will Deacon <will@kernel.org> wrote: > >> + if (user_mode(regs)) >> + trace_page_fault_user(addr, regs, esr); >> + else >> + trace_page_fault_kernel(addr, regs, esr); > >Why is this after kprobe_page_fault()? > >It's also a shame that the RV monitor can't hook into perf, as we >already have a sw event for page faults that you could use instead of >adding something new. > Perf events work for perf only. My question is why isn't this a tracepoint that perf could hook into? Tracepoints are made to be generic, whereas perf events are not. -- Steve
On Fri, May 16, 2025 at 10:42:48AM -0400, Steven Rostedt wrote: > > > On May 16, 2025 10:04:50 AM EDT, Will Deacon <will@kernel.org> wrote: > > > >> + if (user_mode(regs)) > >> + trace_page_fault_user(addr, regs, esr); > >> + else > >> + trace_page_fault_kernel(addr, regs, esr); > > > >Why is this after kprobe_page_fault()? > > > >It's also a shame that the RV monitor can't hook into perf, as we > >already have a sw event for page faults that you could use instead of > >adding something new. > > > > Perf events work for perf only. My question is why isn't this a tracepoint > that perf could hook into? Well, the perf event came first in this case, so we're stuck with it :/ I was hoping we could settle for a generic helper that could emit both the trace event and the perf event (so that the ordering of the two is portable across architectures) but, judging by Nam's reply, the trace event is needed before kprobes gets a look in. Will
On Mon, 19 May 2025 16:12:39 +0100 Will Deacon <will@kernel.org> wrote: > > Perf events work for perf only. My question is why isn't this a tracepoint > > that perf could hook into? > > Well, the perf event came first in this case, so we're stuck with it :/ I wonder what effort it will take to convert perf events to tracepoints ;-) Note, I'm talking about tracepoints and not trace events, where the latter is exposed to tracefs and the former is not. > > I was hoping we could settle for a generic helper that could emit both > the trace event and the perf event (so that the ordering of the two is > portable across architectures) but, judging by Nam's reply, the trace > event is needed before kprobes gets a look in. Perhaps we could add a helper function that does both (perf and tracepoint) and hide the implementation from the code that calls it? But I'm currently still on PTO so I haven't looked at the details yet. -- Steve
On Mon, May 19, 2025 at 12:08:37PM -0400, Steven Rostedt wrote: > On Mon, 19 May 2025 16:12:39 +0100 > Will Deacon <will@kernel.org> wrote: > > > > Perf events work for perf only. My question is why isn't this a tracepoint > > > that perf could hook into? > > > > Well, the perf event came first in this case, so we're stuck with it :/ > > I wonder what effort it will take to convert perf events to tracepoints ;-) > > Note, I'm talking about tracepoints and not trace events, where the > latter is exposed to tracefs and the former is not. > > > > > I was hoping we could settle for a generic helper that could emit both > > the trace event and the perf event (so that the ordering of the two is > > portable across architectures) but, judging by Nam's reply, the trace > > event is needed before kprobes gets a look in. > > Perhaps we could add a helper function that does both (perf and > tracepoint) and hide the implementation from the code that calls it? Something like that sounds like a good idea, yes. > But I'm currently still on PTO so I haven't looked at the details yet. Enjoy! Will
On Fri, May 16, 2025 at 03:04:50PM +0100, Will Deacon wrote:
> On Wed, Apr 30, 2025 at 01:02:32PM +0200, Nam Cao wrote:
> > Add page fault trace points, which are useful to implement RV monitor which
> > watches page faults.
> >
> > Signed-off-by: Nam Cao <namcao@linutronix.de>
> > ---
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: linux-arm-kernel@lists.infradead.org
> > ---
> > arch/arm64/mm/fault.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index ef63651099a9..e3f096b0dffd 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -44,6 +44,9 @@
> > #include <asm/tlbflush.h>
> > #include <asm/traps.h>
> >
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/exceptions.h>
> > +
> > struct fault_info {
> > int (*fn)(unsigned long far, unsigned long esr,
> > struct pt_regs *regs);
> > @@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > if (kprobe_page_fault(regs, esr))
> > return 0;
> >
> > + if (user_mode(regs))
> > + trace_page_fault_user(addr, regs, esr);
> > + else
> > + trace_page_fault_kernel(addr, regs, esr);
>
> Why is this after kprobe_page_fault()?
This is me being incompetent, sorry about that. It is more logical to put
them at the beginning.
Best regards,
Nam
Can I get an Acked-by from the ARM64 maintainers?
Thanks,
-- Steve
On Wed, 30 Apr 2025 13:02:32 +0200
Nam Cao <namcao@linutronix.de> wrote:
> Add page fault trace points, which are useful to implement RV monitor which
> watches page faults.
>
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> ---
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> ---
> arch/arm64/mm/fault.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index ef63651099a9..e3f096b0dffd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -44,6 +44,9 @@
> #include <asm/tlbflush.h>
> #include <asm/traps.h>
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/exceptions.h>
> +
> struct fault_info {
> int (*fn)(unsigned long far, unsigned long esr,
> struct pt_regs *regs);
> @@ -559,6 +562,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> if (kprobe_page_fault(regs, esr))
> return 0;
>
> + if (user_mode(regs))
> + trace_page_fault_user(addr, regs, esr);
> + else
> + trace_page_fault_kernel(addr, regs, esr);
> +
> /*
> * If we're in an interrupt or have no user context, we must not take
> * the fault.
© 2016 - 2026 Red Hat, Inc.