[v13] perf/core: Add ability for an event to "pause" or "resume" AUX area tracing

[PATCH V13 00/14] perf/core: Add ability for an event to "pause" or "resume" AUX area tracing

Posted by Adrian Hunter 1 year, 3 months ago

Hi

Note for V12:
	There was a small conflict between the Intel PT changes in
	"KVM: x86: Fix Intel PT Host/Guest mode when host tracing" and the
	changes in this patch set, so I have put the patch sets together,
	along with outstanding fix "perf/x86/intel/pt: Fix buffer full but
	size is 0 case"

	Cover letter for KVM changes (patches 2 to 4):

	There is a long-standing problem whereby running Intel PT on host and guest
	in Host/Guest mode causes VM-Entry failure.

	The motivation for this patch set is to provide a fix for stable kernels
	prior to the advent of the "Mediated Passthrough vPMU" patch set:

		https://lore.kernel.org/kvm/20240801045907.4010984-1-mizhang@google.com/

	which would render a large part of the fix unnecessary but likely not be
	suitable for backport to stable due to its size and complexity.

	Ideally, this patch set would be applied before "Mediated Passthrough vPMU"

	Note that the fix does not conflict with "Mediated Passthrough vPMU", it
	is just that "Mediated Passthrough vPMU" will make the code to stop and
	restart Intel PT unnecessary.

Note for V11:
	Moving aux_paused into a union within struct hw_perf_event caused
	a regression because aux_paused was being written unconditionally
	even though it is valid only for AUX (e.g. Intel PT) PMUs.
	That is fixed in V11.
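
The V11 regression can be illustrated with a minimal, self-contained C sketch. The union layout, every field name other than aux_paused, and the capability value are hypothetical stand-ins, not the kernel's struct hw_perf_event or its real PERF_PMU_CAP_AUX_PAUSE value. The sketch only shows why an unconditional write to one union member can corrupt state owned by another.

```c
#include <stdint.h>

/* Hypothetical capability bit; the kernel's actual value may differ. */
#define SKETCH_PMU_CAP_AUX_PAUSE	0x1u

/* Hypothetical stand-in for a per-PMU union inside struct hw_perf_event. */
union sketch_hw_event {
	struct {				/* state of a non-AUX PMU */
		uint64_t config;
		uint64_t other_state;
	} plain;
	struct {				/* state of an AUX area PMU */
		uint64_t aux_config;
		unsigned int aux_paused;	/* overlaps plain.other_state */
	} aux;
};

/* Unconditional write: also touches events that belong to non-AUX PMUs. */
static void init_unconditional(union sketch_hw_event *hw)
{
	hw->aux.aux_paused = 0;
}

/* Conditional write: only done when the PMU advertises AUX pause support. */
static void init_conditional(union sketch_hw_event *hw, unsigned int caps)
{
	if (caps & SKETCH_PMU_CAP_AUX_PAUSE)
		hw->aux.aux_paused = 0;
}
```

With a PMU that lacks the capability, init_conditional() leaves plain.other_state intact, whereas init_unconditional() zeroes part of it.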

Hardware traces, such as instruction traces, can produce a vast amount of
trace data, so being able to reduce tracing to more specific circumstances
can be useful.

The ability to pause or resume tracing when another event happens can do
that.

These patches add such a facility and show how it would work for Intel
Processor Trace.

Maintainers of other AUX area tracing implementations are requested to
consider if this is something they might employ and then whether or not
the ABI would work for them.  Note, thank you to James Clark (ARM) for
evaluating the API for Coresight.  Suzuki K Poulose (ARM) also responded
positively to the RFC.

Changes to perf tools are now (since V4) fleshed out.

Please note, Intel® Architecture Instruction Set Extensions and Future
Features Programming Reference March 2024 319433-052, currently:

	https://cdrdv2.intel.com/v1/dl/getContent/671368

introduces hardware pause / resume for Intel PT in a feature named
Intel PT Trigger Tracing.

For that more fields in perf_event_attr will be necessary.  The main
differences are:
	- it can be applied not just to overflows, but optionally to
	every event
	- a packet is emitted into the trace, optionally with IP
	information
	- no PMI
	- works with PMC and DR (breakpoint) events only

Here are the proposed additions to perf_event_attr, please comment:

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0c557f0a17b3..05dcc43f11bb 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -369,6 +369,22 @@ enum perf_event_read_format {
 	PERF_FORMAT_MAX = 1U << 5,		/* non-ABI */
 };
 
+enum {
+	PERF_AUX_ACTION_START_PAUSED		=   1U << 0,
+	PERF_AUX_ACTION_PAUSE			=   1U << 1,
+	PERF_AUX_ACTION_RESUME			=   1U << 2,
+	PERF_AUX_ACTION_EMIT			=   1U << 3,
+	PERF_AUX_ACTION_NR			= 0x1f << 4,
+	PERF_AUX_ACTION_NO_IP			=   1U << 9,
+	PERF_AUX_ACTION_PAUSE_ON_EVT		=   1U << 10,
+	PERF_AUX_ACTION_RESUME_ON_EVT		=   1U << 11,
+	PERF_AUX_ACTION_EMIT_ON_EVT		=   1U << 12,
+	PERF_AUX_ACTION_NR_ON_EVT		= 0x1f << 13,
+	PERF_AUX_ACTION_NO_IP_ON_EVT		=   1U << 18,
+	PERF_AUX_ACTION_MASK			= ~PERF_AUX_ACTION_START_PAUSED,
+	PERF_AUX_PAUSE_RESUME_MASK		= PERF_AUX_ACTION_PAUSE | PERF_AUX_ACTION_RESUME,
+};
+
 #define PERF_ATTR_SIZE_VER0	64	/* sizeof first published struct */
 #define PERF_ATTR_SIZE_VER1	72	/* add: config2 */
 #define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
@@ -515,10 +531,19 @@ struct perf_event_attr {
 	union {
 		__u32	aux_action;
 		struct {
-			__u32	aux_start_paused :  1, /* start AUX area tracing paused */
-				aux_pause        :  1, /* on overflow, pause AUX area tracing */
-				aux_resume       :  1, /* on overflow, resume AUX area tracing */
-				__reserved_3     : 29;
+			__u32	aux_start_paused  :  1, /* start AUX area tracing paused */
+				aux_pause         :  1, /* on overflow, pause AUX area tracing */
+				aux_resume        :  1, /* on overflow, resume AUX area tracing */
+				aux_emit          :  1, /* generate AUX records instead of events */
+				aux_nr            :  5, /* AUX area tracing reference number */
+				aux_no_ip         :  1, /* suppress IP in AUX records */
+				/* Following apply to event occurrence not overflows */
+				aux_pause_on_evt  :  1, /* on event, pause AUX area tracing */
+				aux_resume_on_evt :  1, /* on event, resume AUX area tracing */
+				aux_emit_on_evt   :  1, /* generate AUX records instead of events */
+				aux_nr_on_evt     :  5, /* AUX area tracing reference number */
+				aux_no_ip_on_evt  :  1, /* suppress IP in AUX records */
+				__reserved_3      : 13;
 		};
 	};
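
As a rough illustration of how the proposed masks could be used on the raw aux_action word, here is a minimal sketch. The mask values are copied from the proposed enum above (with an explicit unsigned suffix); the shift constants and the two helper functions are hypothetical and are not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the proposed enum above. */
#define PERF_AUX_ACTION_EMIT		(1U << 3)
#define PERF_AUX_ACTION_NR		(0x1fU << 4)
#define PERF_AUX_ACTION_NR_ON_EVT	(0x1fU << 13)

/* Hypothetical shift constant, derived from PERF_AUX_ACTION_NR. */
#define SKETCH_NR_SHIFT			4

/* Store a 5-bit reference number in the overflow-related NR field. */
static uint32_t aux_action_set_nr(uint32_t aux_action, uint32_t nr)
{
	assert(nr <= 0x1f);
	return (aux_action & ~PERF_AUX_ACTION_NR) | (nr << SKETCH_NR_SHIFT);
}

/* Read the reference number back out of the same field. */
static uint32_t aux_action_get_nr(uint32_t aux_action)
{
	return (aux_action & PERF_AUX_ACTION_NR) >> SKETCH_NR_SHIFT;
}
```

For example, aux_action_set_nr(PERF_AUX_ACTION_EMIT, 5) sets bit 3 plus the value 5 in bits 4 to 8, and leaves the "on event" fields starting at bit 10 untouched.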


Changes in V13:
      perf/core: Add aux_pause, aux_resume, aux_start_paused
	Do aux_resume at the end of __perf_event_overflow() so as to trace
	less of perf itself

      perf tools: Add missing_features for aux_start_paused, aux_pause, aux_resume
	Add error message also in EOPNOTSUPP case (Leo)

Changes in V12:
	Add previously sent patch "perf/x86/intel/pt: Fix buffer full
	but size is 0 case"

	Add previously sent patch set "KVM: x86: Fix Intel PT Host/Guest
	mode when host tracing"

	Rebase on current tip plus patch set "KVM: x86: Fix Intel PT Host/Guest
	mode when host tracing"

Changes in V11:
      perf/core: Add aux_pause, aux_resume, aux_start_paused
	Make assignment to event->hw.aux_paused conditional on
	(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE).

      perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
	Remove definition of has_aux_action() because it has
	already been added as an inline function.

      perf/x86/intel/pt: Fix sampling synchronization
      perf tools: Enable evsel__is_aux_event() to work for ARM/ARM64
      perf tools: Enable evsel__is_aux_event() to work for S390_CPUMSF
	Dropped because they have already been applied

Changes in V10:
      perf/core: Add aux_pause, aux_resume, aux_start_paused
	Move aux_paused into a union within struct hw_perf_event.
	Additional comment wrt PERF_EF_PAUSE/PERF_EF_RESUME.
	Factor out has_aux_action() as an inline function.
	Use scoped_guard for irqsave.
	Move calls of perf_event_aux_pause() from __perf_event_output()
	to __perf_event_overflow().

Changes in V9:
      perf/x86/intel/pt: Fix sampling synchronization
	New patch

      perf/core: Add aux_pause, aux_resume, aux_start_paused
	Move aux_paused to struct hw_perf_event

      perf/x86/intel/pt: Add support for pause / resume
	Add more comments and barriers for resume_allowed and
	pause_allowed
	Always use WRITE_ONCE with resume_allowed


Changes in V8:

      perf tools: Parse aux-action
	Fix clang warning:
	     util/auxtrace.c:821:7: error: missing field 'aux_action' initializer [-Werror,-Wmissing-field-initializers]
	     821 |         {NULL},
	         |              ^

Changes in V7:

	Add Andi's Reviewed-by for patches 2-12
	Re-base

Changes in V6:

      perf/core: Add aux_pause, aux_resume, aux_start_paused
	Removed READ/WRITE_ONCE from __perf_event_aux_pause()
	Expanded comment about guarding against NMI

Changes in V5:

    perf/core: Add aux_pause, aux_resume, aux_start_paused
	Added James' Ack

    perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
	New patch

    perf tools
	Added Ian's Ack

Changes in V4:

    perf/core: Add aux_pause, aux_resume, aux_start_paused
	Rename aux_output_cfg -> aux_action
	Reorder aux_action bits from:
		aux_pause, aux_resume, aux_start_paused
	to:
		aux_start_paused, aux_pause, aux_resume
	Fix aux_action bits __u64 -> __u32

    coresight: Have a stab at support for pause / resume
	Dropped

    perf tools
	All new patches

Changes in RFC V3:

    coresight: Have a stab at support for pause / resume
	'mode' -> 'flags' so it at least compiles

Changes in RFC V2:

	Use ->stop() / ->start() instead of ->pause_resume()
	Move aux_start_paused bit into aux_output_cfg
	Tighten up when Intel PT pause / resume is allowed
	Add an example of how it might work for CoreSight


Adrian Hunter (14):
      perf/x86/intel/pt: Fix buffer full but size is 0 case
      KVM: x86: Fix Intel PT IA32_RTIT_CTL MSR validation
      KVM: x86: Fix Intel PT Host/Guest mode when host tracing also
      KVM: selftests: Add guest Intel PT test
      perf/core: Add aux_pause, aux_resume, aux_start_paused
      perf/x86/intel/pt: Add support for pause / resume
      perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
      perf tools: Add aux_start_paused, aux_pause and aux_resume
      perf tools: Add aux-action config term
      perf tools: Parse aux-action
      perf tools: Add missing_features for aux_start_paused, aux_pause, aux_resume
      perf intel-pt: Improve man page format
      perf intel-pt: Add documentation for pause / resume
      perf intel-pt: Add a test for pause / resume

 arch/x86/events/intel/core.c                       |   4 +-
 arch/x86/events/intel/pt.c                         | 209 +++++++-
 arch/x86/events/intel/pt.h                         |  16 +
 arch/x86/include/asm/intel_pt.h                    |   4 +
 arch/x86/kvm/vmx/vmx.c                             |  26 +-
 arch/x86/kvm/vmx/vmx.h                             |   1 -
 include/linux/perf_event.h                         |  28 +
 include/uapi/linux/perf_event.h                    |  11 +-
 kernel/events/core.c                               |  75 ++-
 kernel/events/internal.h                           |   1 +
 tools/include/uapi/linux/perf_event.h              |  11 +-
 tools/perf/Documentation/perf-intel-pt.txt         | 596 +++++++++++++--------
 tools/perf/Documentation/perf-record.txt           |   4 +
 tools/perf/builtin-record.c                        |   4 +-
 tools/perf/tests/shell/test_intel_pt.sh            |  28 +
 tools/perf/util/auxtrace.c                         |  67 ++-
 tools/perf/util/auxtrace.h                         |   6 +-
 tools/perf/util/evsel.c                            |  15 +
 tools/perf/util/evsel.h                            |   1 +
 tools/perf/util/evsel_config.h                     |   1 +
 tools/perf/util/parse-events.c                     |  10 +
 tools/perf/util/parse-events.h                     |   1 +
 tools/perf/util/parse-events.l                     |   1 +
 tools/perf/util/perf_event_attr_fprintf.c          |   3 +
 tools/perf/util/pmu.c                              |   1 +
 tools/testing/selftests/kvm/Makefile               |   1 +
 .../selftests/kvm/include/x86_64/processor.h       |   1 +
 tools/testing/selftests/kvm/x86_64/intel_pt.c      | 381 +++++++++++++
 28 files changed, 1243 insertions(+), 264 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/intel_pt.c


Regards
Adrian

[PATCH V13 01/14] perf/x86/intel/pt: Fix buffer full but size is 0 case

Posted by Adrian Hunter 1 year, 3 months ago

If the trace data buffer becomes full, a truncated flag [T] is reported
in PERF_RECORD_AUX.  In some cases, the size reported is 0, even though
data must have been added to make the buffer full.

That happens when the buffer fills up from empty to full before the
Intel PT driver has updated the buffer position.  Then the driver
calculates the new buffer position before calculating the data size.
If the old and new positions are the same, the data size is reported
as 0, even though it is really the whole buffer size.

Fix by detecting when the buffer position is wrapped, and adjust the
data size calculation accordingly.
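
The adjusted size calculation can be tried in isolation with the sketch below. The helper name and its parameters are hypothetical; only the comparison mirrors the change made to pt_update_head() in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes written between the old and new buffer
 * positions, for a circular buffer of buf_size bytes.
 */
static uint64_t aux_bytes_written(uint64_t old, uint64_t base,
				  uint64_t buf_size, bool wrapped)
{
	/*
	 * Equal positions are ambiguous: either nothing was written, or
	 * the whole buffer was. The wrapped flag resolves that case.
	 */
	if (base < old || (base == old && wrapped))
		base += buf_size;

	return base - old;
}
```

With an 8K buffer, old == base == 0 yields 0 when wrapped is false and 0x2000 when it is true, which corresponds to the "size: 0" versus "size: 0x2000" records in the example output.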

Example

  Use a very small buffer size (8K) and observe the size of truncated [T]
  data. Before the fix, it is possible to see records of 0 size.

  Before:

    $ perf record -m,8K -e intel_pt// uname
    Linux
    [ perf record: Woken up 2 times to write data ]
    [ perf record: Captured and wrote 0.105 MB perf.data ]
    $ perf script -D --no-itrace | grep AUX | grep -F '[T]'
    Warning:
    AUX data lost 2 times out of 3!

    5 19462712368111 0x19710 [0x40]: PERF_RECORD_AUX offset: 0 size: 0 flags: 0x1 [T]
    5 19462712700046 0x19ba8 [0x40]: PERF_RECORD_AUX offset: 0x170 size: 0xe90 flags: 0x1 [T]

  After:

    $ perf record -m,8K -e intel_pt// uname
    Linux
    [ perf record: Woken up 3 times to write data ]
    [ perf record: Captured and wrote 0.040 MB perf.data ]
    $ perf script -D --no-itrace | grep AUX | grep -F '[T]'
    Warning:
    AUX data lost 2 times out of 3!

    1 113720802995 0x4948 [0x40]: PERF_RECORD_AUX offset: 0 size: 0x2000 flags: 0x1 [T]
    1 113720979812 0x6b10 [0x40]: PERF_RECORD_AUX offset: 0x2000 size: 0x2000 flags: 0x1 [T]

Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/intel/pt.c | 11 ++++++++---
 arch/x86/events/intel/pt.h |  2 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index fd4670a6694e..a087bc0c5498 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -828,11 +828,13 @@ static void pt_buffer_advance(struct pt_buffer *buf)
 	buf->cur_idx++;
 
 	if (buf->cur_idx == buf->cur->last) {
-		if (buf->cur == buf->last)
+		if (buf->cur == buf->last) {
 			buf->cur = buf->first;
-		else
+			buf->wrapped = true;
+		} else {
 			buf->cur = list_entry(buf->cur->list.next, struct topa,
 					      list);
+		}
 		buf->cur_idx = 0;
 	}
 }
@@ -846,8 +848,11 @@ static void pt_buffer_advance(struct pt_buffer *buf)
 static void pt_update_head(struct pt *pt)
 {
 	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	bool wrapped = buf->wrapped;
 	u64 topa_idx, base, old;
 
+	buf->wrapped = false;
+
 	if (buf->single) {
 		local_set(&buf->data_size, buf->output_off);
 		return;
@@ -865,7 +870,7 @@ static void pt_update_head(struct pt *pt)
 	} else {
 		old = (local64_xchg(&buf->head, base) &
 		       ((buf->nr_pages << PAGE_SHIFT) - 1));
-		if (base < old)
+		if (base < old || (base == old && wrapped))
 			base += buf->nr_pages << PAGE_SHIFT;
 
 		local_add(base - old, &buf->data_size);
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index f5e46c04c145..a1b6c04b7f68 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -65,6 +65,7 @@ struct pt_pmu {
  * @head:	logical write offset inside the buffer
  * @snapshot:	if this is for a snapshot/overwrite counter
  * @single:	use Single Range Output instead of ToPA
+ * @wrapped:	buffer advance wrapped back to the first topa table
  * @stop_pos:	STOP topa entry index
  * @intr_pos:	INT topa entry index
  * @stop_te:	STOP topa entry pointer
@@ -82,6 +83,7 @@ struct pt_buffer {
 	local64_t		head;
 	bool			snapshot;
 	bool			single;
+	bool			wrapped;
 	long			stop_pos, intr_pos;
 	struct topa_entry	*stop_te, *intr_te;
 	void			**data_pages;
-- 
2.43.0

[PATCH V13 02/14] KVM: x86: Fix Intel PT IA32_RTIT_CTL MSR validation

Posted by Adrian Hunter 1 year, 3 months ago

Fix KVM IA32_RTIT_CTL MSR validation logic so that if RTIT_CTL_TRACEEN
bit is cleared, then other bits are allowed to change also. For example,
writing 0 to IA32_RTIT_CTL in order to stop tracing, is valid.
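
The difference between the old and new conditions can be compared side by side in a standalone sketch. The function names are hypothetical and the vcpu state is reduced to a single "cur" value standing in for vmx->pt_desc.guest.ctl; the two expressions are taken from the diff below.

```c
#include <stdbool.h>
#include <stdint.h>

#define RTIT_CTL_TRACEEN	(1ULL << 0)	/* TraceEn is bit 0 */

/* Old condition: ignores whether the write itself clears TraceEn. */
static bool rtit_ctl_write_rejected_old(uint64_t cur, uint64_t data)
{
	return (cur & RTIT_CTL_TRACEEN) &&
	       ((cur ^ data) & ~RTIT_CTL_TRACEEN);
}

/* New condition: reject only if TraceEn stays set and other bits change. */
static bool rtit_ctl_write_rejected_new(uint64_t cur, uint64_t data)
{
	return (cur & RTIT_CTL_TRACEEN) &&
	       (data & RTIT_CTL_TRACEEN) &&
	       data != cur;
}
```

If cur has TraceEn plus any other bit set, a write of 0 is rejected by the old condition and accepted by the new one, while a write that keeps TraceEn set but changes another bit is rejected by both.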

Fixes: bf8c55d8dc09 ("KVM: x86: Implement Intel PT MSRs read/write emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1a4438358c5e..eaf4965ac6df 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1635,7 +1635,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 	 * result in a #GP unless the same write also clears TraceEn.
 	 */
 	if ((vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) &&
-		((vmx->pt_desc.guest.ctl ^ data) & ~RTIT_CTL_TRACEEN))
+	    (data & RTIT_CTL_TRACEEN) &&
+	    data != vmx->pt_desc.guest.ctl)
 		return 1;
 
 	/*
-- 
2.43.0

Re: [PATCH V13 02/14] KVM: x86: Fix Intel PT IA32_RTIT_CTL MSR validation

Posted by Sean Christopherson 1 year, 3 months ago

"KVM: VMX:" for the scope.

And I would much prefer to actually state what is changing.  "Fix XYZ" isn't
helpful in understanding what's actually broken, fallout from the bug, etc.  It's
never easy to describe bugs where the logic is flat out busted, but I think we can
at least capture the basic gist, and allude to the badness being a wrongly disallowed
write.

On Mon, Oct 14, 2024, Adrian Hunter wrote:
> Fix KVM IA32_RTIT_CTL MSR validation logic so that if RTIT_CTL_TRACEEN
> bit is cleared, then other bits are allowed to change also. For example,
> writing 0 to IA32_RTIT_CTL in order to stop tracing, is valid.

There's a fair amount of extraneous and distracting information in both the shortlog
and changelog.  E.g. "Intel PT IA32_RTIT_CTL MSR" can simply be MSR_IA32_RTIT_CTL.
And the 

I'll fix up to the below when applying; AFAICT, this fix is completely independent
of the rest of the series.

KVM: VMX: Allow toggling bits in MSR_IA32_RTIT_CTL when enable bit is cleared

  Allow toggling other bits in MSR_IA32_RTIT_CTL if the enable bit is being
  cleared, the existing logic simply ignores the enable bit.  E.g. KVM will
  incorrectly reject a write of '0' to stop tracing.

> Fixes: bf8c55d8dc09 ("KVM: x86: Implement Intel PT MSRs read/write emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1a4438358c5e..eaf4965ac6df 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1635,7 +1635,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
>  	 * result in a #GP unless the same write also clears TraceEn.
>  	 */
>  	if ((vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) &&
> -		((vmx->pt_desc.guest.ctl ^ data) & ~RTIT_CTL_TRACEEN))
> +	    (data & RTIT_CTL_TRACEEN) &&
> +	    data != vmx->pt_desc.guest.ctl)
>  		return 1;
>  
>  	/*
> -- 
> 2.43.0
>

[PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Adrian Hunter 1 year, 3 months ago

Ensure Intel PT tracing is disabled before VM-Entry in Intel PT Host/Guest
mode.

Intel PT has 2 modes for tracing virtual machines. The default is System
mode whereby host and guest output to the host trace buffer. The other is
Host/Guest mode whereby host and guest output to their own buffers.
Host/Guest mode is selected by kvm_intel module parameter pt_mode=1.

In Host/Guest mode, the following rule must be followed:

	If the logical processor is operating with Intel PT enabled
	(if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the
	"load IA32_RTIT_CTL" VM-entry control must be 0.

However, "load IA32_RTIT_CTL" VM-entry control is always 1 in Host/Guest
mode, so IA32_RTIT_CTL.TraceEn must always be 0 at VM entry, irrespective
of whether guest IA32_RTIT_CTL.TraceEn is 1.

Fix by stopping host Intel PT tracing always at VM entry in Host/Guest
mode. That also fixes the issue whereby the Intel PT NMI handler would
set IA32_RTIT_CTL.TraceEn back to 1 after KVM has just set it to 0.
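
The rule quoted above can be written as a simple predicate. This is only a sketch of the stated condition with a hypothetical function name; it does not model any other VM-entry check.

```c
#include <stdbool.h>
#include <stdint.h>

#define RTIT_CTL_TRACEEN	(1ULL << 0)	/* TraceEn is bit 0 */

/*
 * Hypothetical helper: returns true if the quoted rule is satisfied,
 * given the value of IA32_RTIT_CTL at the time of VM entry and the
 * setting of the "load IA32_RTIT_CTL" VM-entry control.
 */
static bool rtit_vm_entry_rule_ok(uint64_t rtit_ctl_at_entry,
				  bool load_rtit_ctl_control)
{
	if (rtit_ctl_at_entry & RTIT_CTL_TRACEEN)
		return !load_rtit_ctl_control;

	return true;
}
```

Because the control is always 1 in Host/Guest mode, the predicate can only hold there when TraceEn is 0 at VM entry, which is why host tracing has to be stopped first.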

Fixes: 2ef444f1600b ("KVM: x86: Add Intel PT context switch for each vcpu")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/intel/pt.c      | 131 +++++++++++++++++++++++++++++++-
 arch/x86/events/intel/pt.h      |  10 +++
 arch/x86/include/asm/intel_pt.h |   4 +
 arch/x86/kvm/vmx/vmx.c          |  23 ++----
 arch/x86/kvm/vmx/vmx.h          |   1 -
 5 files changed, 147 insertions(+), 22 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index a087bc0c5498..d9469d2d6aa6 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -480,16 +480,20 @@ static u64 pt_config_filters(struct perf_event *event)
 		 */
 
 		/* avoid redundant msr writes */
-		if (pt->filters.filter[range].msr_a != filter->msr_a) {
+		if (pt->filters.filter[range].msr_a != filter->msr_a ||
+		    pt->write_filter_msrs[range]) {
 			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
 			pt->filters.filter[range].msr_a = filter->msr_a;
 		}
 
-		if (pt->filters.filter[range].msr_b != filter->msr_b) {
+		if (pt->filters.filter[range].msr_b != filter->msr_b ||
+		    pt->write_filter_msrs[range]) {
 			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
 			pt->filters.filter[range].msr_b = filter->msr_b;
 		}
 
+		pt->write_filter_msrs[range] = false;
+
 		rtit_ctl |= (u64)filter->config << pt_address_ranges[range].reg_off;
 	}
 
@@ -534,6 +538,11 @@ static void pt_config(struct perf_event *event)
 	reg |= (event->attr.config & PT_CONFIG_MASK);
 
 	event->hw.aux_config = reg;
+
+	/* Configuration is complete, it is now OK to handle an NMI */
+	barrier();
+	WRITE_ONCE(pt->handle_nmi, 1);
+
 	pt_config_start(event);
 }
 
@@ -950,6 +959,7 @@ static void pt_handle_status(struct pt *pt)
 		pt_buffer_advance(buf);
 
 	wrmsrl(MSR_IA32_RTIT_STATUS, status);
+	pt->status = status;
 }
 
 /**
@@ -1588,7 +1598,6 @@ static void pt_event_start(struct perf_event *event, int mode)
 			goto fail_end_stop;
 	}
 
-	WRITE_ONCE(pt->handle_nmi, 1);
 	hwc->state = 0;
 
 	pt_config_buffer(buf);
@@ -1643,6 +1652,104 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	}
 }
 
+#define PT_VM_NO_TRANSITION	0
+#define PT_VM_ENTRY		1
+#define PT_VM_EXIT		2
+
+void intel_pt_vm_entry(bool guest_trace_enable)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct perf_event *event;
+
+	pt->restart_event = NULL;
+	pt->stashed_buf_sz = 0;
+
+	WRITE_ONCE(pt->vm_transition, PT_VM_ENTRY);
+	barrier();
+
+	if (READ_ONCE(pt->handle_nmi)) {
+		/* Must stop handler before reading pt->handle.event */
+		WRITE_ONCE(pt->handle_nmi, 0);
+		barrier();
+		event = pt->handle.event;
+		if (event && !event->hw.state) {
+			struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+			if (buf && buf->snapshot)
+				pt->stashed_buf_sz = buf->nr_pages << PAGE_SHIFT;
+			pt->restart_event = event;
+			pt_event_stop(event, PERF_EF_UPDATE);
+		}
+	}
+
+	/*
+	 * If guest_trace_enable, MSRs need to be saved, but the values are
+	 * either already cached or not needed:
+	 *	MSR_IA32_RTIT_CTL		event->hw.aux_config
+	 *	MSR_IA32_RTIT_STATUS		pt->status
+	 *	MSR_IA32_RTIT_CR3_MATCH		not used
+	 *	MSR_IA32_RTIT_OUTPUT_BASE	pt->output_base
+	 *	MSR_IA32_RTIT_OUTPUT_MASK	pt->output_mask
+	 *	MSR_IA32_RTIT_ADDR...		pt->filters
+	 */
+}
+EXPORT_SYMBOL_GPL(intel_pt_vm_entry);
+
+void intel_pt_vm_exit(bool guest_trace_enable)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	u64 base = pt->output_base;
+	u64 mask = pt->output_mask;
+
+	WRITE_ONCE(pt->vm_transition, PT_VM_EXIT);
+	barrier();
+
+	/*
+	 * If guest_trace_enable, MSRs need to be restored, but that is handled
+	 * in different ways:
+	 *	MSR_IA32_RTIT_CTL		written next start
+	 *	MSR_IA32_RTIT_STATUS		restored below
+	 *	MSR_IA32_RTIT_CR3_MATCH		not used
+	 *	MSR_IA32_RTIT_OUTPUT_BASE	written next start or restored
+	 *					further below
+	 *	MSR_IA32_RTIT_OUTPUT_MASK	written next start or restored
+	 *					further below
+	 *	MSR_IA32_RTIT_ADDR...		flagged to be written when
+	 *					needed
+	 */
+	if (guest_trace_enable) {
+		wrmsrl(MSR_IA32_RTIT_STATUS, pt->status);
+		/*
+		 * Force address filter MSR writes during reconfiguration,
+		 * refer pt_config_filters().
+		 */
+		for (int range = 0; range < PT_FILTERS_NUM; range++)
+			pt->write_filter_msrs[range] = true;
+	}
+
+	if (pt->restart_event) {
+		if (guest_trace_enable) {
+			/* Invalidate to force buffer reconfiguration */
+			pt->output_base = ~0ULL;
+			pt->output_mask = 0;
+		}
+		pt_event_start(pt->restart_event, 0);
+		pt->restart_event = NULL;
+	}
+
+	/* If tracing wasn't started, restore buffer configuration */
+	if (guest_trace_enable && !READ_ONCE(pt->handle_nmi)) {
+		wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base);
+		wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, mask);
+		pt->output_base = base;
+		pt->output_mask = mask;
+	}
+
+	barrier();
+	WRITE_ONCE(pt->vm_transition, PT_VM_NO_TRANSITION);
+}
+EXPORT_SYMBOL_GPL(intel_pt_vm_exit);
+
 static long pt_event_snapshot_aux(struct perf_event *event,
 				  struct perf_output_handle *handle,
 				  unsigned long size)
@@ -1651,6 +1758,24 @@ static long pt_event_snapshot_aux(struct perf_event *event,
 	struct pt_buffer *buf = perf_get_aux(&pt->handle);
 	unsigned long from = 0, to;
 	long ret;
+	int tr;
+
+	/*
+	 * Special handling during VM transition. At VM-Entry stage, once
+	 * tracing is stopped, as indicated by buf == NULL, snapshot using the
+	 * saved head position. At VM-Exit do that also until tracing is
+	 * reconfigured as indicated by handle_nmi.
+	 */
+	tr = READ_ONCE(pt->vm_transition);
+	if ((tr == PT_VM_ENTRY && !buf) || (tr == PT_VM_EXIT && !READ_ONCE(pt->handle_nmi))) {
+		if (WARN_ON_ONCE(!pt->stashed_buf_sz))
+			return 0;
+		to = pt->handle.head;
+		if (to < size)
+			from = pt->stashed_buf_sz;
+		from += to - size;
+		return perf_output_copy_aux(&pt->handle, handle, from, to);
+	}
 
 	if (WARN_ON_ONCE(!buf))
 		return 0;
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index a1b6c04b7f68..0428019b92f4 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -121,6 +121,11 @@ struct pt_filters {
  * @vmx_on:		1 if VMX is ON on this cpu
  * @output_base:	cached RTIT_OUTPUT_BASE MSR value
  * @output_mask:	cached RTIT_OUTPUT_MASK MSR value
+ * @status:		cached RTIT_STATUS MSR value
+ * @vm_transition:	VM transition (snapshot_aux needs special handling)
+ * @write_filter_msrs:	write address filter MSRs during configuration
+ * @stashed_buf_sz:	buffer size during VM transition
+ * @restart_event:	event to restart after VM-Exit
  */
 struct pt {
 	struct perf_output_handle handle;
@@ -129,6 +134,11 @@ struct pt {
 	int			vmx_on;
 	u64			output_base;
 	u64			output_mask;
+	u64			status;
+	int			vm_transition;
+	bool			write_filter_msrs[PT_FILTERS_NUM];
+	unsigned long		stashed_buf_sz;
+	struct perf_event	*restart_event;
 };
 
 #endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
index c796e9bc98b6..a673ac3a825e 100644
--- a/arch/x86/include/asm/intel_pt.h
+++ b/arch/x86/include/asm/intel_pt.h
@@ -30,11 +30,15 @@ enum pt_capabilities {
 void cpu_emergency_stop_pt(void);
 extern u32 intel_pt_validate_hw_cap(enum pt_capabilities cap);
 extern u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities cap);
+extern void intel_pt_vm_entry(bool guest_trace_enable);
+extern void intel_pt_vm_exit(bool guest_trace_enable);
 extern int is_intel_pt_event(struct perf_event *event);
 #else
 static inline void cpu_emergency_stop_pt(void) {}
 static inline u32 intel_pt_validate_hw_cap(enum pt_capabilities cap) { return 0; }
 static inline u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities capability) { return 0; }
+static inline void intel_pt_vm_entry(bool guest_trace_enable) {}
+static inline void intel_pt_vm_exit(bool guest_trace_enable) {}
 static inline int is_intel_pt_event(struct perf_event *event) { return 0; }
 #endif
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index eaf4965ac6df..9998da4e774d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1220,16 +1220,10 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
 	if (vmx_pt_mode_is_system())
 		return;
 
-	/*
-	 * GUEST_IA32_RTIT_CTL is already set in the VMCS.
-	 * Save host state before VM entry.
-	 */
-	rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
-	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
-		wrmsrl(MSR_IA32_RTIT_CTL, 0);
-		pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
+	intel_pt_vm_entry(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
+
+	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN)
 		pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-	}
 }
 
 static void pt_guest_exit(struct vcpu_vmx *vmx)
@@ -1237,17 +1231,10 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
 	if (vmx_pt_mode_is_system())
 		return;
 
-	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
+	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN)
 		pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-		pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
-	}
 
-	/*
-	 * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
-	 * i.e. RTIT_CTL is always cleared on VM-Exit.  Restore it if necessary.
-	 */
-	if (vmx->pt_desc.host.ctl)
-		wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
+	intel_pt_vm_exit(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
 }
 
 void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2325f773a20b..24ac6f6dc0ca 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -63,7 +63,6 @@ struct pt_desc {
 	u64 ctl_bitmask;
 	u32 num_address_ranges;
 	u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
-	struct pt_ctx host;
 	struct pt_ctx guest;
 };
 
-- 
2.43.0
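
For reference, the start-offset arithmetic used in the VM-transition path of pt_event_snapshot_aux() above can be exercised on its own. The helper name and parameter names are hypothetical; the three statements follow the hunk, with "head" standing in for pt->handle.head and "buf_sz" for pt->stashed_buf_sz.

```c
/*
 * Hypothetical helper: start offset of a snapshot of 'size' bytes that
 * ends at 'head' in a circular buffer of 'buf_sz' bytes.
 */
static unsigned long snapshot_from(unsigned long head, unsigned long size,
				   unsigned long buf_sz)
{
	unsigned long from = 0;

	if (head < size)
		from = buf_sz;	/* the range starts before offset 0, so wrap */

	/* head - size may wrap around; adding buf_sz above compensates */
	from += head - size;

	return from;
}
```

With an 8K buffer, a 4K snapshot ending at head 0x1800 starts at 0x800, and one ending at head 0x100 starts at 0x1100, i.e. it wraps around the end of the buffer.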

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Sean Christopherson 1 year, 3 months ago

On Mon, Oct 14, 2024, Adrian Hunter wrote:
> Ensure Intel PT tracing is disabled before VM-Entry in Intel PT Host/Guest
> mode.
> 
> Intel PT has 2 modes for tracing virtual machines. The default is System
> mode whereby host and guest output to the host trace buffer. The other is
> Host/Guest mode whereby host and guest output to their own buffers.
> Host/Guest mode is selected by kvm_intel module parameter pt_mode=1.
> 
> In Host/Guest mode, the following rule must be followed:

This is misleading and arguably wrong.  The following "rule" must _always_ be
followed.  If I weren't intimately familiar with the distinctive style of the
SDM's consistency checks, odds are good I wouldn't have any idea where this rule
came from.

> 	If the logical processor is operating with Intel PT enabled
> 	(if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the
> 	"load IA32_RTIT_CTL" VM-entry control must be 0.

> However, "load IA32_RTIT_CTL" VM-entry control is always 1 in Host/Guest
> mode, so IA32_RTIT_CTL.TraceEn must always be 0 at VM entry, irrespective
> of whether guest IA32_RTIT_CTL.TraceEn is 1.

Explicitly state what the bad behavior is, _somewhere_.  Similar to the previous
patch, there is a lot of information to wade through just to understand that this
results in a failed VM-Entry.

Furthermore, nothing in here spells out exactly under what conditions this bug
surfaces, which makes it unnecessarily difficult to understand what can go wrong,
and when.

> Fix by stopping host Intel PT tracing always at VM entry in Host/Guest

It's not _at_ VM-Entry.  The language matters, because this makes it sound like
PT tracing is being disabled as part of VM-Entry.

> mode.
>
> That also fixes the issue whereby the Intel PT NMI handler would
> set IA32_RTIT_CTL.TraceEn back to 1 after KVM has just set it to 0.

In theory, this should be an entirely separate fix.  In practice, simply clearing
MSR_IA32_RTIT_CTL before VM-Enter if tracing is enabled doesn't help much, i.e.
re-enabling in the NMI handler isn't all that rare.  That absolutely needs to
be called out in the changelog.

> Fixes: 2ef444f1600b ("KVM: x86: Add Intel PT context switch for each vcpu")
> Cc: stable@vger.kernel.org

This is way, way too big for stable@.  Given that host/guest mode is disabled by
default and that no one has complained about this, I think it's safe to say that
unless we can provide a minimal patch, fixing this in LTS kernels isn't a priority.

Alternatively, I'm tempted to simply drop support for host/guest mode.  It clearly
hasn't been well tested, and given the lack of bug reports, likely doesn't have
many, if any, users.  And I'm guessing the overhead needed to context switch all
the RTIT MSRs makes tracing in the guest relatively useless.

/me fiddles around

LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,
I'll post a patch to hide the module param if CONFIG_BROKEN=n (and will omit
stable@ for the previous patch).

Going forward, if someone actually cares about virtualizing PT enough to want to
fix KVM's mess, then they can put in the effort to fix all the bugs, write all
the tests, and in general clean up the implementation to meet KVM's current
standards.  E.g. KVM usage of intel_pt_validate_cap() instead of KVM's guest CPUID
and capabilities infrastructure needs to go.

My vote is to queue the current code for removal, and revisit support after the
mediated PMU has landed.  Because I don't see any point in supporting Intel PT
without a mediated PMU, as host/guest mode really only makes sense if the entire
PMU is being handed over to the guest.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f587daf2a3bb..fe5046709bc3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -217,9 +217,13 @@ module_param(ple_window_shrink, uint, 0444);
 static unsigned int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 module_param(ple_window_max, uint, 0444);
 
-/* Default is SYSTEM mode, 1 for host-guest mode */
+/* Default is SYSTEM mode, 1 for host-guest mode (which is BROKEN) */
+#ifdef CONFIG_BROKEN
 int __read_mostly pt_mode = PT_MODE_SYSTEM;
 module_param(pt_mode, int, S_IRUGO);
+#else
+#define pt_mode PT_MODE_SYSTEM
+#endif
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
[ 1458.686107] ------------[ cut here ]------------
[ 1458.690766] Invalid MSR 588, please adapt vmx_possible_passthrough_msrs[]
[ 1458.690790] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.708588] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1458.725826] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S                 6.12.0-smp--65cbdf61cc85-dbg #445
[ 1458.736197] Tainted: [S]=CPU_OUT_OF_SPEC
[ 1458.740145] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1458.747651] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.754561] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1458.773346] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1458.778598] RAX: 49af8c020dc11100 RBX: 0000000000000588 RCX: 0000000000000027
[ 1458.785761] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1458.792929] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1458.800082] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1458.807250] R13: 0000000000000004 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1458.814409] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1458.822525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1458.828295] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1458.835457] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1458.842619] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1458.849794] PKRU: 55555554
[ 1458.852537] Call Trace:
[ 1458.855013]  <TASK>
[ 1458.857151]  ? __warn+0xce/0x210
[ 1458.860417]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.866713]  ? report_bug+0xbd/0x160
[ 1458.870320]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.876628]  ? handle_bug+0x63/0x90
[ 1458.880156]  ? exc_invalid_op+0x1a/0x50
[ 1458.884021]  ? asm_exc_invalid_op+0x1a/0x20
[ 1458.888243]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.894544]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1458.900846]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1458.906974]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
[ 1458.913017]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1458.918140]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1458.922795]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1458.927902]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1458.932140]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1458.937252]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1458.942184]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1458.946688]  ? __mutex_lock+0x65/0xbe0
[ 1458.950473]  ? __mutex_lock+0x231/0xbe0
[ 1458.954345]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1458.958929]  ? kfree+0x4a/0x380
[ 1458.962109]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1458.966852]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1458.971272]  ? vma_end_read+0x14/0xf0
[ 1458.974969]  ? vma_end_read+0xd2/0xf0
[ 1458.978664]  __se_sys_ioctl+0x6b/0xc0
[ 1458.982366]  do_syscall_64+0x83/0x160
[ 1458.986075]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1458.991160] RIP: 0033:0x45d93b
[ 1458.994252] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1459.013025] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1459.020624] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1459.027789] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1459.034952] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1459.042112] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1459.049271] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1459.056440]  </TASK>
[ 1459.058670] irq event stamp: 10347
[ 1459.062107] hardirqs last  enabled at (10357): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1459.070749] hardirqs last disabled at (10372): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1459.079400] softirqs last  enabled at (10418): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.087953] softirqs last disabled at (10381): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.096505] ---[ end trace 0000000000000000 ]---
[ 1459.101160] ------------[ cut here ]------------
[ 1459.105817] Invalid MSR 589, please adapt vmx_possible_passthrough_msrs[]
[ 1459.105826] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.123618] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1459.140843] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1459.151217] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1459.156042] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1459.163554] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.170459] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1459.189245] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1459.194502] RAX: 49af8c020dc11100 RBX: 0000000000000589 RCX: 0000000000000027
[ 1459.201670] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1459.208830] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1459.215990] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1459.223154] R13: 0000000000000004 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1459.230319] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1459.238437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1459.244208] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1459.251369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1459.258530] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1459.265698] PKRU: 55555554
[ 1459.268441] Call Trace:
[ 1459.270918]  <TASK>
[ 1459.273053]  ? __warn+0xce/0x210
[ 1459.276311]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.282614]  ? report_bug+0xbd/0x160
[ 1459.286234]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.292535]  ? handle_bug+0x63/0x90
[ 1459.296052]  ? exc_invalid_op+0x1a/0x50
[ 1459.299917]  ? asm_exc_invalid_op+0x1a/0x20
[ 1459.304133]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.310434]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.316732]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1459.322858]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
[ 1459.328903]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.334016]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1459.338674]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.343778]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1459.348017]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.353126]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1459.358064]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1459.362559]  ? __mutex_lock+0x65/0xbe0
[ 1459.366340]  ? __mutex_lock+0x231/0xbe0
[ 1459.370205]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1459.374789]  ? kfree+0x4a/0x380
[ 1459.377958]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1459.382699]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1459.387118]  ? vma_end_read+0x14/0xf0
[ 1459.390814]  ? vma_end_read+0xd2/0xf0
[ 1459.394507]  __se_sys_ioctl+0x6b/0xc0
[ 1459.398205]  do_syscall_64+0x83/0x160
[ 1459.401903]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1459.406992] RIP: 0033:0x45d93b
[ 1459.410081] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1459.428854] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1459.436458] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1459.443621] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1459.450778] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1459.457940] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1459.465109] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1459.472273]  </TASK>
[ 1459.474493] irq event stamp: 11613
[ 1459.477922] hardirqs last  enabled at (11623): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1459.486562] hardirqs last disabled at (11632): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1459.495198] softirqs last  enabled at (11580): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.503755] softirqs last disabled at (11651): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.512304] ---[ end trace 0000000000000000 ]---
[ 1459.516951] ------------[ cut here ]------------
[ 1459.521594] Invalid MSR 58a, please adapt vmx_possible_passthrough_msrs[]
[ 1459.521601] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.539388] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1459.556613] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1459.566986] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1459.571809] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1459.579318] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.586226] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1459.605008] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1459.610262] RAX: 49af8c020dc11100 RBX: 000000000000058a RCX: 0000000000000027
[ 1459.617423] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1459.624584] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1459.631754] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1459.638915] R13: 0000000000000005 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1459.646071] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1459.654185] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1459.659960] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1459.667125] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1459.674287] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1459.681450] PKRU: 55555554
[ 1459.684192] Call Trace:
[ 1459.686675]  <TASK>
[ 1459.688814]  ? __warn+0xce/0x210
[ 1459.692077]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.698379]  ? report_bug+0xbd/0x160
[ 1459.701999]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.708312]  ? handle_bug+0x63/0x90
[ 1459.711837]  ? exc_invalid_op+0x1a/0x50
[ 1459.715704]  ? asm_exc_invalid_op+0x1a/0x20
[ 1459.719927]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.726225]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.732520]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1459.738645]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
[ 1459.744682]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.749787]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1459.754443]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.759550]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1459.763798]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1459.768911]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1459.773844]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1459.778348]  ? __mutex_lock+0x65/0xbe0
[ 1459.782133]  ? __mutex_lock+0x231/0xbe0
[ 1459.786008]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1459.790602]  ? kfree+0x4a/0x380
[ 1459.793780]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1459.798513]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1459.802922]  ? vma_end_read+0x14/0xf0
[ 1459.806613]  ? vma_end_read+0xd2/0xf0
[ 1459.810307]  __se_sys_ioctl+0x6b/0xc0
[ 1459.813999]  do_syscall_64+0x83/0x160
[ 1459.817692]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1459.822779] RIP: 0033:0x45d93b
[ 1459.825862] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1459.844633] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1459.852227] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1459.859394] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1459.866555] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1459.873729] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1459.880889] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1459.888053]  </TASK>
[ 1459.890276] irq event stamp: 12747
[ 1459.893707] hardirqs last  enabled at (12757): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1459.902345] hardirqs last disabled at (12766): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1459.910978] softirqs last  enabled at (12716): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.919527] softirqs last disabled at (12703): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1459.928078] ---[ end trace 0000000000000000 ]---
[ 1459.932723] ------------[ cut here ]------------
[ 1459.937370] Invalid MSR 58b, please adapt vmx_possible_passthrough_msrs[]
[ 1459.937376] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1459.955169] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1459.972406] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1459.982794] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1459.987619] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1459.995124] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.002033] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1460.020843] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1460.026103] RAX: 49af8c020dc11100 RBX: 000000000000058b RCX: 0000000000000027
[ 1460.033267] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1460.040429] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1460.047591] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1460.054752] R13: 0000000000000005 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1460.061918] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1460.070028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1460.075801] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1460.082964] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1460.090132] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1460.097295] PKRU: 55555554
[ 1460.100033] Call Trace:
[ 1460.102511]  <TASK>
[ 1460.104641]  ? __warn+0xce/0x210
[ 1460.107905]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.114203]  ? report_bug+0xbd/0x160
[ 1460.117808]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.124111]  ? handle_bug+0x63/0x90
[ 1460.127639]  ? exc_invalid_op+0x1a/0x50
[ 1460.131511]  ? asm_exc_invalid_op+0x1a/0x20
[ 1460.135729]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.142026]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.148321]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1460.154450]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
[ 1460.160489]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.165600]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1460.170258]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.175363]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1460.179604]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.184706]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1460.189644]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1460.194146]  ? __mutex_lock+0x65/0xbe0
[ 1460.197924]  ? __mutex_lock+0x231/0xbe0
[ 1460.201789]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1460.206377]  ? kfree+0x4a/0x380
[ 1460.209553]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1460.214302]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1460.218718]  ? vma_end_read+0x14/0xf0
[ 1460.222418]  ? vma_end_read+0xd2/0xf0
[ 1460.226117]  __se_sys_ioctl+0x6b/0xc0
[ 1460.229811]  do_syscall_64+0x83/0x160
[ 1460.233521]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1460.238610] RIP: 0033:0x45d93b
[ 1460.241699] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1460.260470] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1460.268067] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1460.275228] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1460.282390] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1460.289557] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1460.296718] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1460.303887]  </TASK>
[ 1460.306114] irq event stamp: 14023
[ 1460.309551] hardirqs last  enabled at (14033): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1460.318187] hardirqs last disabled at (14042): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1460.326831] softirqs last  enabled at (14070): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1460.335378] softirqs last disabled at (14083): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1460.343926] ---[ end trace 0000000000000000 ]---
[ 1460.348579] ------------[ cut here ]------------
[ 1460.353231] Invalid MSR 58c, please adapt vmx_possible_passthrough_msrs[]
[ 1460.353237] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.371028] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1460.388254] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1460.398631] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1460.403459] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1460.410967] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.417877] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1460.436658] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1460.441918] RAX: 49af8c020dc11100 RBX: 000000000000058c RCX: 0000000000000027
[ 1460.449083] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1460.456247] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1460.463406] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1460.470566] R13: 0000000000000006 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1460.477728] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1460.485848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1460.491623] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1460.498787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1460.505952] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1460.513119] PKRU: 55555554
[ 1460.515861] Call Trace:
[ 1460.518335]  <TASK>
[ 1460.520473]  ? __warn+0xce/0x210
[ 1460.523737]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.530041]  ? report_bug+0xbd/0x160
[ 1460.533654]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.539952]  ? handle_bug+0x63/0x90
[ 1460.543477]  ? exc_invalid_op+0x1a/0x50
[ 1460.547344]  ? asm_exc_invalid_op+0x1a/0x20
[ 1460.551565]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.557869]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.564171]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1460.570300]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
[ 1460.576335]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.581440]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1460.586096]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.591202]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1460.595449]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.600564]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1460.605503]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1460.610009]  ? __mutex_lock+0x65/0xbe0
[ 1460.613797]  ? __mutex_lock+0x231/0xbe0
[ 1460.617669]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1460.622267]  ? kfree+0x4a/0x380
[ 1460.625445]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1460.630186]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1460.634605]  ? vma_end_read+0x14/0xf0
[ 1460.638306]  ? vma_end_read+0xd2/0xf0
[ 1460.642004]  __se_sys_ioctl+0x6b/0xc0
[ 1460.645704]  do_syscall_64+0x83/0x160
[ 1460.649397]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1460.654485] RIP: 0033:0x45d93b
[ 1460.657578] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1460.676348] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1460.683942] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1460.691108] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1460.698271] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1460.705432] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1460.712594] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1460.719757]  </TASK>
[ 1460.721980] irq event stamp: 15053
[ 1460.725410] hardirqs last  enabled at (15063): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1460.734047] hardirqs last disabled at (15072): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1460.742686] softirqs last  enabled at (15104): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1460.751238] softirqs last disabled at (15115): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1460.759781] ---[ end trace 0000000000000000 ]---
[ 1460.764428] ------------[ cut here ]------------
[ 1460.769071] Invalid MSR 58d, please adapt vmx_possible_passthrough_msrs[]
[ 1460.769077] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.786863] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
[ 1460.804086] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1460.814453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1460.819275] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1460.826784] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.833692] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
[ 1460.852464] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
[ 1460.857716] RAX: 49af8c020dc11100 RBX: 000000000000058d RCX: 0000000000000027
[ 1460.864876] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
[ 1460.872035] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
[ 1460.879203] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
[ 1460.886372] R13: 0000000000000006 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
[ 1460.893543] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
[ 1460.901658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1460.907445] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
[ 1460.914605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1460.921759] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1460.928920] PKRU: 55555554
[ 1460.931657] Call Trace:
[ 1460.934138]  <TASK>
[ 1460.936276]  ? __warn+0xce/0x210
[ 1460.939539]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.945842]  ? report_bug+0xbd/0x160
[ 1460.949459]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.955756]  ? handle_bug+0x63/0x90
[ 1460.959284]  ? exc_invalid_op+0x1a/0x50
[ 1460.963153]  ? asm_exc_invalid_op+0x1a/0x20
[ 1460.967368]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.973665]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
[ 1460.979961]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
[ 1460.986086]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
[ 1460.992125]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1460.997233]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
[ 1461.001891]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1461.006999]  __kvm_set_msr+0xa3/0x180 [kvm]
[ 1461.011248]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
[ 1461.016361]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
[ 1461.021301]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.025795]  ? __mutex_lock+0x65/0xbe0
[ 1461.029575]  ? __mutex_lock+0x231/0xbe0
[ 1461.033438]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
[ 1461.038032]  ? kfree+0x4a/0x380
[ 1461.041209]  ? __mutex_unlock_slowpath+0x3a/0x230
[ 1461.045950]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
[ 1461.050370]  ? vma_end_read+0x14/0xf0
[ 1461.054069]  ? vma_end_read+0xd2/0xf0
[ 1461.057768]  __se_sys_ioctl+0x6b/0xc0
[ 1461.061463]  do_syscall_64+0x83/0x160
[ 1461.065160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.070244] RIP: 0033:0x45d93b
[ 1461.073335] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1461.092107] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1461.099706] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1461.106867] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
[ 1461.114035] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
[ 1461.121198] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
[ 1461.128364] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
[ 1461.135530]  </TASK>
[ 1461.137753] irq event stamp: 16059
[ 1461.141183] hardirqs last  enabled at (16069): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
[ 1461.149819] hardirqs last disabled at (16078): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
[ 1461.158458] softirqs last  enabled at (16046): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1461.167003] softirqs last disabled at (16041): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
[ 1461.175545] ---[ end trace 0000000000000000 ]---
[ 1461.201335] kvm_intel: PT tracing already disabled, RTIT_CTL = 0
[ 1461.207370] unchecked MSR access error: RDMSR from 0x584 at rIP: 0xffffffffc0a9d5a7 (pt_save_msr+0x77/0x1a0 [kvm_intel])
[ 1461.218257] Call Trace:
[ 1461.220731]  <TASK>
[ 1461.222861]  ? fixup_exception+0x50e/0x580
[ 1461.226985]  ? up+0x14/0x50
[ 1461.229802]  ? gp_try_fixup_and_notify+0x34/0xe0
[ 1461.234438]  ? exc_general_protection+0xe5/0x1f0
[ 1461.239073]  ? lock_release+0xf7/0x310
[ 1461.242845]  ? prb_read_valid+0x29/0x50
[ 1461.246700]  ? asm_exc_general_protection+0x26/0x30
[ 1461.251603]  ? pt_save_msr+0x77/0x1a0 [kvm_intel]
[ 1461.256330]  vmx_vcpu_run+0x687/0xb20 [kvm_intel]
[ 1461.261063]  ? lockdep_hardirqs_on_prepare+0x163/0x250
[ 1461.266221]  ? lock_release+0xf7/0x310
[ 1461.269997]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
[ 1461.275360]  kvm_arch_vcpu_ioctl_run+0x1784/0x2720 [kvm]
[ 1461.280718]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
[ 1461.286075]  ? arch_get_unmapped_area_topdown+0x27d/0x2d0
[ 1461.291492]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.295980]  ? lock_acquire+0xd9/0x260
[ 1461.299749]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.304237]  ? get_task_pid+0x20/0x1a0
[ 1461.308012]  ? lock_acquire+0xd9/0x260
[ 1461.311786]  ? get_task_pid+0x20/0x1a0
[ 1461.315561]  ? lock_release+0xf7/0x310
[ 1461.319337]  ? get_task_pid+0x20/0x1a0
[ 1461.323110]  ? get_task_pid+0x20/0x1a0
[ 1461.326886]  kvm_vcpu_ioctl+0x54f/0x620 [kvm]
[ 1461.331287]  ? vm_mmap_pgoff+0x119/0x1b0
[ 1461.335231]  __se_sys_ioctl+0x6b/0xc0
[ 1461.338914]  do_syscall_64+0x83/0x160
[ 1461.342598]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.347668] RIP: 0033:0x45d93b
[ 1461.350748] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1461.369518] RSP: 002b:00007fffccda3740 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1461.377111] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1461.384267] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
[ 1461.391416] RBP: 000000003d655e60 R08: 0000000000000006 R09: 0000000000005000
[ 1461.398566] R10: 0000000000000001 R11: 0000000000000246 R12: 000000003d653840
[ 1461.405720] R13: 0000000000000006 R14: 0000000000000002 R15: 0000000000000002
[ 1461.412879]  </TASK>
[ 1461.415101] kvm_intel: Loading guest Intel PT MSRs
[ 1461.420361] kvm_intel: Cleared RTIT_CTL
[ 1461.424252] kvm_intel: Cleared RTIT_CTL
[ 1461.428126] kvm_intel: Cleared RTIT_CTL
[ 1461.432002] kvm_intel: Cleared RTIT_CTL
[ 1461.435868] kvm_intel: Cleared RTIT_CTL
[ 1461.439736] kvm_intel: Cleared RTIT_CTL
[ 1461.443644] pt: ToPA ERROR encountered, trying to recover

[ 1461.443652] ======================================================
[ 1461.443653] WARNING: possible circular locking dependency detected
[ 1461.443654] 6.12.0-smp--65cbdf61cc85-dbg #445 Tainted: G S      W         
[ 1461.443655] ------------------------------------------------------
[ 1461.443656] intel_pt/40110 is trying to acquire lock:
[ 1461.443657] ffffffffb0672898 ((console_sem).lock){-...}-{2:2}, at: down_trylock+0x12/0x40
[ 1461.443660] 
               but task is already holding lock:
[ 1461.443660] ff31455cac47a618 (&ctx->lock){-...}-{2:2}, at: __perf_event_task_sched_out+0x2f8/0x3a0
[ 1461.443663] 
               which lock already depends on the new lock.
[ 1461.443664] 
               the existing dependency chain (in reverse order) is:
[ 1461.443664] 
               -> #3 (&ctx->lock){-...}-{2:2}:
[ 1461.443665]        _raw_spin_lock+0x30/0x40
[ 1461.443667]        __perf_event_task_sched_out+0x2f8/0x3a0
[ 1461.443669]        __schedule+0xd60/0xda0
[ 1461.443671]        schedule+0xb0/0x140
[ 1461.443672]        xfer_to_guest_mode_handle_work+0x4c/0xc0
[ 1461.443674]        kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
[ 1461.443708]        kvm_vcpu_ioctl+0x54f/0x620 [kvm]
[ 1461.443735]        __se_sys_ioctl+0x6b/0xc0
[ 1461.443737]        do_syscall_64+0x83/0x160
[ 1461.443738]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.443739] 
               -> #2 (&rq->__lock){-.-.}-{2:2}:
[ 1461.443740]        _raw_spin_lock_nested+0x2e/0x40
[ 1461.443742]        __task_rq_lock+0x5d/0x100
[ 1461.443744]        wake_up_new_task+0xf8/0x300
[ 1461.443745]        kernel_clone+0x187/0x340
[ 1461.443746]        user_mode_thread+0xc0/0xf0
[ 1461.443748]        rest_init+0x1f/0x1f0
[ 1461.443749]        start_kernel+0x38f/0x3d0
[ 1461.443750]        x86_64_start_reservations+0x24/0x30
[ 1461.443751]        x86_64_start_kernel+0xa9/0xb0
[ 1461.443752]        common_startup_64+0x13e/0x140
[ 1461.443753] 
               -> #1 (&p->pi_lock){-.-.}-{2:2}:
[ 1461.443754]        _raw_spin_lock_irqsave+0x5a/0x90
[ 1461.443755]        try_to_wake_up+0x56/0x840
[ 1461.443756]        up+0x3d/0x50
[ 1461.443757]        __console_unlock+0x6c/0xa0
[ 1461.443758]        console_unlock+0x6c/0x110
[ 1461.443758]        vprintk_emit+0x22e/0x330
[ 1461.443759]        _printk+0x5d/0x80
[ 1461.443761]        do_exit+0x7fb/0xa90
[ 1461.443762]        __x64_sys_exit+0x17/0x20
[ 1461.443764]        x64_sys_call+0x2113/0x2130
[ 1461.443765]        do_syscall_64+0x83/0x160
[ 1461.443766]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.443767] 
               -> #0 ((console_sem).lock){-...}-{2:2}:
[ 1461.443768]        __lock_acquire+0x15c0/0x2ea0
[ 1461.443769]        lock_acquire+0xd9/0x260
[ 1461.443770]        _raw_spin_lock_irqsave+0x5a/0x90
[ 1461.443771]        down_trylock+0x12/0x40
[ 1461.443772]        __down_trylock_console_sem+0x46/0xc0
[ 1461.443773]        vprintk_emit+0x115/0x330
[ 1461.443773]        _printk+0x5d/0x80
[ 1461.443774]        pt_handle_status+0x1ad/0x200
[ 1461.443776]        pt_event_stop+0x127/0x200
[ 1461.443777]        event_sched_out+0xd4/0x280
[ 1461.443779]        group_sched_out+0x40/0xc0
[ 1461.443780]        __pmu_ctx_sched_out+0xeb/0x140
[ 1461.443781]        ctx_sched_out+0x124/0x190
[ 1461.443782]        __perf_event_task_sched_out+0x31b/0x3a0
[ 1461.443783]        __schedule+0xd60/0xda0
[ 1461.443785]        schedule+0xb0/0x140
[ 1461.443786]        xfer_to_guest_mode_handle_work+0x4c/0xc0
[ 1461.443787]        kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
[ 1461.443814]        kvm_vcpu_ioctl+0x54f/0x620 [kvm]
[ 1461.443840]        __se_sys_ioctl+0x6b/0xc0
[ 1461.443842]        do_syscall_64+0x83/0x160
[ 1461.443842]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.443843] 
               other info that might help us debug this:
[ 1461.443844] Chain exists of:
                 (console_sem).lock --> &rq->__lock --> &ctx->lock
[ 1461.443845]  Possible unsafe locking scenario:
[ 1461.443845]        CPU0                    CPU1
[ 1461.443845]        ----                    ----
[ 1461.443846]   lock(&ctx->lock);
[ 1461.443846]                                lock(&rq->__lock);
[ 1461.443846]                                lock(&ctx->lock);
[ 1461.443847]   lock((console_sem).lock);
[ 1461.443847] 
                *** DEADLOCK ***
[ 1461.443848] 3 locks held by intel_pt/40110:
[ 1461.443848]  #0: ff31455ce186cf30 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.443876]  #1: ff31459afe235b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x1a7/0xda0
[ 1461.443878]  #2: ff31455cac47a618 (&ctx->lock){-...}-{2:2}, at: __perf_event_task_sched_out+0x2f8/0x3a0
[ 1461.443880] 
               stack backtrace:
[ 1461.443880] CPU: 120 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
[ 1461.443882] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 1461.443883] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
[ 1461.443883] Call Trace:
[ 1461.443884]  <TASK>
[ 1461.443884]  dump_stack_lvl+0x7e/0xc0
[ 1461.443886]  print_circular_bug+0x2e5/0x300
[ 1461.443888]  check_noncircular+0xfd/0x120
[ 1461.443890]  __lock_acquire+0x15c0/0x2ea0
[ 1461.443892]  ? save_trace+0x3d/0x300
[ 1461.443893]  ? _prb_read_valid+0x1c9/0x4d0
[ 1461.443894]  ? down_trylock+0x12/0x40
[ 1461.443895]  lock_acquire+0xd9/0x260
[ 1461.443896]  ? down_trylock+0x12/0x40
[ 1461.443898]  _raw_spin_lock_irqsave+0x5a/0x90
[ 1461.443899]  ? down_trylock+0x12/0x40
[ 1461.443900]  down_trylock+0x12/0x40
[ 1461.443900]  ? _printk+0x5d/0x80
[ 1461.443902]  __down_trylock_console_sem+0x46/0xc0
[ 1461.443903]  vprintk_emit+0x115/0x330
[ 1461.443904]  _printk+0x5d/0x80
[ 1461.443906]  pt_handle_status+0x1ad/0x200
[ 1461.443908]  pt_event_stop+0x127/0x200
[ 1461.443909]  event_sched_out+0xd4/0x280
[ 1461.443911]  group_sched_out+0x40/0xc0
[ 1461.443912]  __pmu_ctx_sched_out+0xeb/0x140
[ 1461.443914]  ctx_sched_out+0x124/0x190
[ 1461.443916]  __perf_event_task_sched_out+0x31b/0x3a0
[ 1461.443917]  ? lock_is_held_type+0x8e/0x130
[ 1461.443918]  __schedule+0xd60/0xda0
[ 1461.443920]  schedule+0xb0/0x140
[ 1461.443922]  xfer_to_guest_mode_handle_work+0x4c/0xc0
[ 1461.443923]  kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
[ 1461.443950]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
[ 1461.443977]  ? arch_get_unmapped_area_topdown+0x27d/0x2d0
[ 1461.443980]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.444006]  ? lock_acquire+0xd9/0x260
[ 1461.444007]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
[ 1461.444034]  ? get_task_pid+0x20/0x1a0
[ 1461.444036]  ? lock_acquire+0xd9/0x260
[ 1461.444036]  ? get_task_pid+0x20/0x1a0
[ 1461.444037]  ? lock_release+0xf7/0x310
[ 1461.444038]  ? get_task_pid+0x20/0x1a0
[ 1461.444039]  ? get_task_pid+0x20/0x1a0
[ 1461.444041]  kvm_vcpu_ioctl+0x54f/0x620 [kvm]
[ 1461.444067]  ? vm_mmap_pgoff+0x119/0x1b0
[ 1461.444069]  __se_sys_ioctl+0x6b/0xc0
[ 1461.444070]  do_syscall_64+0x83/0x160
[ 1461.444072]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1461.444073] RIP: 0033:0x45d93b
[ 1461.444074] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1461.444075] RSP: 002b:00007fffccda3740 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1461.444076] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
[ 1461.444076] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
[ 1461.444077] RBP: 000000003d655e60 R08: 0000000000000006 R09: 0000000000005000
[ 1461.444077] R10: 0000000000000001 R11: 0000000000000246 R12: 000000003d653840
[ 1461.444078] R13: 0000000000000006 R14: 0000000000000002 R15: 0000000000000002
[ 1461.444079]  </TASK>

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Adrian Hunter 1 year, 3 months ago

On 14/10/24 21:25, Sean Christopherson wrote:
> On Mon, Oct 14, 2024, Adrian Hunter wrote:
>> Ensure Intel PT tracing is disabled before VM-Entry in Intel PT Host/Guest
>> mode.
>>
>> Intel PT has 2 modes for tracing virtual machines. The default is System
>> mode whereby host and guest output to the host trace buffer. The other is
>> Host/Guest mode whereby host and guest output to their own buffers.
>> Host/Guest mode is selected by kvm_intel module parameter pt_mode=1.
>>
>> In Host/Guest mode, the following rule must be followed:
> 
> This is misleading and arguably wrong.  The following "rule" must _always_ be
> followed.  If I weren't intimately familiar with the distinctive style of the
> SDM's consistency checks, odds are good I wouldn't have any idea where this rule
> came from.
> 
>> 	If the logical processor is operating with Intel PT enabled
>> 	(if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the
>> 	"load IA32_RTIT_CTL" VM-entry control must be 0.
> 
>> However, "load IA32_RTIT_CTL" VM-entry control is always 1 in Host/Guest
>> mode, so IA32_RTIT_CTL.TraceEn must always be 0 at VM entry, irrespective
>> of whether guest IA32_RTIT_CTL.TraceEn is 1.
> 
> Explicitly state what the bad behavior is, _somewhere_.  Similar to the previous
> patch, their is a lot of information to wade through just to understand that this
> results in a failed VM-Entry.

Sorry for the slow reply; I have been away.  Yes, the commit message fails
to call out that the issue is a failed VM-Entry.

> 
> Furthermore, nothing in here spells out exactly under what conditions this bug
> surfaces, which makes it unnecessarily difficult to understand what can go wrong,
> and when.
> 
>> Fix by stopping host Intel PT tracing always at VM entry in Host/Guest
> 
> It's not _at_ VM-Entry.  The language matters, because this makes it sound like
> PT tracing is being disabled as part of VM-Entry.
> 
>> mode.
>>
>> That also fixes the issue whereby the Intel PT NMI handler would
>> set IA32_RTIT_CTL.TraceEn back to 1 after KVM has just set it to 0.
> 
> In theory, this should be an entirely separate fix.  In practice, simply clearing
> MSR_IA32_RTIT_CTL before VM-Enter if tracing is enabled doesn't help much, i.e.
> re-enabling in the NMI handler isn't all that rare.

The commit message also fails to make clear that there are 2 ways that
VM-Entry can fail.

1. Not setting MSR_IA32_RTIT_CTL to zero _always_ in host/guest mode.
This is the common case.  Current code sets MSR_IA32_RTIT_CTL to zero
only if the guest has TraceEn, so if the guest is not tracing but the
host is tracing, then VM-Entry fails.

2. More rarely, the PT NMI might set TraceEn again before VM-Entry.
It isn't that easy to hit, but the selftest in patch 3 usually
manages it by using a small buffer size and trying many times, gradually
increasing the amount of trace data.

>                                                      That absolutely needs to
> be called out in the changelog.
> 
>> Fixes: 2ef444f1600b ("KVM: x86: Add Intel PT context switch for each vcpu")
>> Cc: stable@vger.kernel.org
> 
> This is way, way too big for stable@.  Given that host/guest mode is disabled by
> default and that no one has complained about this, I think it's safe to say that
> unless we can provide a minimal patch, fixing this in LTS kernels isn't a priority.
> 
> Alternatively, I'm tempted to simply drop support for host/guest mode.  It clearly
> hasn't been well tested, and given the lack of bug reports, likely doesn't have
> many, if any, users.  And I'm guessing the overhead needed to context switch all
> the RTIT MSRs makes tracing in the guest relatively useless.

As a control flow trace, it is not affected by context switch overhead.
Intel PT timestamps are also not affected by that.

This patch reduces the MSR switching.

> 
> /me fiddles around
> 
> LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,

It doesn't seem wildly broken.  It looks like the VMM is just passing
invalid CPUID and KVM is not validating it.

> I'll post a patch to hide the module param if CONFIG_BROKEN=n (and will omit
> stable@ for the previous patch).
> 
> Going forward, if someone actually cares about virtualizing PT enough to want to
> fix KVM's mess, then they can put in the effort to fix all the bugs, write all
> the tests, and in general clean up the implementation to meet KVM's current
> standards.  E.g. KVM usage of intel_pt_validate_cap() instead of KVM's guest CPUID
> and capabilities infrastructure needs to go.

The problem below seems to be caused by not validating against the *host*
CPUID.  KVM's CPUID information seems to be invalid.

> 
> My vote is to queue the current code for removal, and revisit support after the
> mediated PMU has landed.  Because I don't see any point in supporting Intel PT
> without a mediated PMU, as host/guest mode really only makes sense if the entire
> PMU is being handed over to the guest.

Why?  Intel PT PMU is programmed separately from the x86 PMU.

> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f587daf2a3bb..fe5046709bc3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -217,9 +217,13 @@ module_param(ple_window_shrink, uint, 0444);
>  static unsigned int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
>  module_param(ple_window_max, uint, 0444);
>  
> -/* Default is SYSTEM mode, 1 for host-guest mode */
> +/* Default is SYSTEM mode, 1 for host-guest mode (which is BROKEN) */
> +#ifdef CONFIG_BROKEN
>  int __read_mostly pt_mode = PT_MODE_SYSTEM;
>  module_param(pt_mode, int, S_IRUGO);
> +#else
> +#define pt_mode PT_MODE_SYSTEM
> +#endif
>  
>  struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
>  
> [ 1458.686107] ------------[ cut here ]------------
> [ 1458.690766] Invalid MSR 588, please adapt vmx_possible_passthrough_msrs[]

The VMM is trying to set a non-existent MSR.  It looks like it has
decided there are more PT address filter MSRs than are architecturally
possible.

I had no idea QEMU was so broken.  I always just use -cpu host.

What were you setting?

> [ 1458.690790] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.708588] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1458.725826] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S                 6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1458.736197] Tainted: [S]=CPU_OUT_OF_SPEC
> [ 1458.740145] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1458.747651] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.754561] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1458.773346] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1458.778598] RAX: 49af8c020dc11100 RBX: 0000000000000588 RCX: 0000000000000027
> [ 1458.785761] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1458.792929] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1458.800082] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1458.807250] R13: 0000000000000004 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1458.814409] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1458.822525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1458.828295] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1458.835457] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1458.842619] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1458.849794] PKRU: 55555554
> [ 1458.852537] Call Trace:
> [ 1458.855013]  <TASK>
> [ 1458.857151]  ? __warn+0xce/0x210
> [ 1458.860417]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.866713]  ? report_bug+0xbd/0x160
> [ 1458.870320]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.876628]  ? handle_bug+0x63/0x90
> [ 1458.880156]  ? exc_invalid_op+0x1a/0x50
> [ 1458.884021]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1458.888243]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.894544]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1458.900846]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1458.906974]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
> [ 1458.913017]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1458.918140]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1458.922795]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1458.927902]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1458.932140]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1458.937252]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1458.942184]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1458.946688]  ? __mutex_lock+0x65/0xbe0
> [ 1458.950473]  ? __mutex_lock+0x231/0xbe0
> [ 1458.954345]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1458.958929]  ? kfree+0x4a/0x380
> [ 1458.962109]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1458.966852]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1458.971272]  ? vma_end_read+0x14/0xf0
> [ 1458.974969]  ? vma_end_read+0xd2/0xf0
> [ 1458.978664]  __se_sys_ioctl+0x6b/0xc0
> [ 1458.982366]  do_syscall_64+0x83/0x160
> [ 1458.986075]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1458.991160] RIP: 0033:0x45d93b
> [ 1458.994252] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1459.013025] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1459.020624] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1459.027789] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1459.034952] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1459.042112] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1459.049271] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1459.056440]  </TASK>
> [ 1459.058670] irq event stamp: 10347
> [ 1459.062107] hardirqs last  enabled at (10357): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1459.070749] hardirqs last disabled at (10372): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1459.079400] softirqs last  enabled at (10418): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.087953] softirqs last disabled at (10381): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.096505] ---[ end trace 0000000000000000 ]---
> [ 1459.101160] ------------[ cut here ]------------
> [ 1459.105817] Invalid MSR 589, please adapt vmx_possible_passthrough_msrs[]
> [ 1459.105826] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.123618] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1459.140843] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1459.151217] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1459.156042] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1459.163554] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.170459] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1459.189245] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1459.194502] RAX: 49af8c020dc11100 RBX: 0000000000000589 RCX: 0000000000000027
> [ 1459.201670] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1459.208830] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1459.215990] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1459.223154] R13: 0000000000000004 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1459.230319] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1459.238437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1459.244208] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1459.251369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1459.258530] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1459.265698] PKRU: 55555554
> [ 1459.268441] Call Trace:
> [ 1459.270918]  <TASK>
> [ 1459.273053]  ? __warn+0xce/0x210
> [ 1459.276311]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.282614]  ? report_bug+0xbd/0x160
> [ 1459.286234]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.292535]  ? handle_bug+0x63/0x90
> [ 1459.296052]  ? exc_invalid_op+0x1a/0x50
> [ 1459.299917]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1459.304133]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.310434]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.316732]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1459.322858]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
> [ 1459.328903]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.334016]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1459.338674]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.343778]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1459.348017]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.353126]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1459.358064]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1459.362559]  ? __mutex_lock+0x65/0xbe0
> [ 1459.366340]  ? __mutex_lock+0x231/0xbe0
> [ 1459.370205]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1459.374789]  ? kfree+0x4a/0x380
> [ 1459.377958]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1459.382699]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1459.387118]  ? vma_end_read+0x14/0xf0
> [ 1459.390814]  ? vma_end_read+0xd2/0xf0
> [ 1459.394507]  __se_sys_ioctl+0x6b/0xc0
> [ 1459.398205]  do_syscall_64+0x83/0x160
> [ 1459.401903]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1459.406992] RIP: 0033:0x45d93b
> [ 1459.410081] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1459.428854] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1459.436458] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1459.443621] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1459.450778] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1459.457940] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1459.465109] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1459.472273]  </TASK>
> [ 1459.474493] irq event stamp: 11613
> [ 1459.477922] hardirqs last  enabled at (11623): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1459.486562] hardirqs last disabled at (11632): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1459.495198] softirqs last  enabled at (11580): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.503755] softirqs last disabled at (11651): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.512304] ---[ end trace 0000000000000000 ]---
> [ 1459.516951] ------------[ cut here ]------------
> [ 1459.521594] Invalid MSR 58a, please adapt vmx_possible_passthrough_msrs[]
> [ 1459.521601] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.539388] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1459.556613] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1459.566986] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1459.571809] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1459.579318] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.586226] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1459.605008] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1459.610262] RAX: 49af8c020dc11100 RBX: 000000000000058a RCX: 0000000000000027
> [ 1459.617423] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1459.624584] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1459.631754] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1459.638915] R13: 0000000000000005 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1459.646071] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1459.654185] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1459.659960] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1459.667125] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1459.674287] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1459.681450] PKRU: 55555554
> [ 1459.684192] Call Trace:
> [ 1459.686675]  <TASK>
> [ 1459.688814]  ? __warn+0xce/0x210
> [ 1459.692077]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.698379]  ? report_bug+0xbd/0x160
> [ 1459.701999]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.708312]  ? handle_bug+0x63/0x90
> [ 1459.711837]  ? exc_invalid_op+0x1a/0x50
> [ 1459.715704]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1459.719927]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.726225]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.732520]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1459.738645]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
> [ 1459.744682]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.749787]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1459.754443]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.759550]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1459.763798]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1459.768911]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1459.773844]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1459.778348]  ? __mutex_lock+0x65/0xbe0
> [ 1459.782133]  ? __mutex_lock+0x231/0xbe0
> [ 1459.786008]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1459.790602]  ? kfree+0x4a/0x380
> [ 1459.793780]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1459.798513]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1459.802922]  ? vma_end_read+0x14/0xf0
> [ 1459.806613]  ? vma_end_read+0xd2/0xf0
> [ 1459.810307]  __se_sys_ioctl+0x6b/0xc0
> [ 1459.813999]  do_syscall_64+0x83/0x160
> [ 1459.817692]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1459.822779] RIP: 0033:0x45d93b
> [ 1459.825862] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1459.844633] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1459.852227] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1459.859394] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1459.866555] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1459.873729] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1459.880889] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1459.888053]  </TASK>
> [ 1459.890276] irq event stamp: 12747
> [ 1459.893707] hardirqs last  enabled at (12757): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1459.902345] hardirqs last disabled at (12766): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1459.910978] softirqs last  enabled at (12716): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.919527] softirqs last disabled at (12703): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1459.928078] ---[ end trace 0000000000000000 ]---
> [ 1459.932723] ------------[ cut here ]------------
> [ 1459.937370] Invalid MSR 58b, please adapt vmx_possible_passthrough_msrs[]
> [ 1459.937376] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1459.955169] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1459.972406] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1459.982794] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1459.987619] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1459.995124] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.002033] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1460.020843] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1460.026103] RAX: 49af8c020dc11100 RBX: 000000000000058b RCX: 0000000000000027
> [ 1460.033267] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1460.040429] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1460.047591] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1460.054752] R13: 0000000000000005 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1460.061918] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1460.070028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1460.075801] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1460.082964] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1460.090132] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1460.097295] PKRU: 55555554
> [ 1460.100033] Call Trace:
> [ 1460.102511]  <TASK>
> [ 1460.104641]  ? __warn+0xce/0x210
> [ 1460.107905]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.114203]  ? report_bug+0xbd/0x160
> [ 1460.117808]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.124111]  ? handle_bug+0x63/0x90
> [ 1460.127639]  ? exc_invalid_op+0x1a/0x50
> [ 1460.131511]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1460.135729]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.142026]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.148321]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1460.154450]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
> [ 1460.160489]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.165600]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1460.170258]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.175363]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1460.179604]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.184706]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1460.189644]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1460.194146]  ? __mutex_lock+0x65/0xbe0
> [ 1460.197924]  ? __mutex_lock+0x231/0xbe0
> [ 1460.201789]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1460.206377]  ? kfree+0x4a/0x380
> [ 1460.209553]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1460.214302]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1460.218718]  ? vma_end_read+0x14/0xf0
> [ 1460.222418]  ? vma_end_read+0xd2/0xf0
> [ 1460.226117]  __se_sys_ioctl+0x6b/0xc0
> [ 1460.229811]  do_syscall_64+0x83/0x160
> [ 1460.233521]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1460.238610] RIP: 0033:0x45d93b
> [ 1460.241699] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1460.260470] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1460.268067] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1460.275228] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1460.282390] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1460.289557] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1460.296718] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1460.303887]  </TASK>
> [ 1460.306114] irq event stamp: 14023
> [ 1460.309551] hardirqs last  enabled at (14033): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1460.318187] hardirqs last disabled at (14042): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1460.326831] softirqs last  enabled at (14070): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1460.335378] softirqs last disabled at (14083): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1460.343926] ---[ end trace 0000000000000000 ]---
> [ 1460.348579] ------------[ cut here ]------------
> [ 1460.353231] Invalid MSR 58c, please adapt vmx_possible_passthrough_msrs[]
> [ 1460.353237] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.371028] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1460.388254] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1460.398631] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1460.403459] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1460.410967] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.417877] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1460.436658] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1460.441918] RAX: 49af8c020dc11100 RBX: 000000000000058c RCX: 0000000000000027
> [ 1460.449083] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1460.456247] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1460.463406] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1460.470566] R13: 0000000000000006 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1460.477728] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1460.485848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1460.491623] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1460.498787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1460.505952] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1460.513119] PKRU: 55555554
> [ 1460.515861] Call Trace:
> [ 1460.518335]  <TASK>
> [ 1460.520473]  ? __warn+0xce/0x210
> [ 1460.523737]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.530041]  ? report_bug+0xbd/0x160
> [ 1460.533654]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.539952]  ? handle_bug+0x63/0x90
> [ 1460.543477]  ? exc_invalid_op+0x1a/0x50
> [ 1460.547344]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1460.551565]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.557869]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.564171]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1460.570300]  pt_update_intercept_for_msr+0x18e/0x2d0 [kvm_intel]
> [ 1460.576335]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.581440]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1460.586096]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.591202]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1460.595449]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.600564]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1460.605503]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1460.610009]  ? __mutex_lock+0x65/0xbe0
> [ 1460.613797]  ? __mutex_lock+0x231/0xbe0
> [ 1460.617669]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1460.622267]  ? kfree+0x4a/0x380
> [ 1460.625445]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1460.630186]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1460.634605]  ? vma_end_read+0x14/0xf0
> [ 1460.638306]  ? vma_end_read+0xd2/0xf0
> [ 1460.642004]  __se_sys_ioctl+0x6b/0xc0
> [ 1460.645704]  do_syscall_64+0x83/0x160
> [ 1460.649397]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1460.654485] RIP: 0033:0x45d93b
> [ 1460.657578] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1460.676348] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1460.683942] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1460.691108] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1460.698271] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1460.705432] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1460.712594] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1460.719757]  </TASK>
> [ 1460.721980] irq event stamp: 15053
> [ 1460.725410] hardirqs last  enabled at (15063): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1460.734047] hardirqs last disabled at (15072): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1460.742686] softirqs last  enabled at (15104): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1460.751238] softirqs last disabled at (15115): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1460.759781] ---[ end trace 0000000000000000 ]---
> [ 1460.764428] ------------[ cut here ]------------
> [ 1460.769071] Invalid MSR 58d, please adapt vmx_possible_passthrough_msrs[]
> [ 1460.769077] WARNING: CPU: 0 PID: 40110 at arch/x86/kvm/vmx/vmx.c:701 vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.786863] Modules linked in: kvm_intel kvm vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
> [ 1460.804086] CPU: 0 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1460.814453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1460.819275] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1460.826784] RIP: 0010:vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.833692] Code: 00 00 c3 cc cc cc cc cc b8 02 00 00 00 c3 cc cc cc cc cc b8 0f 00 00 00 c3 cc cc cc cc cc 48 c7 c7 af ed ac c0 e8 4e 80 43 ee <0f> 0b b8 fe ff ff ff c3 cc cc cc cc cc 90 90 90 90 90 90 90 90 90
> [ 1460.852464] RSP: 0018:ff31455ca2bbfc78 EFLAGS: 00010246
> [ 1460.857716] RAX: 49af8c020dc11100 RBX: 000000000000058d RCX: 0000000000000027
> [ 1460.864876] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ff31459afc420b08
> [ 1460.872035] RBP: 0000000000000003 R08: 000000000000ffff R09: ff3145dbffc5f000
> [ 1460.879203] R10: 000000000002fffd R11: 0000000000000004 R12: 000000000000240d
> [ 1460.886372] R13: 0000000000000006 R14: ff31455ce186ce80 R15: ff31455cf6c9a000
> [ 1460.893543] FS:  000000003d6523c0(0000) GS:ff31459afc400000(0000) knlGS:0000000000000000
> [ 1460.901658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1460.907445] CR2: 000000003d6567c8 CR3: 0000000137ca0003 CR4: 0000000000f73ef0
> [ 1460.914605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1460.921759] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1460.928920] PKRU: 55555554
> [ 1460.931657] Call Trace:
> [ 1460.934138]  <TASK>
> [ 1460.936276]  ? __warn+0xce/0x210
> [ 1460.939539]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.945842]  ? report_bug+0xbd/0x160
> [ 1460.949459]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.955756]  ? handle_bug+0x63/0x90
> [ 1460.959284]  ? exc_invalid_op+0x1a/0x50
> [ 1460.963153]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1460.967368]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.973665]  ? vmx_get_passthrough_msr_slot+0x222/0x230 [kvm_intel]
> [ 1460.979961]  vmx_disable_intercept_for_msr+0x38/0x170 [kvm_intel]
> [ 1460.986086]  pt_update_intercept_for_msr+0x19e/0x2d0 [kvm_intel]
> [ 1460.992125]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1460.997233]  vmx_set_msr+0xae3/0xbf0 [kvm_intel]
> [ 1461.001891]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1461.006999]  __kvm_set_msr+0xa3/0x180 [kvm]
> [ 1461.011248]  ? kvm_arch_vcpu_ioctl+0x2e2/0x1150 [kvm]
> [ 1461.016361]  kvm_arch_vcpu_ioctl+0xf10/0x1150 [kvm]
> [ 1461.021301]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.025795]  ? __mutex_lock+0x65/0xbe0
> [ 1461.029575]  ? __mutex_lock+0x231/0xbe0
> [ 1461.033438]  ? kvm_vcpu_ioctl+0x589/0x620 [kvm]
> [ 1461.038032]  ? kfree+0x4a/0x380
> [ 1461.041209]  ? __mutex_unlock_slowpath+0x3a/0x230
> [ 1461.045950]  kvm_vcpu_ioctl+0x4f8/0x620 [kvm]
> [ 1461.050370]  ? vma_end_read+0x14/0xf0
> [ 1461.054069]  ? vma_end_read+0xd2/0xf0
> [ 1461.057768]  __se_sys_ioctl+0x6b/0xc0
> [ 1461.061463]  do_syscall_64+0x83/0x160
> [ 1461.065160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.070244] RIP: 0033:0x45d93b
> [ 1461.073335] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1461.092107] RSP: 002b:00007fffccda3ba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1461.099706] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1461.106867] RDX: 00007fffccda3c00 RSI: 000000004008ae89 RDI: 0000000000000005
> [ 1461.114035] RBP: 000000000000240d R08: 0000000000000000 R09: 0000000000000007
> [ 1461.121198] R10: 000000003d6563ec R11: 0000000000000246 R12: 0000000000000570
> [ 1461.128364] R13: 00000000004f5b40 R14: 0000000000000002 R15: 0000000000000002
> [ 1461.135530]  </TASK>
> [ 1461.137753] irq event stamp: 16059
> [ 1461.141183] hardirqs last  enabled at (16069): [<ffffffffaef6b916>] __console_unlock+0x76/0xa0
> [ 1461.149819] hardirqs last disabled at (16078): [<ffffffffaef6b8fb>] __console_unlock+0x5b/0xa0
> [ 1461.158458] softirqs last  enabled at (16046): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1461.167003] softirqs last disabled at (16041): [<ffffffffaeed4d3a>] __irq_exit_rcu+0x6a/0x100
> [ 1461.175545] ---[ end trace 0000000000000000 ]---
> [ 1461.201335] kvm_intel: PT tracing already disabled, RTIT_CTL = 0
> [ 1461.207370] unchecked MSR access error: RDMSR from 0x584 at rIP: 0xffffffffc0a9d5a7 (pt_save_msr+0x77/0x1a0 [kvm_intel])

Again it seems like the VMM has managed to define an invalid number
of address filters.  Looking at the code, I cannot see anywhere
that it validates what the host actually supports, but no processor
currently supports more than 2, so the valid address filter MSRs are
at most:

	#define MSR_IA32_RTIT_ADDR0_A		0x00000580
	#define MSR_IA32_RTIT_ADDR0_B		0x00000581
	#define MSR_IA32_RTIT_ADDR1_A		0x00000582
	#define MSR_IA32_RTIT_ADDR1_B		0x00000583
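
To illustrate the arithmetic: each address range uses a pair of MSRs
(ADDRn_A and ADDRn_B), so with 2 ranges anything above 0x583 is out of
bounds.  A minimal sketch of such a bounds check follows.  Note that
pt_addr_msr_is_valid() is a hypothetical helper made up for illustration,
not an existing KVM function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base of the Intel PT address filter MSRs, as listed above. */
#define MSR_IA32_RTIT_ADDR0_A		0x00000580u

/*
 * Hypothetical helper (not a KVM function): return true if 'msr' is one of
 * the address filter MSRs that exist when 'nr_addr_ranges' ranges are
 * supported.  Each range occupies two consecutive MSRs (ADDRn_A, ADDRn_B).
 */
static bool pt_addr_msr_is_valid(uint32_t msr, unsigned int nr_addr_ranges)
{
	uint32_t end = MSR_IA32_RTIT_ADDR0_A + 2 * nr_addr_ranges;

	return msr >= MSR_IA32_RTIT_ADDR0_A && msr < end;
}
```

With nr_addr_ranges = 2, this accepts 0x580 to 0x583 and rejects 0x584,
0x58c and 0x58d from the log above.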


> [ 1461.218257] Call Trace:
> [ 1461.220731]  <TASK>
> [ 1461.222861]  ? fixup_exception+0x50e/0x580
> [ 1461.226985]  ? up+0x14/0x50
> [ 1461.229802]  ? gp_try_fixup_and_notify+0x34/0xe0
> [ 1461.234438]  ? exc_general_protection+0xe5/0x1f0
> [ 1461.239073]  ? lock_release+0xf7/0x310
> [ 1461.242845]  ? prb_read_valid+0x29/0x50
> [ 1461.246700]  ? asm_exc_general_protection+0x26/0x30
> [ 1461.251603]  ? pt_save_msr+0x77/0x1a0 [kvm_intel]
> [ 1461.256330]  vmx_vcpu_run+0x687/0xb20 [kvm_intel]
> [ 1461.261063]  ? lockdep_hardirqs_on_prepare+0x163/0x250
> [ 1461.266221]  ? lock_release+0xf7/0x310
> [ 1461.269997]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
> [ 1461.275360]  kvm_arch_vcpu_ioctl_run+0x1784/0x2720 [kvm]
> [ 1461.280718]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
> [ 1461.286075]  ? arch_get_unmapped_area_topdown+0x27d/0x2d0
> [ 1461.291492]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.295980]  ? lock_acquire+0xd9/0x260
> [ 1461.299749]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.304237]  ? get_task_pid+0x20/0x1a0
> [ 1461.308012]  ? lock_acquire+0xd9/0x260
> [ 1461.311786]  ? get_task_pid+0x20/0x1a0
> [ 1461.315561]  ? lock_release+0xf7/0x310
> [ 1461.319337]  ? get_task_pid+0x20/0x1a0
> [ 1461.323110]  ? get_task_pid+0x20/0x1a0
> [ 1461.326886]  kvm_vcpu_ioctl+0x54f/0x620 [kvm]
> [ 1461.331287]  ? vm_mmap_pgoff+0x119/0x1b0
> [ 1461.335231]  __se_sys_ioctl+0x6b/0xc0
> [ 1461.338914]  do_syscall_64+0x83/0x160
> [ 1461.342598]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.347668] RIP: 0033:0x45d93b
> [ 1461.350748] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1461.369518] RSP: 002b:00007fffccda3740 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1461.377111] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1461.384267] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
> [ 1461.391416] RBP: 000000003d655e60 R08: 0000000000000006 R09: 0000000000005000
> [ 1461.398566] R10: 0000000000000001 R11: 0000000000000246 R12: 000000003d653840
> [ 1461.405720] R13: 0000000000000006 R14: 0000000000000002 R15: 0000000000000002
> [ 1461.412879]  </TASK>
> [ 1461.415101] kvm_intel: Loading guest Intel PT MSRs
> [ 1461.420361] kvm_intel: Cleared RTIT_CTL
> [ 1461.424252] kvm_intel: Cleared RTIT_CTL
> [ 1461.428126] kvm_intel: Cleared RTIT_CTL
> [ 1461.432002] kvm_intel: Cleared RTIT_CTL
> [ 1461.435868] kvm_intel: Cleared RTIT_CTL
> [ 1461.439736] kvm_intel: Cleared RTIT_CTL
> [ 1461.443644] pt: ToPA ERROR encountered, trying to recover

I'd guess the unchecked MSR access has left the PT MSRs
in a half-updated state.

> 
> [ 1461.443652] ======================================================
> [ 1461.443653] WARNING: possible circular locking dependency detected
> [ 1461.443654] 6.12.0-smp--65cbdf61cc85-dbg #445 Tainted: G S      W         
> [ 1461.443655] ------------------------------------------------------
> [ 1461.443656] intel_pt/40110 is trying to acquire lock:
> [ 1461.443657] ffffffffb0672898 ((console_sem).lock){-...}-{2:2}, at: down_trylock+0x12/0x40

Console printing from interrupt context does seem to deadlock
so that is likely not related.

> [ 1461.443660] 
>                but task is already holding lock:
> [ 1461.443660] ff31455cac47a618 (&ctx->lock){-...}-{2:2}, at: __perf_event_task_sched_out+0x2f8/0x3a0
> [ 1461.443663] 
>                which lock already depends on the new lock.
>
> [ 1461.443664] 
>                the existing dependency chain (in reverse order) is:
> [ 1461.443664] 
>                -> #3 (&ctx->lock){-...}-{2:2}:
> [ 1461.443665]        _raw_spin_lock+0x30/0x40
> [ 1461.443667]        __perf_event_task_sched_out+0x2f8/0x3a0
> [ 1461.443669]        __schedule+0xd60/0xda0
> [ 1461.443671]        schedule+0xb0/0x140
> [ 1461.443672]        xfer_to_guest_mode_handle_work+0x4c/0xc0
> [ 1461.443674]        kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
> [ 1461.443708]        kvm_vcpu_ioctl+0x54f/0x620 [kvm]
> [ 1461.443735]        __se_sys_ioctl+0x6b/0xc0
> [ 1461.443737]        do_syscall_64+0x83/0x160
> [ 1461.443738]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.443739] 
>                -> #2 (&rq->__lock){-.-.}-{2:2}:
> [ 1461.443740]        _raw_spin_lock_nested+0x2e/0x40
> [ 1461.443742]        __task_rq_lock+0x5d/0x100
> [ 1461.443744]        wake_up_new_task+0xf8/0x300
> [ 1461.443745]        kernel_clone+0x187/0x340
> [ 1461.443746]        user_mode_thread+0xc0/0xf0
> [ 1461.443748]        rest_init+0x1f/0x1f0
> [ 1461.443749]        start_kernel+0x38f/0x3d0
> [ 1461.443750]        x86_64_start_reservations+0x24/0x30
> [ 1461.443751]        x86_64_start_kernel+0xa9/0xb0
> [ 1461.443752]        common_startup_64+0x13e/0x140
> [ 1461.443753] 
>                -> #1 (&p->pi_lock){-.-.}-{2:2}:
> [ 1461.443754]        _raw_spin_lock_irqsave+0x5a/0x90
> [ 1461.443755]        try_to_wake_up+0x56/0x840
> [ 1461.443756]        up+0x3d/0x50
> [ 1461.443757]        __console_unlock+0x6c/0xa0
> [ 1461.443758]        console_unlock+0x6c/0x110
> [ 1461.443758]        vprintk_emit+0x22e/0x330
> [ 1461.443759]        _printk+0x5d/0x80
> [ 1461.443761]        do_exit+0x7fb/0xa90
> [ 1461.443762]        __x64_sys_exit+0x17/0x20
> [ 1461.443764]        x64_sys_call+0x2113/0x2130
> [ 1461.443765]        do_syscall_64+0x83/0x160
> [ 1461.443766]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.443767] 
>                -> #0 ((console_sem).lock){-...}-{2:2}:
> [ 1461.443768]        __lock_acquire+0x15c0/0x2ea0
> [ 1461.443769]        lock_acquire+0xd9/0x260
> [ 1461.443770]        _raw_spin_lock_irqsave+0x5a/0x90
> [ 1461.443771]        down_trylock+0x12/0x40
> [ 1461.443772]        __down_trylock_console_sem+0x46/0xc0
> [ 1461.443773]        vprintk_emit+0x115/0x330
> [ 1461.443773]        _printk+0x5d/0x80
> [ 1461.443774]        pt_handle_status+0x1ad/0x200
> [ 1461.443776]        pt_event_stop+0x127/0x200
> [ 1461.443777]        event_sched_out+0xd4/0x280
> [ 1461.443779]        group_sched_out+0x40/0xc0
> [ 1461.443780]        __pmu_ctx_sched_out+0xeb/0x140
> [ 1461.443781]        ctx_sched_out+0x124/0x190
> [ 1461.443782]        __perf_event_task_sched_out+0x31b/0x3a0
> [ 1461.443783]        __schedule+0xd60/0xda0
> [ 1461.443785]        schedule+0xb0/0x140
> [ 1461.443786]        xfer_to_guest_mode_handle_work+0x4c/0xc0
> [ 1461.443787]        kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
> [ 1461.443814]        kvm_vcpu_ioctl+0x54f/0x620 [kvm]
> [ 1461.443840]        __se_sys_ioctl+0x6b/0xc0
> [ 1461.443842]        do_syscall_64+0x83/0x160
> [ 1461.443842]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.443843] 
>                other info that might help us debug this:
>
> [ 1461.443844] Chain exists of:
>                  (console_sem).lock --> &rq->__lock --> &ctx->lock
>
> [ 1461.443845]  Possible unsafe locking scenario:
>
> [ 1461.443845]        CPU0                    CPU1
> [ 1461.443845]        ----                    ----
> [ 1461.443846]   lock(&ctx->lock);
> [ 1461.443846]                                lock(&rq->__lock);
> [ 1461.443846]                                lock(&ctx->lock);
> [ 1461.443847]   lock((console_sem).lock);
> [ 1461.443847] 
>                 *** DEADLOCK ***
>
> [ 1461.443848] 3 locks held by intel_pt/40110:
> [ 1461.443848]  #0: ff31455ce186cf30 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.443876]  #1: ff31459afe235b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x1a7/0xda0
> [ 1461.443878]  #2: ff31455cac47a618 (&ctx->lock){-...}-{2:2}, at: __perf_event_task_sched_out+0x2f8/0x3a0
> [ 1461.443880] 
>                stack backtrace:
> [ 1461.443880] CPU: 120 UID: 0 PID: 40110 Comm: intel_pt Tainted: G S      W          6.12.0-smp--65cbdf61cc85-dbg #445
> [ 1461.443882] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 1461.443883] Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240508.2-0 06/25/2024
> [ 1461.443883] Call Trace:
> [ 1461.443884]  <TASK>
> [ 1461.443884]  dump_stack_lvl+0x7e/0xc0
> [ 1461.443886]  print_circular_bug+0x2e5/0x300
> [ 1461.443888]  check_noncircular+0xfd/0x120
> [ 1461.443890]  __lock_acquire+0x15c0/0x2ea0
> [ 1461.443892]  ? save_trace+0x3d/0x300
> [ 1461.443893]  ? _prb_read_valid+0x1c9/0x4d0
> [ 1461.443894]  ? down_trylock+0x12/0x40
> [ 1461.443895]  lock_acquire+0xd9/0x260
> [ 1461.443896]  ? down_trylock+0x12/0x40
> [ 1461.443898]  _raw_spin_lock_irqsave+0x5a/0x90
> [ 1461.443899]  ? down_trylock+0x12/0x40
> [ 1461.443900]  down_trylock+0x12/0x40
> [ 1461.443900]  ? _printk+0x5d/0x80
> [ 1461.443902]  __down_trylock_console_sem+0x46/0xc0
> [ 1461.443903]  vprintk_emit+0x115/0x330
> [ 1461.443904]  _printk+0x5d/0x80
> [ 1461.443906]  pt_handle_status+0x1ad/0x200
> [ 1461.443908]  pt_event_stop+0x127/0x200
> [ 1461.443909]  event_sched_out+0xd4/0x280
> [ 1461.443911]  group_sched_out+0x40/0xc0
> [ 1461.443912]  __pmu_ctx_sched_out+0xeb/0x140
> [ 1461.443914]  ctx_sched_out+0x124/0x190
> [ 1461.443916]  __perf_event_task_sched_out+0x31b/0x3a0
> [ 1461.443917]  ? lock_is_held_type+0x8e/0x130
> [ 1461.443918]  __schedule+0xd60/0xda0
> [ 1461.443920]  schedule+0xb0/0x140
> [ 1461.443922]  xfer_to_guest_mode_handle_work+0x4c/0xc0
> [ 1461.443923]  kvm_arch_vcpu_ioctl_run+0x1a1b/0x2720 [kvm]
> [ 1461.443950]  ? kvm_arch_vcpu_ioctl_run+0x9f/0x2720 [kvm]
> [ 1461.443977]  ? arch_get_unmapped_area_topdown+0x27d/0x2d0
> [ 1461.443980]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.444006]  ? lock_acquire+0xd9/0x260
> [ 1461.444007]  ? kvm_vcpu_ioctl+0x85/0x620 [kvm]
> [ 1461.444034]  ? get_task_pid+0x20/0x1a0
> [ 1461.444036]  ? lock_acquire+0xd9/0x260
> [ 1461.444036]  ? get_task_pid+0x20/0x1a0
> [ 1461.444037]  ? lock_release+0xf7/0x310
> [ 1461.444038]  ? get_task_pid+0x20/0x1a0
> [ 1461.444039]  ? get_task_pid+0x20/0x1a0
> [ 1461.444041]  kvm_vcpu_ioctl+0x54f/0x620 [kvm]
> [ 1461.444067]  ? vm_mmap_pgoff+0x119/0x1b0
> [ 1461.444069]  __se_sys_ioctl+0x6b/0xc0
> [ 1461.444070]  do_syscall_64+0x83/0x160
> [ 1461.444072]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1461.444073] RIP: 0033:0x45d93b
> [ 1461.444074] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [ 1461.444075] RSP: 002b:00007fffccda3740 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1461.444076] RAX: ffffffffffffffda RBX: 000000003d655e60 RCX: 000000000045d93b
> [ 1461.444076] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
> [ 1461.444077] RBP: 000000003d655e60 R08: 0000000000000006 R09: 0000000000005000
> [ 1461.444077] R10: 0000000000000001 R11: 0000000000000246 R12: 000000003d653840
> [ 1461.444078] R13: 0000000000000006 R14: 0000000000000002 R15: 0000000000000002
> [ 1461.444079]  </TASK>
> 
> 
>

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Sean Christopherson 1 year, 3 months ago

On Tue, Oct 22, 2024, Adrian Hunter wrote:
> On 14/10/24 21:25, Sean Christopherson wrote:
> >> Fixes: 2ef444f1600b ("KVM: x86: Add Intel PT context switch for each vcpu")
> >> Cc: stable@vger.kernel.org
> > 
> > This is way, way too big for stable@.  Given that host/guest mode is disabled by
> > default and that no one has complained about this, I think it's safe to say that
> > unless we can provide a minimal patch, fixing this in LTS kernels isn't a priority.
> > 
> > Alternatively, I'm tempted to simply drop support for host/guest mode.  It clearly
> > hasn't been well tested, and given the lack of bug reports, likely doesn't have
> > many, if any, users.  And I'm guessing the overhead needed to context switch all
> > the RTIT MSRs makes tracing in the guest relatively useless.
> 
> As a control flow trace, it is not affected by context switch overhead.

Out of curiosity, how much is Intel PT used purely for control flow tracing, i.e.
without caring _at all_ about perceived execution time?

> Intel PT timestamps are also not affected by that.

Timestamps are affected because the guest will see inexplicable jumps in time.
Those gaps are unavoidable to some degree, but context switching on every entry
and exit is 

> This patch reduces the MSR switching.

To be clear, I'm not objecting to any of the ideas in this patch, I'm "objecting"
to trying to put band-aids on KVM's existing implementation, which is clearly
buggy and, like far too many PMU-ish features in KVM, was probably developed
without any thought as to how it would affect use cases beyond the host admin
and the VM owner being a single person.  And I'm also objecting, vehemently, to
sending anything of this magnitude and complexity to LTS kernels.

> > /me fiddles around
> > 
> > LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,
> 
> It doesn't seem wildly broken.  Just the VMM passing invalid CPUID
> and KVM not validating it.

Heh, I agree with "just", but unfortunately "just ... not validating" a large
swath of userspace inputs is pretty wildly broken.  More importantly, it's not
easy to fix.  E.g. KVM could require the inputs to exactly match hardware, but
that creates an ABI that I'm not entirely sure is desirable in the long term.
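
For illustration only, the "exactly match hardware" policy and a looser
"never exceed hardware" alternative could be sketched as below.  All helper
names are hypothetical, the looser policy is an assumption added for
contrast rather than something proposed in this thread, and the sketch
assumes the number of address ranges is reported in
CPUID.(EAX=0x14,ECX=1):EAX[2:0].  It is not how KVM is implemented today.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: CPUID.(EAX=0x14,ECX=1):EAX[2:0] = number of address ranges. */
#define PT_NR_ADDR_RANGES_MASK	0x7u

/* Hypothetical helper: extract the address range count from the EAX value. */
static unsigned int pt_nr_addr_ranges(uint32_t cpuid_14_1_eax)
{
	return cpuid_14_1_eax & PT_NR_ADDR_RANGES_MASK;
}

/* Strict policy: userspace must advertise exactly what the host has. */
static bool pt_cpuid_matches_host(uint32_t guest_eax, uint32_t host_eax)
{
	return pt_nr_addr_ranges(guest_eax) == pt_nr_addr_ranges(host_eax);
}

/* Looser policy (assumption): userspace may advertise fewer, never more. */
static bool pt_cpuid_within_host(uint32_t guest_eax, uint32_t host_eax)
{
	return pt_nr_addr_ranges(guest_eax) <= pt_nr_addr_ranges(host_eax);
}
```

Either check would reject a guest that advertises more address ranges than
the host has.  They differ only in whether advertising fewer ranges than
the host is allowed, which is the part that becomes ABI.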

> > I'll post a patch to hide the module param if CONFIG_BROKEN=n (and will omit
> > stable@ for the previous patch).
> > 
> > Going forward, if someone actually cares about virtualizing PT enough to want to
> > fix KVM's mess, then they can put in the effort to fix all the bugs, write all
> > the tests, and in general clean up the implementation to meet KVM's current
> > standards.  E.g. KVM usage of intel_pt_validate_cap() instead of KVM's guest CPUID
> > and capabilities infrastructure needs to go.
> 
> The problem below seems to be caused by not validating against the *host*
> CPUID.  KVM's CPUID information seems to be invalid.

Yes.

> > My vote is to queue the current code for removal, and revisit support after the
> > mediated PMU has landed.  Because I don't see any point in supporting Intel PT
> > without a mediated PMU, as host/guest mode really only makes sense if the entire
> > PMU is being handed over to the guest.
> 
> Why?

To simplify the implementation, and because I don't see how virtualizing Intel PT
without also enabling the mediated PMU makes any sense.

Conceptually, KVM's PT implementation is very, very similar to the mediated PMU.
They both effectively give the guest control of hardware when the vCPU starts
running, and take back control when the vCPU stops running.

If KVM allows Intel PT without the mediated PMU, then KVM and perf have to support
two separate implementations for the same model.  If virtualizing Intel PT is
allowed if and only if the mediated PMU is enabled, then .handle_intel_pt_intr()
goes away.  And on the flip side, it becomes super obvious that host usage of
Intel PT needs to be mutually exclusive with the mediated PMU.

> Intel PT PMU is programmed separately from the x86 PMU.

Except for the minor detail that Intel PT generates PMIs, and that PEBS can log
to PT buffers.  Oh, and giving the guest control of the PMU means host usage of
Intel PT will break the host *and* guest.  The host won't get PMIs, while the
guest will see spurious PMIs.

So I don't see any reason to try to separate the two.

> > [ 1458.686107] ------------[ cut here ]------------
> > [ 1458.690766] Invalid MSR 588, please adapt vmx_possible_passthrough_msrs[]
> 
> VMM is trying to set a non-existent MSR.  Looks like it has
> decided there are more PT address filter MSRs than are architecturally
> possible.
> 
> I had no idea QEMU was so broken.  

It's not QEMU that's broken, it's KVM that's broken.  

> I always just use -cpu host.

Yes, and that's exactly the problem.  The only people that have ever touched this
likely only ever use `-cpu host`, and so KVM's flaws have gone unnoticed.

> What were you setting?

I tweaked your selftest to feed KVM garbage.

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Andi Kleen 1 year, 3 months ago

> 
> Out of curiosity, how much is Intel PT used purely for control flow tracing, i.e.
> without caring _at all_ about perceived execution time?

It is very common, e.g. one major use of PT is control flow discovery in
feedback fuzzers.

-Andi

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Adrian Hunter 1 year, 3 months ago

On 22/10/24 19:30, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Adrian Hunter wrote:
>> On 14/10/24 21:25, Sean Christopherson wrote:
>>>> Fixes: 2ef444f1600b ("KVM: x86: Add Intel PT context switch for each vcpu")
>>>> Cc: stable@vger.kernel.org
>>>
>>> This is way, way too big for stable@.  Given that host/guest mode is disabled by
>>> default and that no one has complained about this, I think it's safe to say that
>>> unless we can provide a minimal patch, fixing this in LTS kernels isn't a priority.
>>>
>>> Alternatively, I'm tempted to simply drop support for host/guest mode.  It clearly
>>> hasn't been well tested, and given the lack of bug reports, likely doesn't have
>>> many, if any, users.  And I'm guessing the overhead needed to context switch all
>>> the RTIT MSRs makes tracing in the guest relatively useless.
>>
>> As a control flow trace, it is not affected by context switch overhead.
> 
> Out of curiosity, how much is Intel PT used purely for control flow tracing, i.e.
> without caring _at all_ about perceived execution time?

It can be used to try to understand how code went wrong.

But timestamps would still indicate what was happening on different VCPUs
at about the same time.

> 
>> Intel PT timestamps are also not affected by that.
> 
> Timestamps are affected because the guest will see inexplicable jumps in time.

Tracing a task on the host sees jumps in time for scheduler
context switches anyway.

> Those gaps are unavoidable to some degree, but context switching on every entry
> and exit is 
> 
>> This patch reduces the MSR switching.
> 
> To be clear, I'm not objecting to any of the ideas in this patch, I'm "objecting"
> to trying to put band-aids on KVM's existing implementation, which is clearly
> buggy and, like far too many PMU-ish features in KVM, was probably developed
> without any thought as to how it would affect use cases beyond the host admin
> and the VM owner being a single person.  And I'm also objecting, vehemently, to
> sending anything of this magnitude and complexity to LTS kernels.

That's your call for sure.

> 
>>> /me fiddles around
>>>
>>> LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,
>>
>> It doesn't seem wildly broken.  Just the VMM passing invalid CPUID
>> and KVM not validating it.
> 
> Heh, I agree with "just", but unfortunately "just ... not validating" a large
> swath of userspace inputs is pretty wildly broken.  More importantly, it's not
> easy to fix.  E.g. KVM could require the inputs to exactly match hardware, but
> that creates an ABI that I'm not entirely sure is desirable in the long term.

Although the CPUID ABI does not really change.  KVM does not support
emulating Intel PT, so accepting CPUID that the hardware cannot support
seems like a bit of a lie.  Aren't there other features that KVM does not
support if the hardware support is not there?

To some degree, a testing and debugging feature does not have to be
available in 100% of cases because it can still be useful when it is
available.

> 
>>> I'll post a patch to hide the module param if CONFIG_BROKEN=n (and will omit
>>> stable@ for the previous patch).
>>>
>>> Going forward, if someone actually cares about virtualizing PT enough to want to
>>> fix KVM's mess, then they can put in the effort to fix all the bugs, write all
>>> the tests, and in general clean up the implementation to meet KVM's current
>>> standards.  E.g. KVM usage of intel_pt_validate_cap() instead of KVM's guest CPUID
>>> and capabilities infrastructure needs to go.
>>
>> The problem below seems to be caused by not validating against the *host*
>> CPUID.  KVM's CPUID information seems to be invalid.
> 
> Yes.
> 
>>> My vote is to queue the current code for removal, and revisit support after the
>>> mediated PMU has landed.  Because I don't see any point in supporting Intel PT
>>> without a mediated PMU, as host/guest mode really only makes sense if the entire
>>> PMU is being handed over to the guest.
>>
>> Why?
> 
> To simplify the implementation, and because I don't see how virtualizing Intel PT
> without also enabling the mediated PMU makes any sense.
> 
> Conceptually, KVM's PT implementation is very, very similar to the mediated PMU.
> They both effectively give the guest control of hardware when the vCPU starts
> running, and take back control when the vCPU stops running.
> 
> If KVM allows Intel PT without the mediated PMU, then KVM and perf have to support
> two separate implementations for the same model.  If virtualizing Intel PT is
> allowed if and only if the mediated PMU is enabled, then .handle_intel_pt_intr()
> goes away.  And on the flip side, it becomes super obvious that host usage of
> Intel PT needs to be mutually exclusive with the mediated PMU.

And forgo being able to trace mediated passthrough with Intel PT ;-)

> 
>> Intel PT PMU is programmed separately from the x86 PMU.
> 
> Except for the minor detail that Intel PT generates PMIs, and that PEBS can log
> to PT buffers.  Oh, and giving the guest control of the PMU means host usage of
> Intel PT will break the host *and* guest.  The host won't get PMIs, while the
> guest will see spurious PMIs.

Yes, the PMI is in conflict if it is not also a VM-Exit.

Note that Intel PT does have a snapshot mode that doesn't use a PMI.
Trace data continuously overwrites the ring buffer until the user asks
for a snapshot.

> 
> So I don't see any reason to try to separate the two.

OK

> 
>>> [ 1458.686107] ------------[ cut here ]------------
>>> [ 1458.690766] Invalid MSR 588, please adapt vmx_possible_passthrough_msrs[]
>>
>> VMM is trying to set a non-existent MSR.  Looks like it has
>> decided there are more PT address filter MSRs than are architecturally
>> possible.
>>
>> I had no idea QEMU was so broken.  
> 
> It's not QEMU that's broken, it's KVM that's broken.  
> 
>> I always just use -cpu host.
> 
> Yes, and that's exactly the problem.  The only people that have ever touched this
> likely only ever use `-cpu host`, and so KVM's flaws have gone unnoticed.
> 
>> What were you setting?
> 
> I tweaked your selftest to feed KVM garbage.

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Sean Christopherson 1 year, 3 months ago

On Tue, Oct 22, 2024, Adrian Hunter wrote:
> On 22/10/24 19:30, Sean Christopherson wrote:
> >>> LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,
> >>
> >> It doesn't seem wildly broken.  Just the VMM passing invalid CPUID
> >> and KVM not validating it.
> > 
> > Heh, I agree with "just", but unfortunately "just ... not validating" a large
> > > swath of userspace inputs is pretty wildly broken.  More importantly, it's not
> > easy to fix.  E.g. KVM could require the inputs to exactly match hardware, but
> > that creates an ABI that I'm not entirely sure is desirable in the long term.
> 
> Although the CPUID ABI does not really change.  KVM does not support
> emulating Intel PT, so accepting CPUID that the hardware cannot support
> seems like a bit of a lie.

But it's not all or nothing, e.g. KVM should support exposing fewer address ranges
than are supported by hardware, so that the same virtual CPU model can be run on
different generations of hardware.

> Aren't there other features that KVM does not support if the hardware support
> is not there?

Many.  But either features are one-off things without configurable properties,
or KVM does the right thing (usually).  E.g. nested virtualization heavily relies
on hardware, and has a plethora of knobs, but KVM (usually) honors and validates
the configuration provided by userspace.

> To some degree, a testing and debugging feature does not have to be
> available in 100% of cases because it can still be useful when it is
> available.

I don't disagree, but "works on my machine" is how KVM has gotten into so many
messes with such features.  I also don't necessarily disagree with supporting a
very limited subset of use cases, but I want such support to come as a well-defined
package with proper guard rails, docs, and ideally tests.

> >>> I'll post a patch to hide the module param if CONFIG_BROKEN=n (and will omit
> >>> stable@ for the previous patch).
> >>>
> >>> Going forward, if someone actually cares about virtualizing PT enough to want to
> >>> fix KVM's mess, then they can put in the effort to fix all the bugs, write all
> >>> the tests, and in general clean up the implementation to meet KVM's current
> >>> standards.  E.g. KVM usage of intel_pt_validate_cap() instead of KVM's guest CPUID
> >>> and capabilities infrastructure needs to go.
> >>
> >> The problem below seems to be caused by not validating against the *host*
> >> CPUID.  KVM's CPUID information seems to be invalid.
> > 
> > Yes.
> > 
> >>> My vote is to queue the current code for removal, and revisit support after the
> >>> mediated PMU has landed.  Because I don't see any point in supporting Intel PT
> >>> without a mediated PMU, as host/guest mode really only makes sense if the entire
> >>> PMU is being handed over to the guest.
> >>
> >> Why?
> > 
> > To simplify the implementation, and because I don't see how virtualizing Intel PT
> > without also enabling the mediated PMU makes any sense.
> > 
> > Conceptually, KVM's PT implementation is very, very similar to the mediated PMU.
> > They both effectively give the guest control of hardware when the vCPU starts
> > running, and take back control when the vCPU stops running.
> > 
> > If KVM allows Intel PT without the mediated PMU, then KVM and perf have to support
> > two separate implementations for the same model.  If virtualizing Intel PT is
> > allowed if and only if the mediated PMU is enabled, then .handle_intel_pt_intr()
> > goes away.  And on the flip side, it becomes super obvious that host usage of
> > Intel PT needs to be mutually exclusive with the mediated PMU.
> 
> > And forgo being able to trace mediated passthrough with Intel PT ;-)

It can't work, generally.  Anything that generates a ToPA PMI will go sideways.
In the worst case scenario, the spurious PMI could crash the guest.

And when the mediated PMU supports PEBS, that would likely break too.

Re: [PATCH V13 03/14] KVM: x86: Fix Intel PT Host/Guest mode when host tracing also

Posted by Adrian Hunter 1 year, 3 months ago

On 23/10/24 01:30, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Adrian Hunter wrote:
>> On 22/10/24 19:30, Sean Christopherson wrote:
>>>>> LOL, yeah, this needs to be burned with fire.  It's wildly broken.  So for stable@,
>>>>
>>>> It doesn't seem wildly broken.  Just the VMM passing invalid CPUID
>>>> and KVM not validating it.
>>>
>>> Heh, I agree with "just", but unfortunately "just ... not validating" a large
>>> swath of userspace inputs is pretty wildly broken.  More importantly, it's not
>>> easy to fix.  E.g. KVM could require the inputs to exactly match hardware, but
>>> that creates an ABI that I'm not entirely sure is desirable in the long term.
>>
>> Although the CPUID ABI does not really change.  KVM does not support
>> emulating Intel PT, so accepting CPUID that the hardware cannot support
>> seems like a bit of a lie.
> 
> But it's not all or nothing, e.g. KVM should support exposing fewer address ranges
> than are supported by hardware, so that the same virtual CPU model can be run on
> different generations of hardware.
> 
>> Aren't there other features that KVM does not support if the hardware support
>> is not there?
> 
> Many.  But either features are one-off things without configurable properties,
> or KVM does the right thing (usually).  E.g. nested virtualization heavily relies
> on hardware, and has a plethora of knobs, but KVM (usually) honors and validates
> the configuration provided by userspace.
> 
>> To some degree, a testing and debugging feature does not have to be
>> available in 100% of cases because it can still be useful when it is
>> available.
> 
> I don't disagree, but "works on my machine" is how KVM has gotten into so many
> messes with such features.  I also don't necessarily disagree with supporting a
> very limited subset of use cases, but I want such support to come as a well-defined
> package with proper guard rails, docs, and ideally tests.

Ok, so how about: leave VMM to choose CPUID, but then map it to what the
hardware actually supports for what is possible.  So the guest user might
not get trace data exactly as expected, or perhaps not at all, but at least
KVM doesn't die.  Then add documentation to explain how it all works.

Note, the number of address ranges is not that much of an issue because
currently all processors that support Intel PT virtualization have 2.

I have a feeling QEMU was targeting compatibility with IceLake, which
would probably work for all processors that support Intel PT virtualization
except for one feature - the maximum number of cycle thresholds (dropped
from 2048 to 16)
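
One possible reading of "leave VMM to choose CPUID, but then map it to what
the hardware actually supports" is sketched below. This is only an
illustration under stated assumptions, not KVM code: struct pt_caps,
pt_caps_clamp() and min_u32() are hypothetical names, and the structure is a
simplified stand-in for the fields of CPUID leaf 0x14 subleaf 1. Bitmap
fields are intersected with the host's, and the address range count is capped
at the host's.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified view of CPUID leaf 0x14 subleaf 1.
 * Not a KVM structure; used here for illustration only.
 */
struct pt_caps {
	uint32_t num_address_ranges;	/* count of address filter ranges */
	uint32_t mtc_periods;		/* bitmap of supported MTC periods */
	uint32_t cycle_thresholds;	/* bitmap of supported cycle thresholds */
	uint32_t psb_periods;		/* bitmap of supported PSB periods */
};

static inline uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Reduce what the VMM asked for to what the host can actually provide.
 * Counts are capped, bitmaps are intersected, so the result never
 * advertises something the hardware lacks.
 */
static inline struct pt_caps pt_caps_clamp(struct pt_caps guest,
					   struct pt_caps host)
{
	struct pt_caps eff = {
		.num_address_ranges = min_u32(guest.num_address_ranges,
					      host.num_address_ranges),
		.mtc_periods	  = guest.mtc_periods & host.mtc_periods,
		.cycle_thresholds = guest.cycle_thresholds & host.cycle_thresholds,
		.psb_periods	  = guest.psb_periods & host.psb_periods,
	};

	return eff;
}
```

With such a scheme the guest might get fewer options than its CPUID
advertises, which matches the "not exactly as expected, or perhaps not at
all" trade-off described above.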

[PATCH V13 04/14] KVM: selftests: Add guest Intel PT test

Posted by Adrian Hunter 1 year, 3 months ago

Add a test that starts Intel PT traces on host and guest. The test requires
support for Intel PT and having Host/Guest mode enabled i.e. kvm_intel
module parameter pt_mode=1.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |   1 +
 tools/testing/selftests/kvm/x86_64/intel_pt.c | 381 ++++++++++++++++++
 3 files changed, 383 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/intel_pt.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 960cf6a77198..625222f348e4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_ipi
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush
+TEST_GEN_PROGS_x86_64 += x86_64/intel_pt
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
 TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index e247f99e0473..808a23ec4160 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -161,6 +161,7 @@ struct kvm_x86_cpu_feature {
 #define	X86_FEATURE_PCOMMIT		KVM_X86_CPU_FEATURE(0x7, 0, EBX, 22)
 #define	X86_FEATURE_CLFLUSHOPT		KVM_X86_CPU_FEATURE(0x7, 0, EBX, 23)
 #define	X86_FEATURE_CLWB		KVM_X86_CPU_FEATURE(0x7, 0, EBX, 24)
+#define	X86_FEATURE_INTEL_PT		KVM_X86_CPU_FEATURE(0x7, 0, EBX, 25)
 #define	X86_FEATURE_UMIP		KVM_X86_CPU_FEATURE(0x7, 0, ECX, 2)
 #define	X86_FEATURE_PKU			KVM_X86_CPU_FEATURE(0x7, 0, ECX, 3)
 #define	X86_FEATURE_OSPKE		KVM_X86_CPU_FEATURE(0x7, 0, ECX, 4)
diff --git a/tools/testing/selftests/kvm/x86_64/intel_pt.c b/tools/testing/selftests/kvm/x86_64/intel_pt.c
new file mode 100644
index 000000000000..94753b12936e
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/intel_pt.c
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM guest Intel PT test
+ *
+ * Copyright (C) 2024, Intel Corporation.
+ */
+#include <linux/sizes.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/perf_event.h>
+
+#include <sched.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+#define MEM_GPA			SZ_256M
+/* Set PT_NR_PAGES to 1 to avoid single range errata on some processors */
+#define PT_NR_PAGES		1
+
+#define PT_CPUID_LEAVES		2
+#define PT_CPUID_REGS_NUM	4 /* number of registers (eax, ebx, ecx, edx) */
+
+/* Capability-related code is from the Kernel Intel PT driver */
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_psb_cyc,
+	PT_CAP_ip_filtering,
+	PT_CAP_mtc,
+	PT_CAP_ptwrite,
+	PT_CAP_power_event_trace,
+	PT_CAP_event_trace,
+	PT_CAP_tnt_disable,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_single_range_output,
+	PT_CAP_output_subsys,
+	PT_CAP_payloads_lip,
+	PT_CAP_num_address_ranges,
+	PT_CAP_mtc_periods,
+	PT_CAP_cycle_thresholds,
+	PT_CAP_psb_periods,
+};
+
+#define PT_CAP(_n, _l, _r, _m)						\
+	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,	\
+			    .reg = KVM_ ## _r, .mask = _m }
+
+static struct pt_cap_desc {
+	const char	*name;
+	u32		leaf;
+	u8		reg;
+	u32		mask;
+} pt_caps[] = {
+	PT_CAP(max_subleaf,		0, CPUID_EAX, 0xffffffff),
+	PT_CAP(cr3_filtering,		0, CPUID_EBX, BIT(0)),
+	PT_CAP(psb_cyc,			0, CPUID_EBX, BIT(1)),
+	PT_CAP(ip_filtering,		0, CPUID_EBX, BIT(2)),
+	PT_CAP(mtc,			0, CPUID_EBX, BIT(3)),
+	PT_CAP(ptwrite,			0, CPUID_EBX, BIT(4)),
+	PT_CAP(power_event_trace,	0, CPUID_EBX, BIT(5)),
+	PT_CAP(event_trace,		0, CPUID_EBX, BIT(7)),
+	PT_CAP(tnt_disable,		0, CPUID_EBX, BIT(8)),
+	PT_CAP(topa_output,		0, CPUID_ECX, BIT(0)),
+	PT_CAP(topa_multiple_entries,	0, CPUID_ECX, BIT(1)),
+	PT_CAP(single_range_output,	0, CPUID_ECX, BIT(2)),
+	PT_CAP(output_subsys,		0, CPUID_ECX, BIT(3)),
+	PT_CAP(payloads_lip,		0, CPUID_ECX, BIT(31)),
+	PT_CAP(num_address_ranges,	1, CPUID_EAX, 0x7),
+	PT_CAP(mtc_periods,		1, CPUID_EAX, 0xffff0000),
+	PT_CAP(cycle_thresholds,	1, CPUID_EBX, 0xffff),
+	PT_CAP(psb_periods,		1, CPUID_EBX, 0xffff0000),
+};
+
+static u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities capability)
+{
+	struct pt_cap_desc *cd = &pt_caps[capability];
+	u32 c = caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
+	unsigned int shift = __ffs(cd->mask);
+
+	return (c & cd->mask) >> shift;
+}
+
+static int calc_psb_freq(u32 *caps, u64 *psb_freq)
+{
+	u64 allowed;
+
+	if (!(intel_pt_validate_cap(caps, PT_CAP_psb_cyc)))
+		return 0; /* PSBFreq not supported */
+
+	allowed = intel_pt_validate_cap(caps, PT_CAP_psb_periods);
+	if (!allowed)
+		return -1;
+
+	/* Select biggest period */
+	*psb_freq = __fls(allowed) << RTIT_CTL_PSB_FREQ_OFFSET;
+
+	return 0;
+}
+
+static u64 guest_psb_freq(u32 *caps)
+{
+	u64 psb_freq = 0;
+
+	GUEST_ASSERT(!calc_psb_freq(caps, &psb_freq));
+
+	return psb_freq;
+}
+
+static u64 host_psb_freq(u32 *caps)
+{
+	u64 psb_freq = 0;
+
+	TEST_ASSERT(!calc_psb_freq(caps, &psb_freq), "No valid PSBFreq");
+
+	return psb_freq;
+}
+
+static void read_caps(u32 *caps)
+{
+	for (int i = 0; i < PT_CPUID_LEAVES; i++) {
+		__cpuid(0x14, i,
+			&caps[KVM_CPUID_EAX + i * PT_CPUID_REGS_NUM],
+			&caps[KVM_CPUID_EBX + i * PT_CPUID_REGS_NUM],
+			&caps[KVM_CPUID_ECX + i * PT_CPUID_REGS_NUM],
+			&caps[KVM_CPUID_EDX + i * PT_CPUID_REGS_NUM]);
+	}
+}
+
+static void guest_code(void)
+{
+	u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
+	u64 status;
+
+	GUEST_ASSERT(this_cpu_has(X86_FEATURE_INTEL_PT));
+
+	read_caps(caps);
+
+	/* Config PT buffer */
+	wrmsr(MSR_IA32_RTIT_OUTPUT_MASK, PT_NR_PAGES * PAGE_SIZE - 1);
+	wrmsr(MSR_IA32_RTIT_OUTPUT_BASE, MEM_GPA);
+
+	/* Start tracing */
+	wrmsr(MSR_IA32_RTIT_CTL, RTIT_CTL_TRACEEN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TSC_EN |
+				 RTIT_CTL_BRANCH_EN | guest_psb_freq(caps));
+
+	GUEST_ASSERT(rdmsr(MSR_IA32_RTIT_CTL) & RTIT_CTL_TRACEEN);
+
+	/*
+	 * Test repeated VM-Exit / VM-Entry. PAGE_SIZE to match aux_watermark,
+	 * refer to the handling of UCALL_SYNC.
+	 */
+	for (int i = 0; i < PAGE_SIZE; i++)
+		GUEST_SYNC(i);
+
+	/* Stop tracing */
+	wrmsr(MSR_IA32_RTIT_CTL, 0);
+
+	status = rdmsr(MSR_IA32_RTIT_STATUS);
+
+	GUEST_ASSERT(!(status & (RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED)));
+
+	GUEST_DONE();
+}
+
+static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
+			    int group_fd, unsigned long flags)
+{
+	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+static int read_sysfs(const char *file_path, unsigned int *val)
+{
+	FILE *f = fopen(file_path, "r");
+	int ret;
+
+	if (!f)
+		return -1;
+
+	ret = fscanf(f, "%u", val);
+
+	fclose(f);
+
+	return ret == 1 ? 0 : -1;
+}
+
+#define PT_CONFIG_PASS_THRU	1
+
+static int do_open_pt(u32 *caps, unsigned int type)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(attr),
+		.type = type,
+		.config = PT_CONFIG_PASS_THRU | RTIT_CTL_BRANCH_EN | host_psb_freq(caps),
+		.sample_period = 1,
+		.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_CPU |
+			       PERF_SAMPLE_TIME | PERF_SAMPLE_IDENTIFIER,
+		.exclude_kernel = 1,
+		.exclude_user = 0,
+		.exclude_hv = 1,
+		.sample_id_all = 1,
+		.exclude_guest = 1,
+		.aux_watermark = PAGE_SIZE,
+	};
+
+	return perf_event_open(&attr, 0, -1, -1, 0);
+}
+
+static int open_pt(u32 *caps)
+{
+	unsigned int type;
+	int err;
+
+	err = read_sysfs("/sys/bus/event_source/devices/intel_pt/type", &type);
+	if (err)
+		return -1;
+
+	return do_open_pt(caps, type);
+}
+
+#define PERF_HOST_BUF_SZ	(4 * PAGE_SIZE)
+#define PERF_HOST_MMAP_SZ	(PERF_HOST_BUF_SZ + PAGE_SIZE)
+#define PT_HOST_BUF_SZ		(2 * PAGE_SIZE)
+
+struct perf_info {
+	int fd;
+	void *perf_buf;
+	void *pt_buf;
+};
+
+static int perf_open(struct perf_info *pi)
+{
+	u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
+	struct perf_event_mmap_page *pc;
+
+	read_caps(caps);
+
+	pi->fd = open_pt(caps);
+	if (pi->fd < 0)
+		goto out_err;
+
+	/* mmap host buffer and user page */
+	pi->perf_buf = mmap(NULL, PERF_HOST_MMAP_SZ, PROT_READ | PROT_WRITE,
+			    MAP_SHARED, pi->fd, 0);
+	if (pi->perf_buf == MAP_FAILED)
+		goto out_close;
+
+	pc = pi->perf_buf;
+	pc->aux_offset = PERF_HOST_MMAP_SZ;
+	pc->aux_size = PT_HOST_BUF_SZ;
+
+	/* mmap pt buffer */
+	pi->pt_buf = mmap(NULL, PT_HOST_BUF_SZ, PROT_READ | PROT_WRITE,
+			  MAP_SHARED, pi->fd, PERF_HOST_MMAP_SZ);
+	if (pi->pt_buf == MAP_FAILED)
+		goto out_munmap;
+
+	return 0;
+
+out_munmap:
+	munmap(pi->perf_buf, PERF_HOST_MMAP_SZ);
+out_close:
+	close(pi->fd);
+	pi->fd = -1;
+out_err:
+	TEST_FAIL("Failed to start Intel PT tracing on host");
+	return -1;
+}
+
+static void perf_close(struct perf_info *pi)
+{
+	if (pi->fd < 0)
+		return;
+
+	munmap(pi->pt_buf, PT_HOST_BUF_SZ);
+	munmap(pi->perf_buf, PERF_HOST_MMAP_SZ);
+	close(pi->fd);
+}
+
+static void perf_forward(struct perf_info *pi)
+{
+	volatile struct perf_event_mmap_page *pc = pi->perf_buf;
+
+	if (pi->fd < 0)
+		return;
+
+	/* Must stop to ensure aux_head is up to date */
+	ioctl(pi->fd, PERF_EVENT_IOC_DISABLE, 0);
+
+	/* Discard all trace data */
+	pc->data_tail = pc->data_head;
+	pc->aux_tail = pc->aux_head;
+
+	/* Start after setting aux_tail */
+	ioctl(pi->fd, PERF_EVENT_IOC_ENABLE, 0);
+}
+
+/* Use volatile to discourage the compiler from unrolling the loop */
+volatile int loop_spin;
+
+static void run_vcpu(struct kvm_vcpu *vcpu, struct perf_info *pi)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_PRINTF:
+			pr_info("%s", uc.buffer);
+			break;
+		case UCALL_SYNC:
+			/*
+			 * Empty the buffer and spin to add trace data in ever
+			 * increasing amounts, which will cause the host PMI to
+			 * more likely happen somewhere sensitive prior to
+			 * VM-Entry.
+			 */
+			perf_forward(pi);
+			for (int cnt = 0; cnt < uc.args[1]; cnt++)
+				for (loop_spin = 0; loop_spin < 5; loop_spin++)
+					cpu_relax();
+			break;
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			TEST_FAIL("Unknown ucall %lu exit reason: %s",
+				  uc.cmd, exit_reason_str(vcpu->run->exit_reason));
+			break;
+		}
+	}
+}
+
+#define PT_CAP_SINGLE_RANGE_OUTPUT \
+	KVM_X86_CPU_FEATURE(0x14, 0, ECX, 2)
+
+int main(int argc, char *argv[])
+{
+	struct perf_info pi = {.fd = -1};
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	/*
+	 * Guest X86_FEATURE_INTEL_PT depends on Intel PT support and kvm_intel
+	 * module parameter pt_mode=1.
+	 */
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_INTEL_PT));
+
+	/*
+	 * Only using single-range for now. Currently only BDW does not support it, but
+	 * BDW also doesn't support PT in VMX operation anyway.
+	 */
+	TEST_REQUIRE(vcpu_cpuid_has(vcpu, PT_CAP_SINGLE_RANGE_OUTPUT));
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, MEM_GPA, 1, PT_NR_PAGES, 0);
+
+	perf_open(&pi);
+
+	run_vcpu(vcpu, &pi);
+
+	perf_close(&pi);
+
+	kvm_vm_free(vm);
+
+	return 0;
+}
-- 
2.43.0
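
The capability lookup in the test above boils down to "mask the CPUID
register, then shift down by the position of the mask's lowest set bit", and
the PSB period choice to "take the highest set bit of the allowed bitmap". A
standalone sketch of just that arithmetic follows. The helper names are
hypothetical, and the GCC/Clang builtins __builtin_ctz() and __builtin_clz()
are assumed as stand-ins for the kernel's __ffs() and __fls().

```c
#include <stdint.h>

/* Index of the lowest set bit; mask must be non-zero (stand-in for __ffs()) */
static inline unsigned int lowest_bit(uint32_t mask)
{
	return (unsigned int)__builtin_ctz(mask);
}

/* Index of the highest set bit; mask must be non-zero (stand-in for __fls()) */
static inline unsigned int highest_bit(uint32_t mask)
{
	return 31u - (unsigned int)__builtin_clz(mask);
}

/* Extract the field selected by mask, right-aligned */
static inline uint32_t extract_field(uint32_t reg, uint32_t mask)
{
	return (reg & mask) >> lowest_bit(mask);
}

/*
 * Pick the largest supported period from a bitmap of allowed encodings.
 * Returns 0 and stores the encoding in *idx, or -1 if nothing is allowed.
 */
static inline int pick_largest_period(uint32_t allowed, unsigned int *idx)
{
	if (!allowed)
		return -1;

	*idx = highest_bit(allowed);
	return 0;
}
```

For example, with a made-up register value of 0x003f0002, the mask 0xffff0000
yields 0x3f and the mask 0x7 yields 2; a bitmap of 0x3f then selects
encoding 5.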

[PATCH V13 05/14] perf/core: Add aux_pause, aux_resume, aux_start_paused

Posted by Adrian Hunter 1 year, 3 months ago

Hardware traces, such as instruction traces, can produce a vast amount of
trace data, so being able to reduce tracing to more specific circumstances
can be useful.

The ability to pause or resume tracing when another event happens can do
that.

Add ability for an event to "pause" or "resume" AUX area tracing.

Add aux_pause bit to perf_event_attr to indicate that, if the event
happens, the associated AUX area tracing should be paused. Ditto
aux_resume. Do not allow aux_pause and aux_resume to be set together.

Add aux_start_paused bit to perf_event_attr to indicate to an AUX area
event that it should start in a "paused" state.

Add aux_paused to struct hw_perf_event for AUX area events to keep track of
the "paused" state. aux_paused is initialized to aux_start_paused.

Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start()
callbacks. Call as needed, during __perf_event_output(). Add
aux_in_pause_resume to struct perf_buffer to prevent races with the NMI
handler. Pause/resume in NMI context will miss out if it coincides with
another pause/resume.

To use aux_pause or aux_resume, an event must be in a group with the AUX
area event as the group leader.
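
The intended semantics can be modelled in a few lines of plain C. This is a
sketch for illustration only, not the kernel implementation: union aux_action
here is a local mirror of the proposed perf_event_attr bits, and struct
aux_tracer, aux_action_valid(), aux_tracer_init() and aux_tracer_event() are
hypothetical names. The model only shows that aux_pause and aux_resume are
mutually exclusive and that the PMU is poked only on a state change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the proposed bits; the real ones live in perf_event_attr */
union aux_action {
	uint32_t raw;
	struct {
		uint32_t aux_start_paused : 1,	/* start AUX area tracing paused */
			 aux_pause        : 1,	/* on overflow, pause tracing */
			 aux_resume       : 1,	/* on overflow, resume tracing */
			 reserved         : 29;
	};
};

/* aux_pause and aux_resume must not be set together */
static inline bool aux_action_valid(union aux_action a)
{
	return !(a.aux_pause && a.aux_resume);
}

/* Hypothetical model of the AUX area event (the group leader) */
struct aux_tracer {
	bool paused;
	unsigned int stops;	/* models ->stop() with PERF_EF_PAUSE */
	unsigned int starts;	/* models ->start() with PERF_EF_RESUME */
};

/* aux_paused is initialized from the leader's aux_start_paused */
static inline void aux_tracer_init(struct aux_tracer *t, union aux_action leader)
{
	t->paused = leader.aux_start_paused;
	t->stops = 0;
	t->starts = 0;
}

/* A grouped event fired: pause or resume only if the state changes */
static inline void aux_tracer_event(struct aux_tracer *t, union aux_action ev)
{
	if (ev.aux_pause && !t->paused) {
		t->paused = true;
		t->stops++;
	} else if (ev.aux_resume && t->paused) {
		t->paused = false;
		t->starts++;
	}
}
```

In this model a second "resume" while already tracing is a no-op, which
mirrors the aux_paused check described above.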

Example (requires Intel PT and tools patches also):

 $ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
 Linux
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.043 MB perf.data ]
 $ perf script --call-trace
 uname   30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
 uname   30805 [000] 24001.058784424:  psb offs: 0
 uname   30805 [000] 24001.058784424:  cbr: 39 freq: 3904 MHz (139%)
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        debug_smp_processor_id
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        __x64_sys_newuname
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            down_read
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                __cond_resched
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                    in_lock_functions
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_sub
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            up_read
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    in_lock_functions
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                preempt_count_sub
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            _copy_to_user
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])        syscall_exit_to_user_mode
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            syscall_exit_work
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                perf_syscall_exit
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_trace_buf_alloc
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_get_recursion_context
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_tp_event
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_trace_buf_update
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            tracing_gen_ctx_irq_test
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_event
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            __perf_event_account_interrupt
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                __this_cpu_preempt_check
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            perf_event_output_forward
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                perf_event_aux_pause
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                    ring_buffer_get
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_lock
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_unlock
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                    pt_event_stop
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        native_write_msr
 uname   30805 [000] 24001.058785463: ([kernel.kallsyms])                                        native_write_msr
 uname   30805 [000] 24001.058785639: 0x0

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: James Clark <james.clark@arm.com>
---


Changes in V13:
	Do aux_resume at the end of __perf_event_overflow() so as to trace
	less of perf itself

Changes in V12:
	Rebase on current tip

Changes in V11:
	Make assignment to event->hw.aux_paused conditional on
	(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE).

Changes in V10:
	Move aux_paused into a union within struct hw_perf_event.
	Additional comment wrt PERF_EF_PAUSE/PERF_EF_RESUME.
	Factor out has_aux_action() as an inline function.
	Use scoped_guard for irqsave.
	Move calls of perf_event_aux_pause() from __perf_event_output()
	to __perf_event_overflow().

Changes in V9:
	Move aux_paused to struct hw_perf_event

Changes in V6:
	Removed READ/WRITE_ONCE from __perf_event_aux_pause()
	Expanded comment about guarding against NMI

Changes in V5:
	Added James' Ack

Changes in V4:
	Rename aux_output_cfg -> aux_action
	Reorder aux_action bits from:
		aux_pause, aux_resume, aux_start_paused
	to:
		aux_start_paused, aux_pause, aux_resume
	Fix aux_action bits __u64 -> __u32


 include/linux/perf_event.h      | 28 ++++++++++++
 include/uapi/linux/perf_event.h | 11 ++++-
 kernel/events/core.c            | 75 +++++++++++++++++++++++++++++++--
 kernel/events/internal.h        |  1 +
 4 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..91b310052a7c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -170,6 +170,12 @@ struct hw_perf_event {
 		};
 		struct { /* aux / Intel-PT */
 			u64		aux_config;
+			/*
+			 * For AUX area events, aux_paused cannot be a state
+			 * flag because it can be updated asynchronously to
+			 * state.
+			 */
+			unsigned int	aux_paused;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
@@ -294,6 +300,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
 #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
+#define PERF_PMU_CAP_AUX_PAUSE			0x0200
 
 /**
  * pmu::scope
@@ -384,6 +391,8 @@ struct pmu {
 #define PERF_EF_START	0x01		/* start the counter when adding    */
 #define PERF_EF_RELOAD	0x02		/* reload the counter when starting */
 #define PERF_EF_UPDATE	0x04		/* update the counter when stopping */
+#define PERF_EF_PAUSE	0x08		/* AUX area event, pause tracing */
+#define PERF_EF_RESUME	0x10		/* AUX area event, resume tracing */
 
 	/*
 	 * Adds/Removes a counter to/from the PMU, can be done inside a
@@ -423,6 +432,18 @@ struct pmu {
 	 *
 	 * ->start() with PERF_EF_RELOAD will reprogram the counter
 	 *  value, must be preceded by a ->stop() with PERF_EF_UPDATE.
+	 *
+	 * ->stop() with PERF_EF_PAUSE will stop as simply as possible. Will not
+	 * overlap another ->stop() with PERF_EF_PAUSE nor ->start() with
+	 * PERF_EF_RESUME.
+	 *
+	 * ->start() with PERF_EF_RESUME will start as simply as possible but
+	 * only if the counter is not otherwise stopped. Will not overlap
+	 * another ->start() with PERF_EF_RESUME nor ->stop() with
+	 * PERF_EF_PAUSE.
+	 *
+	 * Notably, PERF_EF_PAUSE/PERF_EF_RESUME *can* be concurrent with other
+	 * ->stop()/->start() invocations, just not itself.
 	 */
 	void (*start)			(struct perf_event *event, int flags);
 	void (*stop)			(struct perf_event *event, int flags);
@@ -1679,6 +1700,13 @@ static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool has_aux_action(struct perf_event *event)
+{
+	return event->attr.aux_sample_size ||
+	       event->attr.aux_pause ||
+	       event->attr.aux_resume;
+}
+
 static inline bool is_write_backward(struct perf_event *event)
 {
 	return !!event->attr.write_backward;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..0524d541d4e3 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -511,7 +511,16 @@ struct perf_event_attr {
 	__u16	sample_max_stack;
 	__u16	__reserved_2;
 	__u32	aux_sample_size;
-	__u32	__reserved_3;
+
+	union {
+		__u32	aux_action;
+		struct {
+			__u32	aux_start_paused :  1, /* start AUX area tracing paused */
+				aux_pause        :  1, /* on overflow, pause AUX area tracing */
+				aux_resume       :  1, /* on overflow, resume AUX area tracing */
+				__reserved_3     : 29;
+		};
+	};
 
 	/*
 	 * User provided data if sigtrap=1, passed back to user via
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4287cb..0e9cfe6f3535 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2146,7 +2146,7 @@ static void perf_put_aux_event(struct perf_event *event)
 
 static bool perf_need_aux_event(struct perf_event *event)
 {
-	return !!event->attr.aux_output || !!event->attr.aux_sample_size;
+	return event->attr.aux_output || has_aux_action(event);
 }
 
 static int perf_get_aux_event(struct perf_event *event,
@@ -2171,6 +2171,10 @@ static int perf_get_aux_event(struct perf_event *event,
 	    !perf_aux_output_match(event, group_leader))
 		return 0;
 
+	if ((event->attr.aux_pause || event->attr.aux_resume) &&
+	    !(group_leader->pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE))
+		return 0;
+
 	if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux)
 		return 0;
 
@@ -8016,6 +8020,49 @@ void perf_prepare_header(struct perf_event_header *header,
 	WARN_ON_ONCE(header->size & 7);
 }
 
+static void __perf_event_aux_pause(struct perf_event *event, bool pause)
+{
+	if (pause) {
+		if (!event->hw.aux_paused) {
+			event->hw.aux_paused = 1;
+			event->pmu->stop(event, PERF_EF_PAUSE);
+		}
+	} else {
+		if (event->hw.aux_paused) {
+			event->hw.aux_paused = 0;
+			event->pmu->start(event, PERF_EF_RESUME);
+		}
+	}
+}
+
+static void perf_event_aux_pause(struct perf_event *event, bool pause)
+{
+	struct perf_buffer *rb;
+
+	if (WARN_ON_ONCE(!event))
+		return;
+
+	rb = ring_buffer_get(event);
+	if (!rb)
+		return;
+
+	scoped_guard (irqsave) {
+		/*
+		 * Guard against self-recursion here. Another event could trip
+		 * this same path from NMI context.
+		 */
+		if (READ_ONCE(rb->aux_in_pause_resume))
+			break;
+
+		WRITE_ONCE(rb->aux_in_pause_resume, 1);
+		barrier();
+		__perf_event_aux_pause(event, pause);
+		barrier();
+		WRITE_ONCE(rb->aux_in_pause_resume, 0);
+	}
+	ring_buffer_put(rb);
+}
+
 static __always_inline int
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
@@ -9818,9 +9865,12 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	ret = __perf_event_account_interrupt(event, throttle);
 
+	if (event->attr.aux_pause)
+		perf_event_aux_pause(event->aux_event, true);
+
 	if (event->prog && event->prog->type == BPF_PROG_TYPE_PERF_EVENT &&
 	    !bpf_overflow_handler(event, data, regs))
-		return ret;
+		goto out;
 
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
@@ -9882,6 +9932,9 @@ static int __perf_event_overflow(struct perf_event *event,
 		event->pending_wakeup = 1;
 		irq_work_queue(&event->pending_irq);
 	}
+out:
+	if (event->attr.aux_resume)
+		perf_event_aux_pause(event->aux_event, false);
 
 	return ret;
 }
@@ -12273,11 +12326,25 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	}
 
 	if (event->attr.aux_output &&
-	    !(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT)) {
+	    (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) ||
+	     event->attr.aux_pause || event->attr.aux_resume)) {
 		err = -EOPNOTSUPP;
 		goto err_pmu;
 	}
 
+	if (event->attr.aux_pause && event->attr.aux_resume) {
+		err = -EINVAL;
+		goto err_pmu;
+	}
+
+	if (event->attr.aux_start_paused) {
+		if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) {
+			err = -EOPNOTSUPP;
+			goto err_pmu;
+		}
+		event->hw.aux_paused = 1;
+	}
+
 	if (cgroup_fd != -1) {
 		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
 		if (err)
@@ -13073,7 +13140,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	 * Grouping is not supported for kernel events, neither is 'AUX',
 	 * make sure the caller's intentions are adjusted.
 	 */
-	if (attr->aux_output)
+	if (attr->aux_output || attr->aux_action)
 		return ERR_PTR(-EINVAL);
 
 	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e072d995d670..249288d82b8d 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -52,6 +52,7 @@ struct perf_buffer {
 	void				(*free_aux)(void *);
 	refcount_t			aux_refcount;
 	int				aux_in_sampling;
+	int				aux_in_pause_resume;
 	void				**aux_pages;
 	void				*aux_priv;
 
-- 
2.43.0
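
For readers following the core change, the pause/resume logic above can be modelled in user space. The following is only a sketch: struct demo_event, struct demo_rb and the demo_* functions are hypothetical stand-ins, the PMU callbacks are replaced by counters, and the irqsave guard and compiler barriers are not modelled. Unlike perf_event_aux_pause(), the wrapper returns a bool so that a dropped request is visible to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct demo_event {
	int aux_paused;		/* models event->hw.aux_paused */
	int stop_calls;		/* counts ->stop(event, PERF_EF_PAUSE) */
	int start_calls;	/* counts ->start(event, PERF_EF_RESUME) */
};

struct demo_rb {
	int aux_in_pause_resume;	/* models rb->aux_in_pause_resume */
};

/* Models __perf_event_aux_pause(): act only when the state changes. */
void demo_aux_pause_change(struct demo_event *event, bool pause)
{
	assert(event);

	if (pause) {
		if (!event->aux_paused) {
			event->aux_paused = 1;
			event->stop_calls++;
		}
	} else {
		if (event->aux_paused) {
			event->aux_paused = 0;
			event->start_calls++;
		}
	}
}

/*
 * Models perf_event_aux_pause(). Returns false if the request was dropped
 * because a pause/resume was already in progress on this buffer.
 */
bool demo_aux_pause(struct demo_rb *rb, struct demo_event *event, bool pause)
{
	assert(rb);

	if (rb->aux_in_pause_resume)
		return false;

	rb->aux_in_pause_resume = 1;
	demo_aux_pause_change(event, pause);
	rb->aux_in_pause_resume = 0;
	return true;
}
```

In this model, two pause requests in a row call the stop counter once, and a request that arrives while aux_in_pause_resume is set leaves aux_paused untouched.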

[PATCH V13 06/14] perf/x86/intel/pt: Add support for pause / resume

Posted by Adrian Hunter 1 year, 3 months ago

Prevent tracing from starting if aux_paused.

Implement support for PERF_EF_PAUSE / PERF_EF_RESUME. When aux_paused, stop
tracing. When not aux_paused, only start tracing if it isn't currently
meant to be stopped.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V12:
	Rebase on current tip plus patch set "KVM: x86: Fix Intel PT Host/Guest
	mode when host tracing"

Changes in V9:
	Add more comments and barriers for resume_allowed and
	pause_allowed
	Always use WRITE_ONCE with resume_allowed


 arch/x86/events/intel/pt.c | 69 ++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/pt.h |  4 +++
 2 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index d9469d2d6aa6..b6cfca251c07 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -418,6 +418,9 @@ static void pt_config_start(struct perf_event *event)
 	struct pt *pt = this_cpu_ptr(&pt_ctx);
 	u64 ctl = event->hw.aux_config;
 
+	if (READ_ONCE(event->hw.aux_paused))
+		return;
+
 	ctl |= RTIT_CTL_TRACEEN;
 	if (READ_ONCE(pt->vmx_on))
 		perf_aux_output_flag(&pt->handle, PERF_AUX_FLAG_PARTIAL);
@@ -539,11 +542,23 @@ static void pt_config(struct perf_event *event)
 
 	event->hw.aux_config = reg;
 
+	/*
+	 * Allow resume before starting so as not to overwrite a value set by a
+	 * PMI.
+	 */
+	barrier();
+	WRITE_ONCE(pt->resume_allowed, 1);
 	/* Configuration is complete, it is now OK to handle an NMI */
 	barrier();
 	WRITE_ONCE(pt->handle_nmi, 1);
-
+	barrier();
 	pt_config_start(event);
+	barrier();
+	/*
+	 * Allow pause after starting so its pt_config_stop() doesn't race with
+	 * pt_config_start().
+	 */
+	WRITE_ONCE(pt->pause_allowed, 1);
 }
 
 static void pt_config_stop(struct perf_event *event)
@@ -1526,6 +1541,7 @@ void intel_pt_interrupt(void)
 		buf = perf_aux_output_begin(&pt->handle, event);
 		if (!buf) {
 			event->hw.state = PERF_HES_STOPPED;
+			WRITE_ONCE(pt->resume_allowed, 0);
 			return;
 		}
 
@@ -1534,6 +1550,7 @@ void intel_pt_interrupt(void)
 		ret = pt_buffer_reset_markers(buf, &pt->handle);
 		if (ret) {
 			perf_aux_output_end(&pt->handle, 0);
+			WRITE_ONCE(pt->resume_allowed, 0);
 			return;
 		}
 
@@ -1588,6 +1605,26 @@ static void pt_event_start(struct perf_event *event, int mode)
 	struct pt *pt = this_cpu_ptr(&pt_ctx);
 	struct pt_buffer *buf;
 
+	if (mode & PERF_EF_RESUME) {
+		if (READ_ONCE(pt->resume_allowed)) {
+			u64 status;
+
+			/*
+			 * Only if the trace is not active and the error and
+			 * stopped bits are clear, is it safe to start, but a
+			 * PMI might have just cleared these, so resume_allowed
+			 * must be checked again also.
+			 */
+			rdmsrl(MSR_IA32_RTIT_STATUS, status);
+			if (!(status & (RTIT_STATUS_TRIGGEREN |
+					RTIT_STATUS_ERROR |
+					RTIT_STATUS_STOPPED)) &&
+			   READ_ONCE(pt->resume_allowed))
+				pt_config_start(event);
+		}
+		return;
+	}
+
 	buf = perf_aux_output_begin(&pt->handle, event);
 	if (!buf)
 		goto fail_stop;
@@ -1615,6 +1652,12 @@ static void pt_event_stop(struct perf_event *event, int mode)
 {
 	struct pt *pt = this_cpu_ptr(&pt_ctx);
 
+	if (mode & PERF_EF_PAUSE) {
+		if (READ_ONCE(pt->pause_allowed))
+			pt_config_stop(event);
+		return;
+	}
+
 	/*
 	 * Protect against the PMI racing with disabling wrmsr,
 	 * see comment in intel_pt_interrupt().
@@ -1622,6 +1665,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	WRITE_ONCE(pt->handle_nmi, 0);
 	barrier();
 
+	/*
+	 * Prevent a resume from attempting to restart tracing, or a pause
+	 * during a subsequent start. Do this after clearing handle_nmi so that
+	 * pt_event_snapshot_aux() will not re-allow them.
+	 */
+	WRITE_ONCE(pt->pause_allowed, 0);
+	WRITE_ONCE(pt->resume_allowed, 0);
+	barrier();
+
 	pt_config_stop(event);
 
 	if (event->hw.state == PERF_HES_STOPPED)
@@ -1787,6 +1839,10 @@ static long pt_event_snapshot_aux(struct perf_event *event,
 	if (WARN_ON_ONCE(!buf->snapshot))
 		return 0;
 
+	/* Prevent pause/resume from attempting to start/stop tracing */
+	WRITE_ONCE(pt->pause_allowed, 0);
+	WRITE_ONCE(pt->resume_allowed, 0);
+	barrier();
 	/*
 	 * There is no PT interrupt in this mode, so stop the trace and it will
 	 * remain stopped while the buffer is copied.
@@ -1806,8 +1862,13 @@ static long pt_event_snapshot_aux(struct perf_event *event,
 	 * Here, handle_nmi tells us if the tracing was on.
 	 * If the tracing was on, restart it.
 	 */
-	if (READ_ONCE(pt->handle_nmi))
+	if (READ_ONCE(pt->handle_nmi)) {
+		WRITE_ONCE(pt->resume_allowed, 1);
+		barrier();
 		pt_config_start(event);
+		barrier();
+		WRITE_ONCE(pt->pause_allowed, 1);
+	}
 
 	return ret;
 }
@@ -1923,7 +1984,9 @@ static __init int pt_init(void)
 	if (!intel_pt_validate_hw_cap(PT_CAP_topa_multiple_entries))
 		pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG;
 
-	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+	pt_pmu.pmu.capabilities		|= PERF_PMU_CAP_EXCLUSIVE |
+					   PERF_PMU_CAP_ITRACE |
+					   PERF_PMU_CAP_AUX_PAUSE;
 	pt_pmu.pmu.attr_groups		 = pt_attr_groups;
 	pt_pmu.pmu.task_ctx_nr		 = perf_sw_context;
 	pt_pmu.pmu.event_init		 = pt_event_init;
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 0428019b92f4..480a5a311148 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -119,6 +119,8 @@ struct pt_filters {
  * @filters:		last configured filters
  * @handle_nmi:		do handle PT PMI on this cpu, there's an active event
  * @vmx_on:		1 if VMX is ON on this cpu
+ * @pause_allowed:	PERF_EF_PAUSE is allowed to stop tracing
+ * @resume_allowed:	PERF_EF_RESUME is allowed to start tracing
  * @output_base:	cached RTIT_OUTPUT_BASE MSR value
  * @output_mask:	cached RTIT_OUTPUT_MASK MSR value
  * @status:		cached RTIT_STATUS MSR value
@@ -132,6 +134,8 @@ struct pt {
 	struct pt_filters	filters;
 	int			handle_nmi;
 	int			vmx_on;
+	int			pause_allowed;
+	int			resume_allowed;
 	u64			output_base;
 	u64			output_mask;
 	u64			status;
-- 
2.43.0

[PATCH V13 07/14] perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling

Posted by Adrian Hunter 1 year, 3 months ago

Events with aux actions or aux sampling expect the PMI to coincide with the
event, which does not happen for large PEBS, so do not enable large PEBS in
that case.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V11:
	Remove definition of has_aux_action() because it has
	already been added as an inline function.


 arch/x86/events/intel/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7ca40002a19b..bb284aff7bfd 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3962,8 +3962,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
 
 		if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) {
 			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
-			if (!(event->attr.sample_type &
-			      ~intel_pmu_large_pebs_flags(event))) {
+			if (!(event->attr.sample_type & ~intel_pmu_large_pebs_flags(event)) &&
+			    !has_aux_action(event)) {
 				event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS;
 				event->attach_state |= PERF_ATTACH_SCHED_CB;
 			}
-- 
2.43.0
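
The condition changed by this patch reduces to a simple predicate, shown below as a sketch. struct demo_attr and the demo_* names are hypothetical, and the outer check on freq, wakeup_events and watermark from intel_pmu_hw_config() is deliberately left out; large_pebs_flags stands for whatever intel_pmu_large_pebs_flags() would return for the event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the attribute fields that matter here. */
struct demo_attr {
	uint64_t sample_type;
	uint32_t aux_sample_size;
	bool aux_pause;
	bool aux_resume;
};

/* Models has_aux_action(): AUX sampling or pause/resume was requested. */
bool demo_has_aux_action(const struct demo_attr *attr)
{
	assert(attr);
	return attr->aux_sample_size || attr->aux_pause || attr->aux_resume;
}

/*
 * Models the inner condition: large PEBS is only an option when every
 * requested sample_type bit is covered by large_pebs_flags and the event
 * does not need a PMI that coincides with the event itself.
 */
bool demo_can_use_large_pebs(const struct demo_attr *attr,
			     uint64_t large_pebs_flags)
{
	return !(attr->sample_type & ~large_pebs_flags) &&
	       !demo_has_aux_action(attr);
}
```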

[PATCH V13 08/14] perf tools: Add aux_start_paused, aux_pause and aux_resume

Posted by Adrian Hunter 1 year, 3 months ago

Add struct perf_event_attr members to support pause and resume of AUX area
tracing.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 tools/include/uapi/linux/perf_event.h     | 11 ++++++++++-
 tools/perf/util/perf_event_attr_fprintf.c |  3 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 4842c36fdf80..0524d541d4e3 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -511,7 +511,16 @@ struct perf_event_attr {
 	__u16	sample_max_stack;
 	__u16	__reserved_2;
 	__u32	aux_sample_size;
-	__u32	__reserved_3;
+
+	union {
+		__u32	aux_action;
+		struct {
+			__u32	aux_start_paused :  1, /* start AUX area tracing paused */
+				aux_pause        :  1, /* on overflow, pause AUX area tracing */
+				aux_resume       :  1, /* on overflow, resume AUX area tracing */
+				__reserved_3     : 29;
+		};
+	};
 
 	/*
 	 * User provided data if sigtrap=1, passed back to user via
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 59fbbba79697..29db0aef9a74 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -335,6 +335,9 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(sample_max_stack, p_unsigned);
 	PRINT_ATTRf(aux_sample_size, p_unsigned);
 	PRINT_ATTRf(sig_data, p_unsigned);
+	PRINT_ATTRf(aux_start_paused, p_unsigned);
+	PRINT_ATTRf(aux_pause, p_unsigned);
+	PRINT_ATTRf(aux_resume, p_unsigned);
 
 	return ret;
 }
-- 
2.43.0
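
The new members reuse the 32 bits previously named __reserved_3, so the three flags can be tested individually or all at once through aux_action. A minimal sketch of that aliasing follows; struct demo_aux_attr is a hypothetical cut-down structure, not struct perf_event_attr, and it relies on C11 anonymous unions as implemented by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down structure showing only the two adjacent words. */
struct demo_aux_attr {
	uint32_t aux_sample_size;
	union {
		uint32_t aux_action;	/* all action bits as one word */
		struct {
			uint32_t aux_start_paused :  1,
				 aux_pause        :  1,
				 aux_resume       :  1,
				 reserved         : 29;
		};
	};
};

/* The union must not grow the structure beyond the old reserved word. */
_Static_assert(sizeof(struct demo_aux_attr) == 2 * sizeof(uint32_t),
	       "aux_action union changes the structure size");

/* Non-zero if any action flag is set, tested through the aliasing word. */
int demo_any_aux_action(const struct demo_aux_attr *attr)
{
	assert(attr);
	return attr->aux_action != 0;
}
```

This is the same test that perf_event_create_kernel_counter() uses in patch 05, where attr->aux_action rejects any of the three flags with a single comparison.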

[PATCH V13 09/14] perf tools: Add aux-action config term

Posted by Adrian Hunter 1 year, 3 months ago

Add a new common config term "aux-action" to use for configuring AUX area
trace pause / resume. The value is a string that will be parsed in a
subsequent patch.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V7:

	Add aux-action to perf_pmu__for_each_format


 tools/perf/util/evsel.c        |  2 ++
 tools/perf/util/evsel_config.h |  1 +
 tools/perf/util/parse-events.c | 10 ++++++++++
 tools/perf/util/parse-events.h |  1 +
 tools/perf/util/parse-events.l |  1 +
 tools/perf/util/pmu.c          |  1 +
 6 files changed, 16 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b221459439b8..8155a25554ea 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1015,6 +1015,8 @@ static void evsel__apply_config_terms(struct evsel *evsel,
 		case EVSEL__CONFIG_TERM_AUX_OUTPUT:
 			attr->aux_output = term->val.aux_output ? 1 : 0;
 			break;
+		case EVSEL__CONFIG_TERM_AUX_ACTION:
+			break;
 		case EVSEL__CONFIG_TERM_AUX_SAMPLE_SIZE:
 			/* Already applied by auxtrace */
 			break;
diff --git a/tools/perf/util/evsel_config.h b/tools/perf/util/evsel_config.h
index aee6f808b512..af52a1516d0b 100644
--- a/tools/perf/util/evsel_config.h
+++ b/tools/perf/util/evsel_config.h
@@ -25,6 +25,7 @@ enum evsel_term_type {
 	EVSEL__CONFIG_TERM_BRANCH,
 	EVSEL__CONFIG_TERM_PERCORE,
 	EVSEL__CONFIG_TERM_AUX_OUTPUT,
+	EVSEL__CONFIG_TERM_AUX_ACTION,
 	EVSEL__CONFIG_TERM_AUX_SAMPLE_SIZE,
 	EVSEL__CONFIG_TERM_CFG_CHG,
 };
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index e96cf13dc396..428c1cce73f7 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -827,6 +827,7 @@ static const char *config_term_name(enum parse_events__term_type term_type)
 		[PARSE_EVENTS__TERM_TYPE_DRV_CFG]		= "driver-config",
 		[PARSE_EVENTS__TERM_TYPE_PERCORE]		= "percore",
 		[PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT]		= "aux-output",
+		[PARSE_EVENTS__TERM_TYPE_AUX_ACTION]		= "aux-action",
 		[PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE]	= "aux-sample-size",
 		[PARSE_EVENTS__TERM_TYPE_METRIC_ID]		= "metric-id",
 		[PARSE_EVENTS__TERM_TYPE_RAW]                   = "raw",
@@ -876,6 +877,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
 	case PARSE_EVENTS__TERM_TYPE_OVERWRITE:
 	case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
 	case PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT:
+	case PARSE_EVENTS__TERM_TYPE_AUX_ACTION:
 	case PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE:
 	case PARSE_EVENTS__TERM_TYPE_RAW:
 	case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
@@ -995,6 +997,9 @@ do {									   \
 	case PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT:
 		CHECK_TYPE_VAL(NUM);
 		break;
+	case PARSE_EVENTS__TERM_TYPE_AUX_ACTION:
+		CHECK_TYPE_VAL(STR);
+		break;
 	case PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE:
 		CHECK_TYPE_VAL(NUM);
 		if (term->val.num > UINT_MAX) {
@@ -1113,6 +1118,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
 	case PARSE_EVENTS__TERM_TYPE_OVERWRITE:
 	case PARSE_EVENTS__TERM_TYPE_NOOVERWRITE:
 	case PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT:
+	case PARSE_EVENTS__TERM_TYPE_AUX_ACTION:
 	case PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE:
 		return config_term_common(attr, term, err);
 	case PARSE_EVENTS__TERM_TYPE_USER:
@@ -1248,6 +1254,9 @@ do {								\
 			ADD_CONFIG_TERM_VAL(AUX_OUTPUT, aux_output,
 					    term->val.num ? 1 : 0, term->weak);
 			break;
+		case PARSE_EVENTS__TERM_TYPE_AUX_ACTION:
+			ADD_CONFIG_TERM_STR(AUX_ACTION, term->val.str, term->weak);
+			break;
 		case PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE:
 			ADD_CONFIG_TERM_VAL(AUX_SAMPLE_SIZE, aux_sample_size,
 					    term->val.num, term->weak);
@@ -1310,6 +1319,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
 		case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
 		case PARSE_EVENTS__TERM_TYPE_PERCORE:
 		case PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT:
+		case PARSE_EVENTS__TERM_TYPE_AUX_ACTION:
 		case PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE:
 		case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
 		case PARSE_EVENTS__TERM_TYPE_RAW:
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 2b52f8d6aa29..8dd480b1d016 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -74,6 +74,7 @@ enum parse_events__term_type {
 	PARSE_EVENTS__TERM_TYPE_DRV_CFG,
 	PARSE_EVENTS__TERM_TYPE_PERCORE,
 	PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT,
+	PARSE_EVENTS__TERM_TYPE_AUX_ACTION,
 	PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE,
 	PARSE_EVENTS__TERM_TYPE_METRIC_ID,
 	PARSE_EVENTS__TERM_TYPE_RAW,
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 5a0bcd7f166a..6fa4b74fe0c3 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -329,6 +329,7 @@ overwrite		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_OVERWRITE); }
 no-overwrite		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_NOOVERWRITE); }
 percore			{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_PERCORE); }
 aux-output		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
+aux-action		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_ACTION); }
 aux-sample-size		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
 metric-id		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
 cpu-cycles|cycles				{ return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 8993b5853687..5e1fea26cafb 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1741,6 +1741,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
 		"no-overwrite",
 		"percore",
 		"aux-output",
+		"aux-action=(pause|resume|start-paused)",
 		"aux-sample-size=number",
 	};
 	struct perf_pmu_format *format;
-- 
2.43.0

[PATCH V13 10/14] perf tools: Parse aux-action

Posted by Adrian Hunter 1 year, 3 months ago

Add parsing for aux-action to accept "pause", "resume" or "start-paused"
values.

"start-paused" is valid only for AUX area events.

"pause" and "resume" are valid only for events grouped with an AUX area
event as the group leader.  However, as with aux-output, events that are
not currently in a group are grouped automatically, provided the AUX area
event precedes them.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V8:
	Fix clang warning:
	     util/auxtrace.c:821:7: error: missing field 'aux_action' initializer [-Werror,-Wmissing-field-initializers]
	     821 |         {NULL},
	         |              ^


 tools/perf/Documentation/perf-record.txt |  4 ++
 tools/perf/builtin-record.c              |  4 +-
 tools/perf/util/auxtrace.c               | 67 ++++++++++++++++++++++--
 tools/perf/util/auxtrace.h               |  6 ++-
 tools/perf/util/evsel.c                  |  1 +
 5 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 242223240a08..80686d590de2 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -68,6 +68,10 @@ OPTIONS
 		    like this: name=\'CPU_CLK_UNHALTED.THREAD:cmask=0x1\'.
 	  - 'aux-output': Generate AUX records instead of events. This requires
 			  that an AUX area event is also provided.
+	  - 'aux-action': "pause" or "resume" to pause or resume an AUX
+			  area event (the group leader) when this event occurs.
+			  "start-paused" on an AUX area event itself causes
+			  it to start in a paused state.
 	  - 'aux-sample-size': Set sample size for AUX area sampling. If the
 	  '--aux-sample' option has been used, set aux-sample-size=0 to disable
 	  AUX area sampling for the event.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index adbaf80b398c..a7afde2fbebc 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -860,7 +860,9 @@ static int record__auxtrace_init(struct record *rec)
 	if (err)
 		return err;
 
-	auxtrace_regroup_aux_output(rec->evlist);
+	err = auxtrace_parse_aux_action(rec->evlist);
+	if (err)
+		return err;
 
 	return auxtrace_parse_filters(rec->evlist);
 }
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index ca8682966fae..4d1633d87eff 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -810,19 +810,76 @@ int auxtrace_parse_sample_options(struct auxtrace_record *itr,
 	return auxtrace_validate_aux_sample_size(evlist, opts);
 }
 
-void auxtrace_regroup_aux_output(struct evlist *evlist)
+static struct aux_action_opt {
+	const char *str;
+	u32 aux_action;
+	bool aux_event_opt;
+} aux_action_opts[] = {
+	{"start-paused", BIT(0), true},
+	{"pause",        BIT(1), false},
+	{"resume",       BIT(2), false},
+	{.str = NULL},
+};
+
+static const struct aux_action_opt *auxtrace_parse_aux_action_str(const char *str)
+{
+	const struct aux_action_opt *opt;
+
+	if (!str)
+		return NULL;
+
+	for (opt = aux_action_opts; opt->str; opt++)
+		if (!strcmp(str, opt->str))
+			return opt;
+
+	return NULL;
+}
+
+int auxtrace_parse_aux_action(struct evlist *evlist)
 {
-	struct evsel *evsel, *aux_evsel = NULL;
 	struct evsel_config_term *term;
+	struct evsel *aux_evsel = NULL;
+	struct evsel *evsel;
 
 	evlist__for_each_entry(evlist, evsel) {
-		if (evsel__is_aux_event(evsel))
+		bool is_aux_event = evsel__is_aux_event(evsel);
+		const struct aux_action_opt *opt;
+
+		if (is_aux_event)
 			aux_evsel = evsel;
-		term = evsel__get_config_term(evsel, AUX_OUTPUT);
+		term = evsel__get_config_term(evsel, AUX_ACTION);
+		if (!term) {
+			if (evsel__get_config_term(evsel, AUX_OUTPUT))
+				goto regroup;
+			continue;
+		}
+		opt = auxtrace_parse_aux_action_str(term->val.str);
+		if (!opt) {
+			pr_err("Bad aux-action '%s'\n", term->val.str);
+			return -EINVAL;
+		}
+		if (opt->aux_event_opt && !is_aux_event) {
+			pr_err("aux-action '%s' can only be used with AUX area event\n",
+			       term->val.str);
+			return -EINVAL;
+		}
+		if (!opt->aux_event_opt && is_aux_event) {
+			pr_err("aux-action '%s' cannot be used for AUX area event itself\n",
+			       term->val.str);
+			return -EINVAL;
+		}
+		evsel->core.attr.aux_action = opt->aux_action;
+regroup:
 		/* If possible, group with the AUX event */
-		if (term && aux_evsel)
+		if (aux_evsel)
 			evlist__regroup(evlist, aux_evsel, evsel);
+		if (!evsel__is_aux_event(evsel__leader(evsel))) {
+			pr_err("Events with aux-action must have AUX area event group leader\n");
+			return -EINVAL;
+		}
 	}
+
+	return 0;
 }
 
 struct auxtrace_record *__weak
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a1895a4f530b..208c15be9221 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -579,7 +579,7 @@ int auxtrace_parse_snapshot_options(struct auxtrace_record *itr,
 int auxtrace_parse_sample_options(struct auxtrace_record *itr,
 				  struct evlist *evlist,
 				  struct record_opts *opts, const char *str);
-void auxtrace_regroup_aux_output(struct evlist *evlist);
+int auxtrace_parse_aux_action(struct evlist *evlist);
 int auxtrace_record__options(struct auxtrace_record *itr,
 			     struct evlist *evlist,
 			     struct record_opts *opts);
@@ -800,8 +800,10 @@ int auxtrace_parse_sample_options(struct auxtrace_record *itr __maybe_unused,
 }
 
 static inline
-void auxtrace_regroup_aux_output(struct evlist *evlist __maybe_unused)
+int auxtrace_parse_aux_action(struct evlist *evlist __maybe_unused)
 {
+	pr_err("AUX area tracing not supported\n");
+	return -EINVAL;
 }
 
 static inline
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8155a25554ea..9621c8c12406 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1016,6 +1016,7 @@ static void evsel__apply_config_terms(struct evsel *evsel,
 			attr->aux_output = term->val.aux_output ? 1 : 0;
 			break;
 		case EVSEL__CONFIG_TERM_AUX_ACTION:
+			/* Already applied by auxtrace */
 			break;
 		case EVSEL__CONFIG_TERM_AUX_SAMPLE_SIZE:
 			/* Already applied by auxtrace */
-- 
2.43.0
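
The parsing rules in this patch can be tried out in isolation with the sketch below. The demo_* names are hypothetical; the table mirrors aux_action_opts[], but the lookup and the two validity checks are folded into one function that returns 0 on any failure instead of printing an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_aux_action_opt {
	const char *str;
	unsigned int aux_action;	/* bit to set in attr.aux_action */
	bool aux_event_opt;		/* valid only on the AUX area event */
};

static const struct demo_aux_action_opt demo_opts[] = {
	{ "start-paused", 1u << 0, true  },
	{ "pause",        1u << 1, false },
	{ "resume",       1u << 2, false },
};

/*
 * Returns the aux_action bit for str, or 0 if str is unknown or is not
 * valid for this kind of event.
 */
unsigned int demo_parse_aux_action(const char *str, bool is_aux_event)
{
	size_t i;

	if (!str)
		return 0;

	for (i = 0; i < sizeof(demo_opts) / sizeof(demo_opts[0]); i++) {
		if (strcmp(str, demo_opts[i].str))
			continue;
		/* start-paused needs an AUX event; pause/resume must not be one */
		if (demo_opts[i].aux_event_opt != is_aux_event)
			return 0;
		return demo_opts[i].aux_action;
	}
	return 0;
}
```

The group-leader requirement enforced after regrouping is not modelled here, since it depends on the event list rather than on the string.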

[PATCH V13 11/14] perf tools: Add missing_features for aux_start_paused, aux_pause, aux_resume

Posted by Adrian Hunter 1 year, 3 months ago

Display "feature is not supported" error message if aux_start_paused,
aux_pause or aux_resume result in a perf_event_open() error.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V13:
	Add error message also in EOPNOTSUPP case (Leo)


 tools/perf/util/evsel.c | 12 ++++++++++++
 tools/perf/util/evsel.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 9621c8c12406..fd28ff5437b5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2177,6 +2177,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
 		perf_missing_features.inherit_sample_read = true;
 		pr_debug2("Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit, falling back to no-inherit.\n");
 		return true;
+	} else if (!perf_missing_features.aux_pause_resume &&
+	    (evsel->core.attr.aux_pause || evsel->core.attr.aux_resume ||
+	     evsel->core.attr.aux_start_paused)) {
+		perf_missing_features.aux_pause_resume = true;
+		pr_debug2_peo("Kernel has no aux_pause/aux_resume support, bailing out\n");
+		return false;
 	} else if (!perf_missing_features.branch_counters &&
 	    (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
 		perf_missing_features.branch_counters = true;
@@ -3397,6 +3403,10 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			return scnprintf(msg, size,
 	"%s: PMU Hardware doesn't support 'aux_output' feature",
 					 evsel__name(evsel));
+		if (evsel->core.attr.aux_action)
+			return scnprintf(msg, size,
+	"%s: PMU Hardware doesn't support 'aux_action' feature",
+					evsel__name(evsel));
 		if (evsel->core.attr.sample_period != 0)
 			return scnprintf(msg, size,
 	"%s: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'",
@@ -3427,6 +3437,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			return scnprintf(msg, size, "clockid feature not supported.");
 		if (perf_missing_features.clockid_wrong)
 			return scnprintf(msg, size, "wrong clockid (%d).", clockid);
+		if (perf_missing_features.aux_pause_resume)
+			return scnprintf(msg, size, "The 'aux_pause / aux_resume' feature is not supported, update the kernel.");
 		if (perf_missing_features.aux_output)
 			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
 		if (!target__has_cpu(target))
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index bd08d94d3f8a..d40df2051718 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -221,6 +221,7 @@ struct perf_missing_features {
 	bool weight_struct;
 	bool read_lost;
 	bool branch_counters;
+	bool aux_pause_resume;
 	bool inherit_sample_read;
 };
 
-- 
2.43.0
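
The errors that this patch turns into messages come from the checks added to perf_event_alloc() in patch 05. A sketch of those checks, in the same order, is given below for reference. struct demo_req, the DEMO_CAP_* values and demo_check_aux_attr() are hypothetical, and the separate group-leader capability check in perf_get_aux_event() is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative capability bits; not the kernel's PERF_PMU_CAP_* values. */
#define DEMO_CAP_AUX_OUTPUT	(1u << 0)
#define DEMO_CAP_AUX_PAUSE	(1u << 1)

/* Hypothetical subset of the requested attribute bits. */
struct demo_req {
	bool aux_output;
	bool aux_start_paused;
	bool aux_pause;
	bool aux_resume;
};

/* Returns 0 if the combination is accepted, or a negative errno. */
int demo_check_aux_attr(const struct demo_req *req, unsigned int pmu_caps)
{
	assert(req);

	/* aux_output cannot be combined with pause or resume */
	if (req->aux_output &&
	    (!(pmu_caps & DEMO_CAP_AUX_OUTPUT) ||
	     req->aux_pause || req->aux_resume))
		return -EOPNOTSUPP;

	/* one event cannot both pause and resume */
	if (req->aux_pause && req->aux_resume)
		return -EINVAL;

	/* start-paused needs a PMU that implements pause/resume */
	if (req->aux_start_paused && !(pmu_caps & DEMO_CAP_AUX_PAUSE))
		return -EOPNOTSUPP;

	return 0;
}
```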

Re: [PATCH V13 11/14] perf tools: Add missing_features for aux_start_paused, aux_pause, aux_resume

Posted by Leo Yan 1 year, 3 months ago


On 10/14/24 11:51, Adrian Hunter wrote:
> 
> 
> Display "feature is not supported" error message if aux_start_paused,
> aux_pause or aux_resume result in a perf_event_open() error.
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Acked-by: Ian Rogers <irogers@google.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>

Reviewed-by: Leo Yan <leo.yan@arm.com>


[PATCH V13 12/14] perf intel-pt: Improve man page format

Posted by Adrian Hunter 1 year, 3 months ago

Improve format of config terms and section references.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-intel-pt.txt | 486 +++++++++++----------
 1 file changed, 267 insertions(+), 219 deletions(-)

diff --git a/tools/perf/Documentation/perf-intel-pt.txt b/tools/perf/Documentation/perf-intel-pt.txt
index 59ab1ff9d75f..ad39bf20f862 100644
--- a/tools/perf/Documentation/perf-intel-pt.txt
+++ b/tools/perf/Documentation/perf-intel-pt.txt
@@ -151,7 +151,7 @@ displayed as follows:
 There are two ways that instructions-per-cycle (IPC) can be calculated depending
 on the recording.
 
-If the 'cyc' config term (see config terms section below) was used, then IPC
+If the 'cyc' config term (see <<_config_terms,config terms>> section below) was used, then IPC
 and cycle events are calculated using the cycle count from CYC packets, otherwise
 MTC packets are used - refer to the 'mtc' config term.  When MTC is used, however,
 the values are less accurate because the timing is less accurate.
@@ -239,7 +239,7 @@ which is the same as
 
 	-e intel_pt/tsc=1,noretcomp=0/
 
-Note there are now new config terms - see section 'config terms' further below.
+Note there are other config terms - see section <<_config_terms,config terms>> further below.
 
 The config terms are listed in /sys/devices/intel_pt/format.  They are bit
 fields within the config member of the struct perf_event_attr which is
@@ -311,217 +311,264 @@ perf_event_attr is displayed if the -vv option is used e.g.
 config terms
 ~~~~~~~~~~~~
 
-The June 2015 version of Intel 64 and IA-32 Architectures Software Developer
-Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features.
-Some of the features are reflect in new config terms.  All the config terms are
-described below.
-
-tsc		Always supported.  Produces TSC timestamp packets to provide
-		timing information.  In some cases it is possible to decode
-		without timing information, for example a per-thread context
-		that does not overlap executable memory maps.
-
-		The default config selects tsc (i.e. tsc=1).
-
-noretcomp	Always supported.  Disables "return compression" so a TIP packet
-		is produced when a function returns.  Causes more packets to be
-		produced but might make decoding more reliable.
-
-		The default config does not select noretcomp (i.e. noretcomp=0).
-
-psb_period	Allows the frequency of PSB packets to be specified.
-
-		The PSB packet is a synchronization packet that provides a
-		starting point for decoding or recovery from errors.
-
-		Support for psb_period is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
-
-		which contains "1" if the feature is supported and "0"
-		otherwise.
-
-		Valid values are given by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/psb_periods
-
-		which contains a hexadecimal value, the bits of which represent
-		valid values e.g. bit 2 set means value 2 is valid.
-
-		The psb_period value is converted to the approximate number of
-		trace bytes between PSB packets as:
-
-			2 ^ (value + 11)
-
-		e.g. value 3 means 16KiB bytes between PSBs
-
-		If an invalid value is entered, the error message
-		will give a list of valid values e.g.
-
-			$ perf record -e intel_pt/psb_period=15/u uname
-			Invalid psb_period for intel_pt. Valid values are: 0-5
-
-		If MTC packets are selected, the default config selects a value
-		of 3 (i.e. psb_period=3) or the nearest lower value that is
-		supported (0 is always supported).  Otherwise the default is 0.
-
-		If decoding is expected to be reliable and the buffer is large
-		then a large PSB period can be used.
-
-		Because a TSC packet is produced with PSB, the PSB period can
-		also affect the granularity to timing information in the absence
-		of MTC or CYC.
-
-mtc		Produces MTC timing packets.
-
-		MTC packets provide finer grain timestamp information than TSC
-		packets.  MTC packets record time using the hardware crystal
-		clock (CTC) which is related to TSC packets using a TMA packet.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/mtc
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
-
-		The frequency of MTC packets can also be specified - see
-		mtc_period below.
-
-mtc_period	Specifies how frequently MTC packets are produced - see mtc
-		above for how to determine if MTC packets are supported.
-
-		Valid values are given by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/mtc_periods
-
-		which contains a hexadecimal value, the bits of which represent
-		valid values e.g. bit 2 set means value 2 is valid.
-
-		The mtc_period value is converted to the MTC frequency as:
-
-			CTC-frequency / (2 ^ value)
-
-		e.g. value 3 means one eighth of CTC-frequency
-
-		Where CTC is the hardware crystal clock, the frequency of which
-		can be related to TSC via values provided in cpuid leaf 0x15.
-
-		If an invalid value is entered, the error message
-		will give a list of valid values e.g.
-
-			$ perf record -e intel_pt/mtc_period=15/u uname
-			Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
-
-		The default value is 3 or the nearest lower value
-		that is supported (0 is always supported).
-
-cyc		Produces CYC timing packets.
-
-		CYC packets provide even finer grain timestamp information than
-		MTC and TSC packets.  A CYC packet contains the number of CPU
-		cycles since the last CYC packet. Unlike MTC and TSC packets,
-		CYC packets are only sent when another packet is also sent.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
-
-		The number of CYC packets produced can be reduced by specifying
-		a threshold - see cyc_thresh below.
-
-cyc_thresh	Specifies how frequently CYC packets are produced - see cyc
-		above for how to determine if CYC packets are supported.
-
-		Valid cyc_thresh values are given by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
-
-		which contains a hexadecimal value, the bits of which represent
-		valid values e.g. bit 2 set means value 2 is valid.
-
-		The cyc_thresh value represents the minimum number of CPU cycles
-		that must have passed before a CYC packet can be sent.  The
-		number of CPU cycles is:
-
-			2 ^ (value - 1)
-
-		e.g. value 4 means 8 CPU cycles must pass before a CYC packet
-		can be sent.  Note a CYC packet is still only sent when another
-		packet is sent, not at, e.g. every 8 CPU cycles.
-
-		If an invalid value is entered, the error message
-		will give a list of valid values e.g.
-
-			$ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
-			Invalid cyc_thresh for intel_pt. Valid values are: 0-12
-
-		CYC packets are not requested by default.
-
-pt		Specifies pass-through which enables the 'branch' config term.
-
-		The default config selects 'pt' if it is available, so a user will
-		never need to specify this term.
-
-branch		Enable branch tracing.  Branch tracing is enabled by default so to
-		disable branch tracing use 'branch=0'.
-
-		The default config selects 'branch' if it is available.
-
-ptw		Enable PTWRITE packets which are produced when a ptwrite instruction
-		is executed.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/ptwrite
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
-
-		As an alternative, refer to "Emulated PTWRITE" further below.
-
-fup_on_ptw	Enable a FUP packet to follow the PTWRITE packet.  The FUP packet
-		provides the address of the ptwrite instruction.  In the absence of
-		fup_on_ptw, the decoder will use the address of the previous branch
-		if branch tracing is enabled, otherwise the address will be zero.
-		Note that fup_on_ptw will work even when branch tracing is disabled.
-
-pwr_evt		Enable power events.  The power events provide information about
-		changes to the CPU C-state.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/power_event_trace
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
-
-event		Enable Event Trace.  The events provide information about asynchronous
-		events.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/event_trace
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
-
-notnt		Disable TNT packets.  Without TNT packets, it is not possible to walk
-		executable code to reconstruct control flow, however FUP, TIP, TIP.PGE
-		and TIP.PGD packets still indicate asynchronous control flow, and (if
-		return compression is disabled - see noretcomp) return statements.
-		The advantage of eliminating TNT packets is reducing the size of the
-		trace and corresponding tracing overhead.
-
-		Support for this feature is indicated by:
-
-			/sys/bus/event_source/devices/intel_pt/caps/tnt_disable
-
-		which contains "1" if the feature is supported and
-		"0" otherwise.
+Config terms are parameters specified with the -e intel_pt// event option,
+for example:
+
+	-e intel_pt/cyc/
+
+which selects cycle accurate mode. Each config term can have a value which
+defaults to 1, so the above is the same as:
+
+	-e intel_pt/cyc=1/
+
+Some terms are set by default, so must be set to 0 to turn them off. For
+example, to turn off branch tracing:
+
+	-e intel_pt/branch=0/
+
+Multiple config terms are separated by commas, for example:
+
+	-e intel_pt/cyc,mtc_period=9/
+
+There are also common config terms, see linkperf:perf-record[1] documentation.
+
+Intel PT config terms are described below.
+
+*tsc*::
+Always supported.  Produces TSC timestamp packets to provide
+timing information.  In some cases it is possible to decode
+without timing information, for example a per-thread context
+that does not overlap executable memory maps.
++
+The default config selects tsc (i.e. tsc=1).
+
+*noretcomp*::
+Always supported.  Disables "return compression" so a TIP packet
+is produced when a function returns.  Causes more packets to be
+produced but might make decoding more reliable.
++
+The default config does not select noretcomp (i.e. noretcomp=0).
+
+*psb_period*::
+Allows the frequency of PSB packets to be specified.
++
+The PSB packet is a synchronization packet that provides a
+starting point for decoding or recovery from errors.
++
+Support for psb_period is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
++
+which contains "1" if the feature is supported and "0"
+otherwise.
++
+Valid values are given by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/psb_periods
++
+which contains a hexadecimal value, the bits of which represent
+valid values e.g. bit 2 set means value 2 is valid.
++
+The psb_period value is converted to the approximate number of
+trace bytes between PSB packets as:
++
+	2 ^ (value + 11)
++
+e.g. value 3 means 16KiB between PSBs
++
+If an invalid value is entered, the error message
+will give a list of valid values e.g.
++
+	$ perf record -e intel_pt/psb_period=15/u uname
+	Invalid psb_period for intel_pt. Valid values are: 0-5
++
+If MTC packets are selected, the default config selects a value
+of 3 (i.e. psb_period=3) or the nearest lower value that is
+supported (0 is always supported).  Otherwise the default is 0.
++
+If decoding is expected to be reliable and the buffer is large
+then a large PSB period can be used.
++
+Because a TSC packet is produced with PSB, the PSB period can
+also affect the granularity of timing information in the absence
+of MTC or CYC.
+
+*mtc*::
+Produces MTC timing packets.
++
+MTC packets provide finer grain timestamp information than TSC
+packets.  MTC packets record time using the hardware crystal
+clock (CTC) which is related to TSC packets using a TMA packet.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/mtc
++
+which contains "1" if the feature is supported and
+"0" otherwise.
++
+The frequency of MTC packets can also be specified - see
+mtc_period below.
+
+*mtc_period*::
+Specifies how frequently MTC packets are produced - see mtc
+above for how to determine if MTC packets are supported.
++
+Valid values are given by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/mtc_periods
++
+which contains a hexadecimal value, the bits of which represent
+valid values e.g. bit 2 set means value 2 is valid.
++
+The mtc_period value is converted to the MTC frequency as:
++
+	CTC-frequency / (2 ^ value)
++
+e.g. value 3 means one eighth of CTC-frequency
++
+Where CTC is the hardware crystal clock, the frequency of which
+can be related to TSC via values provided in cpuid leaf 0x15.
++
+If an invalid value is entered, the error message
+will give a list of valid values e.g.
++
+	$ perf record -e intel_pt/mtc_period=15/u uname
+	Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
++
+The default value is 3 or the nearest lower value
+that is supported (0 is always supported).
+
+*cyc*::
+Produces CYC timing packets.
++
+CYC packets provide even finer grain timestamp information than
+MTC and TSC packets.  A CYC packet contains the number of CPU
+cycles since the last CYC packet. Unlike MTC and TSC packets,
+CYC packets are only sent when another packet is also sent.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
++
+which contains "1" if the feature is supported and
+"0" otherwise.
++
+The number of CYC packets produced can be reduced by specifying
+a threshold - see cyc_thresh below.
+
+*cyc_thresh*::
+Specifies how frequently CYC packets are produced - see cyc
+above for how to determine if CYC packets are supported.
++
+Valid cyc_thresh values are given by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
++
+which contains a hexadecimal value, the bits of which represent
+valid values e.g. bit 2 set means value 2 is valid.
++
+The cyc_thresh value represents the minimum number of CPU cycles
+that must have passed before a CYC packet can be sent.  The
+number of CPU cycles is:
++
+	2 ^ (value - 1)
++
+e.g. value 4 means 8 CPU cycles must pass before a CYC packet
+can be sent.  Note a CYC packet is still only sent when another
+packet is sent, not at, e.g. every 8 CPU cycles.
++
+If an invalid value is entered, the error message
+will give a list of valid values e.g.
++
+	$ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
+	Invalid cyc_thresh for intel_pt. Valid values are: 0-12
++
+CYC packets are not requested by default.
+
+*pt*::
+Specifies pass-through which enables the 'branch' config term.
++
+The default config selects 'pt' if it is available, so a user will
+never need to specify this term.
+
+*branch*::
+Enable branch tracing.  Branch tracing is enabled by default so to
+disable branch tracing use 'branch=0'.
++
+The default config selects 'branch' if it is available.
+
+*ptw*::
+Enable PTWRITE packets which are produced when a ptwrite instruction
+is executed.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/ptwrite
++
+which contains "1" if the feature is supported and
+"0" otherwise.
++
+As an alternative, refer to "Emulated PTWRITE" further below.
+
+*fup_on_ptw*::
+Enable a FUP packet to follow the PTWRITE packet.  The FUP packet
+provides the address of the ptwrite instruction.  In the absence of
+fup_on_ptw, the decoder will use the address of the previous branch
+if branch tracing is enabled, otherwise the address will be zero.
+Note that fup_on_ptw will work even when branch tracing is disabled.
+
+*pwr_evt*::
+Enable power events.  The power events provide information about
+changes to the CPU C-state.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/power_event_trace
++
+which contains "1" if the feature is supported and
+"0" otherwise.
+
+*event*::
+Enable Event Trace.  The events provide information about asynchronous
+events.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/event_trace
++
+which contains "1" if the feature is supported and
+"0" otherwise.
+
+*notnt*::
+Disable TNT packets.  Without TNT packets, it is not possible to walk
+executable code to reconstruct control flow, however FUP, TIP, TIP.PGE
+and TIP.PGD packets still indicate asynchronous control flow, and (if
+return compression is disabled - see noretcomp) return statements.
+The advantage of eliminating TNT packets is reducing the size of the
+trace and corresponding tracing overhead.
++
+Support for this feature is indicated by:
++
+	/sys/bus/event_source/devices/intel_pt/caps/tnt_disable
++
+which contains "1" if the feature is supported and
+"0" otherwise.
+
+
+config terms on other events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some Intel PT features, such as AUX area sampling and PEBS-via-PT, work with
+other events.  In those cases, the other events can have the config terms below:
+
+*aux-sample-size*::
+		Used to set the AUX area sample size, refer to the section
+		<<_aux_area_sampling_option,AUX area sampling option>>
+
+*aux-output*::
+		Used to select PEBS-via-PT, refer to the
+		section <<_pebs_via_intel_pt,PEBS via Intel PT>>
 
 
 AUX area sampling option
@@ -596,7 +643,8 @@ The default snapshot size is the auxtrace mmap size.  If neither auxtrace mmap s
 nor snapshot size is specified, then the default is 4MiB for privileged users
 (or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
 If an unprivileged user does not specify mmap pages, the mmap pages will be
-reduced as described in the 'new auxtrace mmap size option' section below.
+reduced as described in the <<_new_auxtrace_mmap_size_option,new auxtrace mmap size option>>
+section below.
 
 The snapshot size is displayed if the option -vv is used e.g.
 
@@ -952,11 +1000,11 @@ transaction start, commit or abort.
 
 Note that "instructions", "cycles", "branches" and "transactions" events
 depend on code flow packets which can be disabled by using the config term
-"branch=0".  Refer to the config terms section above.
+"branch=0".  Refer to the <<_config_terms,config terms>> section above.
 
 "ptwrite" events record the payload of the ptwrite instruction and whether
 "fup_on_ptw" was used.  "ptwrite" events depend on PTWRITE packets which are
-recorded only if the "ptw" config term was used.  Refer to the config terms
+recorded only if the "ptw" config term was used.  Refer to the <<_config_terms,config terms>>
 section above.  perf script "synth" field displays "ptwrite" information like
 this: "ip: 0 payload: 0x123456789abcdef0"  where "ip" is 1 if "fup_on_ptw" was
 used.
@@ -964,7 +1012,7 @@ used.
 "Power" events correspond to power event packets and CBR (core-to-bus ratio)
 packets.  While CBR packets are always recorded when tracing is enabled, power
 event packets are recorded only if the "pwr_evt" config term was used.  Refer to
-the config terms section above.  The power events record information about
+the <<_config_terms,config terms>> section above.  The power events record information about
 C-state changes, whereas CBR is indicative of CPU frequency.  perf script
 "event,synth" fields display information like this:
 
@@ -1120,7 +1168,7 @@ What *will* be decoded with the (single) q option:
 	- asynchronous branches such as interrupts
 	- indirect branches
 	- function return target address *if* the noretcomp config term (refer
-	config terms section) was used
+	<<_config_terms,config terms>> section) was used
 	- start of (control-flow) tracing
 	- end of (control-flow) tracing, if it is not out of context
 	- power events, ptwrite, transaction start and abort
@@ -1133,7 +1181,7 @@ Repeating the q option (double-q i.e. qq) results in even faster decoding and ev
 less detail.  The decoder decodes only extended PSB (PSB+) packets, getting the
 instruction pointer if there is a FUP packet within PSB+ (i.e. between PSB and
 PSBEND).  Note PSB packets occur regularly in the trace based on the psb_period
-config term (refer config terms section).  There will be a FUP packet if the
+config term (refer <<_config_terms,config terms>> section).  There will be a FUP packet if the
 PSB+ occurs while control flow is being traced.
 
 What will *not* be decoded with the qq option:
-- 
2.43.0
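
As an aside on the config terms documented above: the capability files hold a
hexadecimal bit mask, and the psb_period and mtc_period values are exponents.
The following is only a minimal shell sketch of that arithmetic. The function
names (valid_values, psb_bytes, mtc_divisor) and the example masks 3f and 249
are hypothetical, chosen to match the "0-5" and "0,3,6,9" lists in the example
error messages; they are not part of perf.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the man page arithmetic; not part of perf.

# Print the values whose bits are set in a hexadecimal capability mask,
# e.g. the contents of .../intel_pt/caps/psb_periods.
# "bit 2 set means value 2 is valid".
valid_values() {
	hex=${1#0x}		# tolerate an optional 0x prefix
	mask=$((0x$hex))
	value=0
	out=""
	while [ "$mask" -ne 0 ]; do
		if [ $((mask & 1)) -eq 1 ]; then
			out="${out:+$out,}$value"
		fi
		mask=$((mask >> 1))
		value=$((value + 1))
	done
	echo "$out"
}

# Approximate number of trace bytes between PSB packets: 2 ^ (value + 11)
psb_bytes() {
	echo $((1 << ($1 + 11)))
}

# MTC frequency is CTC-frequency divided by this: 2 ^ value
mtc_divisor() {
	echo $((1 << $1))
}

valid_values 3f		# prints 0,1,2,3,4,5
valid_values 249	# prints 0,3,6,9
psb_bytes 3		# prints 16384 (16KiB)
mtc_divisor 3		# prints 8 (one eighth of CTC-frequency)
```

On a real system the mask would be read from the caps file, for example
valid_values "$(cat /sys/bus/event_source/devices/intel_pt/caps/psb_periods)".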

[PATCH V13 13/14] perf intel-pt: Add documentation for pause / resume

Posted by Adrian Hunter 1 year, 3 months ago

Document the use of aux-action config term and provide a simple example.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---


Changes in V5:
	Added more examples


 tools/perf/Documentation/perf-intel-pt.txt | 108 +++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/tools/perf/Documentation/perf-intel-pt.txt b/tools/perf/Documentation/perf-intel-pt.txt
index ad39bf20f862..cc0f37f0fa5a 100644
--- a/tools/perf/Documentation/perf-intel-pt.txt
+++ b/tools/perf/Documentation/perf-intel-pt.txt
@@ -555,6 +555,9 @@ Support for this feature is indicated by:
 which contains "1" if the feature is supported and
 "0" otherwise.
 
+*aux-action=start-paused*::
+Start tracing paused, refer to the section <<_pause_or_resume_tracing,Pause or Resume Tracing>>
+
 
 config terms on other events
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -570,6 +573,9 @@ and PEBS-via-PT.  In those cases, the other events can have config terms below:
 		Used to select PEBS-via-PT, refer to the
 		section <<_pebs_via_intel_pt,PEBS via Intel PT>>
 
+*aux-action*::
+		Used to pause or resume tracing, refer to the section
+		<<_pause_or_resume_tracing,Pause or Resume Tracing>>
 
 AUX area sampling option
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1915,6 +1921,108 @@ For pipe mode, the order of events and timestamps can presumably
 be messed up.
 
 
+Pause or Resume Tracing
+-----------------------
+
+With newer kernels, it is possible to use other selected events to pause
+or resume Intel PT tracing.  This is configured by using the "aux-action"
+config term:
+
+"aux-action=pause" is used with events that are to pause Intel PT tracing.
+
+"aux-action=resume" is used with events that are to resume Intel PT tracing.
+
+"aux-action=start-paused" is used with the Intel PT event to start in a
+paused state.
+
+For example, to trace only the uname system call (sys_newuname) when running the
+command line utility uname:
+
+ $ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
+ Linux
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 0.043 MB perf.data ]
+ $ perf script --call-trace
+ uname   30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
+ uname   30805 [000] 24001.058784424:  psb offs: 0
+ uname   30805 [000] 24001.058784424:  cbr: 39 freq: 3904 MHz (139%)
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        debug_smp_processor_id
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        __x64_sys_newuname
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            down_read
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                __cond_resched
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                    in_lock_functions
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_sub
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            up_read
+ uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    in_lock_functions
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                preempt_count_sub
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            _copy_to_user
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])        syscall_exit_to_user_mode
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            syscall_exit_work
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                perf_syscall_exit
+ uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    debug_smp_processor_id
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_trace_buf_alloc
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_get_recursion_context
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            debug_smp_processor_id
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        debug_smp_processor_id
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_tp_event
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_trace_buf_update
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            tracing_gen_ctx_irq_test
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_event
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            __perf_event_account_interrupt
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                __this_cpu_preempt_check
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            perf_event_output_forward
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                perf_event_aux_pause
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                    ring_buffer_get
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_lock
+ uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_unlock
+ uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                    pt_event_stop
+ uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
+ uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
+ uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        native_write_msr
+ uname   30805 [000] 24001.058785463: ([kernel.kallsyms])                                        native_write_msr
+ uname   30805 [000] 24001.058785639: 0x0
+
+The example above uses tracepoints, but any kind of sampled event can be used.
+
+For example:
+
+ Tracing between arch_cpu_idle_enter() and arch_cpu_idle_exit() using breakpoint events:
+
+ $ sudo cat /proc/kallsyms | sort | grep ' arch_cpu_idle_enter\| arch_cpu_idle_exit'
+ ffffffffb605bf60 T arch_cpu_idle_enter
+ ffffffffb614d8a0 W arch_cpu_idle_exit
+ $ sudo perf record --kcore -a -e intel_pt/aux-action=start-paused/k -e mem:0xffffffffb605bf60:x/aux-action=resume/ -e mem:0xffffffffb614d8a0:x/aux-action=pause/ -- sleep 1
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 1.387 MB perf.data ]
+
+ Tracing __alloc_pages() using kprobes:
+
+ $ sudo perf probe --add '__alloc_pages order'
+ Added new event:  probe:__alloc_pages  (on __alloc_pages with order)
+ $ sudo perf probe --add __alloc_pages%return
+ Added new event:  probe:__alloc_pages__return (on __alloc_pages%return)
+ $ sudo perf record --kcore -aR -e intel_pt/aux-action=start-paused/k -e probe:__alloc_pages/aux-action=resume/ -e probe:__alloc_pages__return/aux-action=pause/ -- sleep 1
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 1.490 MB perf.data ]
+
+ Tracing starting at main() using a uprobe event:
+
+ $ sudo perf probe -x /usr/bin/uname main
+ Added new event:  probe_uname:main     (on main in /usr/bin/uname)
+ $ sudo perf record -e intel_pt/aux-action=start-paused/u -e probe_uname:main/aux-action=resume/ -- uname
+ Linux
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 0.031 MB perf.data ]
+
+ Tracing occasionally using cycles events with different periods:
+
+ $ perf record --kcore -a -m,64M -e intel_pt/aux-action=start-paused/k -e cycles/aux-action=pause,period=1000000/Pk -e cycles/aux-action=resume,period=10500000/Pk -- firefox
+ [ perf record: Woken up 19 times to write data ]
+ [ perf record: Captured and wrote 16.561 MB perf.data ]
+
+
 EXAMPLE
 -------
 
-- 
2.43.0

[PATCH V13 14/14] perf intel-pt: Add a test for pause / resume

Posted by Adrian Hunter 1 year, 3 months ago

Add a simple sub-test to the "Miscellaneous Intel PT testing" test to
check pause / resume.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/tests/shell/test_intel_pt.sh | 28 +++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/perf/tests/shell/test_intel_pt.sh b/tools/perf/tests/shell/test_intel_pt.sh
index 723ec501f99a..e359db0d0ff2 100755
--- a/tools/perf/tests/shell/test_intel_pt.sh
+++ b/tools/perf/tests/shell/test_intel_pt.sh
@@ -644,6 +644,33 @@ test_pipe()
 	return 0
 }
 
+test_pause_resume()
+{
+	echo "--- Test with pause / resume ---"
+	if ! perf_record_no_decode -o "${perfdatafile}" -e intel_pt/aux-action=start-paused/u uname ; then
+		echo "SKIP: pause / resume is not supported"
+		return 2
+	fi
+	if ! perf_record_no_bpf -o "${perfdatafile}" \
+			-e intel_pt/aux-action=start-paused/u \
+			-e instructions/period=50000,aux-action=resume,name=Resume/u \
+			-e instructions/period=100000,aux-action=pause,name=Pause/u uname  ; then
+		echo "perf record with pause / resume failed"
+		return 1
+	fi
+	if ! perf script -i "${perfdatafile}" --itrace=b -Fperiod,event | \
+			awk 'BEGIN {paused=1;branches=0}
+			     /Resume/ {paused=0}
+			     /branches/ {if (paused) exit 1;branches=1}
+			     /Pause/ {paused=1}
+			     END {if (!branches) exit 1}' ; then
+		echo "perf record with pause / resume failed"
+		return 1
+	fi
+	echo OK
+	return 0
+}
+
 count_result()
 {
 	if [ "$1" -eq 2 ] ; then
@@ -672,6 +699,7 @@ test_power_event			|| ret=$? ; count_result $ret ; ret=0
 test_no_tnt				|| ret=$? ; count_result $ret ; ret=0
 test_event_trace			|| ret=$? ; count_result $ret ; ret=0
 test_pipe				|| ret=$? ; count_result $ret ; ret=0
+test_pause_resume			|| ret=$? ; count_result $ret ; ret=0
 
 cleanup
 
-- 
2.43.0
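
For readers who want to see what the awk check in test_pause_resume() accepts
and rejects, here is a minimal sketch that runs the same awk program on
hand-written input. The wrapper name check_pause_resume and the simplified
input lines are hypothetical; real "perf script -Fperiod,event" output carries
more fields, but the program only matches on the event names.

```shell
#!/bin/sh
# Same awk state machine as in test_pause_resume(), wrapped in a
# hypothetical helper that reads event lines on stdin.
# Tracing starts paused; a "branches" sample seen while paused is an
# error, and at least one "branches" sample must be seen overall.
check_pause_resume() {
	awk 'BEGIN {paused=1;branches=0}
	     /Resume/ {paused=0}
	     /branches/ {if (paused) exit 1;branches=1}
	     /Pause/ {paused=1}
	     END {if (!branches) exit 1}'
}

# Branches only between Resume and Pause: accepted
if printf '%s\n' 'Resume:' 'branches:' 'Pause:' | check_pause_resume; then
	echo "sequence 1 accepted"
fi

# Branches before any Resume, i.e. while still paused: rejected
if ! printf '%s\n' 'branches:' 'Resume:' | check_pause_resume; then
	echo "sequence 2 rejected"
fi

# No branches at all: rejected
if ! printf '%s\n' 'Resume:' 'Pause:' | check_pause_resume; then
	echo "sequence 3 rejected"
fi
```

The third case is why the END rule exists: a recording that never resumed
would otherwise pass without showing that tracing happened at all.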