[PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency

Reinette Chatre posted 2 patches 1 year, 6 months ago
[PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency
Posted by Reinette Chatre 1 year, 6 months ago
From: Isaku Yamahata <isaku.yamahata@intel.com>

Test if KVM emulates the APIC bus clock at the expected frequency when
userspace configures the frequency via KVM_CAP_X86_APIC_BUS_CYCLES_NS.

Set APIC timer's initial count to the maximum value and busy wait for 100
msec (largely arbitrary) using the TSC. Read the APIC timer's "current
count" to calculate the actual APIC bus clock frequency based on TSC
frequency.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes v9:
- Use uint64_t to store APIC frequency (was unsigned long). (Sean)
- Introduce new helpers (apic_enable(), apic_read_reg(), and
  apic_write_reg()) that runs appropriate xAPIC or x2APIC functions
  based on new is_x2apic global. (Sean)
- Use new apic_* helpers in guest test code to eliminate duplicate
  code. (Sean)
- Use constant defined at beginning of function for ICR register
  value. (Sean)
- Drop unnecessary GUEST_ASSERT(tsc_cycles > 0). (Sean)
- Drop unnecessary comments, use provided comments. (Sean)
- Change APIC frequency command line option to "-f" (was "-a"). (Sean)
- Enable user to override delay used in guest with new "-d" command
  line option. (Sean)
- Open code test checks in what is now single guest test function.

Changes v8:
- With TSC frequency switching to unsigned int type the implicit
  type promotion during bus frequency computation results in
  overflow. Use explicit unsigned long type within computation to
  avoid overflow.

Changes v7:
- Drop Maxim Levitsky's Reviewed-by because of significant changes.
- Remove redefine of _GNU_SOURCE. (kernel test robot)
- Rewrite changelog and test description. (Sean)
- Enable user space to set APIC bus frequency. (Sean)
- Use GUEST_ASSERT() from guest instead of TEST_ASSERT() from host. (Sean)
- Test xAPIC as well as x2APIC. (Sean)
- Add check that KVM_CAP_X86_APIC_BUS_CYCLES_NS cannot be set
  after vCPU created. (Sean)
- Use new udelay() utility. (Sean)
- Drop CLI. Mask LVT timer independently. (Sean)
- Use correct LVT timer entry (0x350 -> 0x320) to enable oneshot operation.
- Remove unnecessary static functions from single file test.
- Be consistent in types by using uint32_t/uint64_t instead of
  u32/u64.

Changes v6:
- Use vm_create() wrapper instead of open coding it. (Zide)
- Improve grammar of test description. (Zide)

Changes v5:
- Update to new name of capability KVM_CAP_X86_APIC_BUS_FREQUENCY ->
  KVM_CAP_X86_APIC_BUS_CYCLES_NS. (Xiaoyao Li)

Changes v4:
- Rework changelog.
- Add Sean's "Suggested-by" to acknowledge guidance received in
  https://lore.kernel.org/all/ZU0BASXWcck85r90@google.com/
- Add copyright.
- Add test description to file header.
- Consistent capitalization for acronyms.
- Rebase to kvm-x86/next.
- Update to v4 change of providing bus clock rate in nanoseconds.
- Add a "TEST_REQUIRE()" for the new capability so that the test can
  work on kernels that do not support the new capability.
- Address checkpatch warnings and use tabs instead of spaces in header
  file to match existing code.

Changes v3:
- Use 1.5GHz instead of 1GHz as frequency.

Changes v2:
- Newly added.
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/apic.h       |   8 +
 .../kvm/x86_64/apic_bus_clock_test.c          | 202 ++++++++++++++++++
 3 files changed, 211 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ce8ff8e8ce3a..ad8b5d15f2bd 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -112,6 +112,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_invalid_nested_guest_state
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
+TEST_GEN_PROGS_x86_64 += x86_64/apic_bus_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
 TEST_GEN_PROGS_x86_64 += x86_64/xapic_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/xcr0_cpuid_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/apic.h b/tools/testing/selftests/kvm/include/x86_64/apic.h
index bed316fdecd5..0f268b55fa06 100644
--- a/tools/testing/selftests/kvm/include/x86_64/apic.h
+++ b/tools/testing/selftests/kvm/include/x86_64/apic.h
@@ -60,6 +60,14 @@
 #define		APIC_VECTOR_MASK	0x000FF
 #define	APIC_ICR2	0x310
 #define		SET_APIC_DEST_FIELD(x)	((x) << 24)
+#define APIC_LVTT	0x320
+#define		APIC_LVT_TIMER_ONESHOT		(0 << 17)
+#define		APIC_LVT_TIMER_PERIODIC		(1 << 17)
+#define		APIC_LVT_TIMER_TSCDEADLINE	(2 << 17)
+#define		APIC_LVT_MASKED			(1 << 16)
+#define	APIC_TMICT	0x380
+#define	APIC_TMCCT	0x390
+#define	APIC_TDCR	0x3E0
 
 void apic_disable(void);
 void xapic_enable(void);
diff --git a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
new file mode 100644
index 000000000000..bff3cd95134f
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Verify KVM correctly emulates the APIC bus frequency when the VMM configures
+ * the frequency via KVM_CAP_X86_APIC_BUS_CYCLES_NS.  Start the APIC timer by
+ * programming TMICT (timer initial count) to the largest value possible (so
+ * that the timer will not expire during the test).  Then, after an arbitrary
+ * amount of time has elapsed, verify TMCCT (timer current count) is within 1%
+ * of the expected value based on the time elapsed, the APIC bus frequency, and
+ * the programmed TDCR (timer divide configuration register).
+ */
+
+#include "apic.h"
+#include "test_util.h"
+
+/*
+ * Pick 25MHz for APIC bus frequency. Different enough from the default 1GHz.
+ * User can override via command line.
+ */
+static uint64_t apic_hz = 25 * 1000 * 1000;
+
+/*
+ * Delay in msec that guest uses to determine APIC bus frequency.
+ * User can override via command line.
+ */
+static unsigned long delay_ms = 100;
+
+/*
+ * Possible TDCR values with matching divide count. Used to modify APIC
+ * timer frequency.
+ */
+static struct {
+	uint32_t tdcr;
+	uint32_t divide_count;
+} tdcrs[] = {
+	{0x0, 2},
+	{0x1, 4},
+	{0x2, 8},
+	{0x3, 16},
+	{0x8, 32},
+	{0x9, 64},
+	{0xa, 128},
+	{0xb, 1},
+};
+
+/* true if x2APIC test is running, false if xAPIC test is running. */
+static bool is_x2apic;
+
+static void apic_enable(void)
+{
+	if (is_x2apic)
+		x2apic_enable();
+	else
+		xapic_enable();
+}
+
+static uint32_t apic_read_reg(unsigned int reg)
+{
+	return is_x2apic ? x2apic_read_reg(reg) : xapic_read_reg(reg);
+}
+
+static void apic_write_reg(unsigned int reg, uint32_t val)
+{
+	if (is_x2apic)
+		x2apic_write_reg(reg, val);
+	else
+		xapic_write_reg(reg, val);
+}
+
+static void apic_guest_code(void)
+{
+	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
+	const uint32_t tmict = ~0u;
+	uint64_t tsc0, tsc1, freq;
+	uint32_t tmcct;
+	int i;
+
+	apic_enable();
+
+	/*
+	 * Setup one-shot timer.  The vector does not matter because the
+	 * interrupt should not fire.
+	 */
+	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
+
+	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
+
+		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
+		apic_write_reg(APIC_TMICT, tmict);
+
+		tsc0 = rdtsc();
+		udelay(delay_ms * 1000);
+		tmcct = apic_read_reg(APIC_TMCCT);
+		tsc1 = rdtsc();
+
+		/*
+		 * Stop the timer _after_ reading the current, final count, as
+		 * writing the initial counter also modifies the current count.
+		 */
+		apic_write_reg(APIC_TMICT, 0);
+
+		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
+		/* Check if measured frequency is within 1% of configured frequency. */
+		GUEST_ASSERT(freq < apic_hz * 101 / 100);
+		GUEST_ASSERT(freq > apic_hz * 99 / 100);
+	}
+
+	GUEST_DONE();
+}
+
+static void test_apic_bus_clock(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			TEST_FAIL("Unknown ucall %lu", uc.cmd);
+			break;
+		}
+	}
+}
+
+static void run_apic_bus_clock_test(bool x2apic)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int ret;
+
+	is_x2apic = x2apic;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, apic_hz);
+	sync_global_to_guest(vm, delay_ms);
+	sync_global_to_guest(vm, is_x2apic);
+
+	vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
+		      NSEC_PER_SEC / apic_hz);
+
+	vcpu = vm_vcpu_add(vm, 0, apic_guest_code);
+
+	ret = __vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
+			      NSEC_PER_SEC / apic_hz);
+	TEST_ASSERT(ret < 0 && errno == EINVAL,
+		    "Setting of APIC bus frequency after vCPU is created should fail.");
+
+	if (!is_x2apic)
+		virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+
+	test_apic_bus_clock(vcpu);
+	kvm_vm_free(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-d delay] [-f APIC bus freq]\n", name);
+	puts("");
+	printf("-d: Delay (in msec) guest uses to measure APIC bus frequency.\n");
+	printf("-f: The APIC bus frequency (in Hz) to be configured for the guest.\n");
+	puts("");
+}
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
+
+	while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
+		switch (opt) {
+		case 'f':
+			apic_hz = atol(optarg);
+			break;
+		case 'd':
+			delay_ms = atol(optarg);
+			break;
+		case 'h':
+			help(argv[0]);
+			exit(0);
+		default:
+			help(argv[0]);
+			exit(1);
+		}
+	}
+
+	run_apic_bus_clock_test(false);
+	run_apic_bus_clock_test(true);
+}
-- 
2.34.1
Re: [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency
Posted by Sean Christopherson 1 year, 5 months ago
On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Pick 25MHz for APIC bus frequency. Different enough from the default 1GHz.
> + * User can override via command line.
> + */
> +static uint64_t apic_hz = 25 * 1000 * 1000;
> +
> +/*
> + * Delay in msec that guest uses to determine APIC bus frequency.
> + * User can override via command line.
> + */
> +static unsigned long delay_ms = 100;

There's no need for these to be global, it's easy enough to pass them as params
to apic_guest_code().  Making is_x2apic is wortwhile as it cuts down on the noise,
but for these, I think it's better to keep them local.

> +
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;

These can/should all be const.

> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};
> +
> +/* true if x2APIC test is running, false if xAPIC test is running. */
> +static bool is_x2apic;
> +
> +static void apic_enable(void)
> +{
> +	if (is_x2apic)
> +		x2apic_enable();
> +	else
> +		xapic_enable();
> +}
> +
> +static uint32_t apic_read_reg(unsigned int reg)
> +{
> +	return is_x2apic ? x2apic_read_reg(reg) : xapic_read_reg(reg);
> +}
> +
> +static void apic_write_reg(unsigned int reg, uint32_t val)
> +{
> +	if (is_x2apic)
> +		x2apic_write_reg(reg, val);
> +	else
> +		xapic_write_reg(reg, val);
> +}
> +
> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */

1% is likely too aggressive, i.e. we'll get false failures due to host activity.
On our systems, even a single pr_warn() in the timer path causes failure.  For
now, I think 5% is good enough, e.g. it'll catch cases where KVM is waaay off.
If we want to do better (or that's still too tight), then we can add a '-t <tolerance'
param or something.

> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);

Combine these into a single assert, and print the params, i.e. don't rely on the
line number to figure out what's up.

		__GUEST_ASSERT(freq < apic_hz * 105 / 100 && freq > apic_hz * 95 / 100,
			       "Frequency = %lu (wanted %lu - %lu), bus = %lu, div = %u, tsc = %lu",
			       freq, apic_hz * 95 / 100, apic_hz * 105 / 100,
			       apic_hz, tdcrs[i].divide_count, tsc_hz);

> +int main(int argc, char *argv[])
> +{
> +	int opt;
> +
> +	TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
> +
> +	while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
> +		switch (opt) {
> +		case 'f':
> +			apic_hz = atol(optarg);
> +			break;
> +		case 'd':
> +			delay_ms = atol(optarg);
> +			break;
> +		case 'h':
> +			help(argv[0]);
> +			exit(0);
> +		default:
> +			help(argv[0]);
> +			exit(1);

Heh, selftests are anything but consistent, but this should be exit(KSFT_SKIP)
for both.

> +		}
> +	}
> +
> +	run_apic_bus_clock_test(false);
> +	run_apic_bus_clock_test(true);
> +}
> -- 
> 2.34.1
>
VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
Posted by Sean Christopherson 1 year, 5 months ago
Forking this off to try and avoid confusion...

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;
> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};

...

> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */
> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> +	}

This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
bug.  It looks like something APICv related is clobbering internal VMX timer state?
Or maybe there's a tearing or truncation issue?

As mentioned ad nauseum at this point, I'm offline all of next week, so hopefully
there's enough info here to get a root cause...


A spurious VM-Exit will occur after programming a vmcs.PREEMPTION_TIMER_VALUE that
shouldn't exit.  Every observed failure occurs when bits 27:16 are zero, with a
small value in bits 15:0, e.g. VM-Enter with a timer value of 0xe0003bf7 or
0xa0006db6 will cause a near-immediate VM-Exit.

Weirdly, it doesn't always happen, e.g. I've observed rollover from 0xa000xxxx
to 0x9fffxxxx without failure.  However, the *test* failure is 100% reproducible,
i.e. it's not _that_ rare of a failure.  So maybe there's state build-up required?
E.g. in the failing cases, there are 10s of entries with a slightly larger timer
value, with no preemption timer exit (the host's tick IRQ interrupts the guest,
and then KVM reprograms the VMX timer).

Even more sketchy, the failure only occurs if APICv is enabled.  Turning off APICv
makes the problem go away (I initially suspected KVM was somehow botching the TMCCT
emulation).  I am 99.9% certain there is no KVM APICv bug that is causing problems.
Our IVB servers don't fail (even with APICv enabled), nor does my Raptor Lake client
box (with APICv enabled).  And I confirmed that the VMX timer is still getting
programmed with the same sequence that fails when APICv is enabled.

(Before I realized the pattern of values), I sanity checked the VMCS field just
before VM-Enter, and again after VM-Exit (KVM runs without the CPU save
vmcs.PREEMPTION_TIMER_VALUE on exit).

I also verified the CPU thinks the timer has expired by enabling "save on exit"
and verifying vmcs.PREEMPTION_TIMER_VALUE is indeed '0', and that KVM really did
get a PREEMPTION_TIMER exit.

Attached is the debug patch I used to get the below data (sort of; one of the
post-exit prints is without saving vmcs.PREEMPTION_TIMER_VALUE on exit).

In kvm_hypercall (ignore the name, I piggybacked a tracepoint because trying to
log to dmesg was too slow, guest literally couldn't make forward progress), a1 is
the VMX timer value programmed by KVM (0xe0003bf7).
 
 apic_bus_clock_-11419   [056] d..2.   146.771179: kvm_hypercall: nr 0x2c8a9e9cc6703c a0 0x2c8b0e9ce46c37 a1 0xe0003bf7 a2 0xe0003bf7 a3 0x7  
 apic_bus_clock_-11419   [056] d..2.   146.771242: kvm_exit: vcpu 0 reason PREEMPTION_TIMER rip 0x402065 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 VMX timer exit, VMCS = e0003bf7, delta = e0003690

and the post-exit print with the attached patch:

  kvm_intel: VMX timer exit, EXIT_REASON = 34, VMCS = 0, delta = e00037e9

And I've been fiddling with the below hack to coerce KVM into programming VMX
timer values.  Had I more time, I would have booted kernels with the ability to
properly fuzz the timer values.

Note, with TMICT=-1, only divide_count=1 will fail, because larger divide counts
result in a timeout that doesn't fit into the 32-bit VMX timer field (don't ask
me how long it took me to realize the divide count affects the time frequency,
not the actual count, *sigh*).

diff --git a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
index f8916bb34405..4eb49e20ff9c 100644
--- a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
+++ b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
@@ -22,13 +22,13 @@ static const struct {
        const uint32_t tdcr;
        const uint32_t divide_count;
 } tdcrs[] = {
-       {0x0, 2},
-       {0x1, 4},
-       {0x2, 8},
-       {0x3, 16},
-       {0x8, 32},
-       {0x9, 64},
-       {0xa, 128},
+       // {0x0, 2},
+       // {0x1, 4},
+       // {0x2, 8},
+       // {0x3, 16},
+       // {0x8, 32},
+       // {0x9, 64},
+       // {0xa, 128},
        {0xb, 1},
 };
 
@@ -55,10 +55,11 @@ static void apic_write_reg(unsigned int reg, uint32_t val)
                xapic_write_reg(reg, val);
 }
 
+uint32_t tmict = ~0u;
+
 static void apic_guest_code(uint64_t apic_hz, uint64_t delay_ms)
 {
        uint64_t tsc_hz = guest_tsc_khz * 1000;
-       const uint32_t tmict = ~0u;
        uint64_t tsc0, tsc1, freq;
        uint32_t tmcct;
        int i;
@@ -133,6 +134,7 @@ static void run_apic_bus_clock_test(uint64_t apic_hz, uint64_t delay_ms,
        vm = vm_create(1);
 
        sync_global_to_guest(vm, is_x2apic);
+       sync_global_to_guest(vm, tmict);
 
        vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
                      NSEC_PER_SEC / apic_hz);
@@ -174,7 +176,7 @@ int main(int argc, char *argv[])
 
        TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
 
-       while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
+       while ((opt = getopt(argc, argv, "d:f:i:h")) != -1) {
                switch (opt) {
                case 'f':
                        apic_hz = atoi_positive("APIC bus frequency", optarg) * 1000 * 1000;
@@ -182,6 +184,9 @@ int main(int argc, char *argv[])
                case 'd':
                        delay_ms = atoi_positive("Delay in milliseconds", optarg);
                        break;
+               case 'i':
+                       tmict = ~0u - atoi_positive("offset", optarg);
+                       break;
                case 'h':
                default:
                        help(argv[0]);
Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
Posted by Reinette Chatre 1 year, 5 months ago
Hi Sean,

On 6/28/24 5:39 PM, Sean Christopherson wrote:
> Forking this off to try and avoid confusion...
> 
> On Wed, Jun 12, 2024, Reinette Chatre wrote:

...

>> +
>> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>> +		/* Check if measured frequency is within 1% of configured frequency. */
>> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
>> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
>> +	}
> 
> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> bug.  It looks like something APICv related is clobbering internal VMX timer state?
> Or maybe there's a tearing or truncation issue?

It has been a few days. Just a note to let you know that we are investigating this.
On my side I have not yet been able to reproduce this issue. I tested
kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
apic_bus_clock_test and they all passed. Since I have lack of experience here there are
some Intel virtualization experts helping out with this investigation and I hope that
they will be some insights from the analysis and testing that you already provided.

Reinette
Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
Posted by Yuan Yao 1 year, 5 months ago
On Wed, Jul 03, 2024 at 01:14:09PM -0700, Reinette Chatre wrote:
> Hi Sean,
>
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
> > Forking this off to try and avoid confusion...
> >
> > On Wed, Jun 12, 2024, Reinette Chatre wrote:
>
> ...
>
> > > +
> > > +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> > > +		/* Check if measured frequency is within 1% of configured frequency. */
> > > +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> > > +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> > > +	}
> >
> > This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> > bug.  It looks like something APICv related is clobbering internal VMX timer state?
> > Or maybe there's a tearing or truncation issue?
>
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.
>
> Reinette
>

I can reproduce this on my side, even with apicv disabled by
'insmod $kernel_path/arch/x86/kvm/kvm-intel.ko enable_apicv=N'.
@Sean I think we're observing same issue, please see the details
below:

apic_bus_clock_test can't reproduce this may because the
preemption timer value calculation relies on apic bus
frequency/TMICT/Divide count/host TSC frequency and ratio
between preemption timer and host TSC frequency, too many
factors to generate the 'magic' value there. So I changed
KVM and added a small KVM kselftest tool to set the
preemption timer value directly from guest, this makes the
reproducing easily. The changes are attached at end of this
comment.

The trace I captured below came form host with 1.7GHz TSC,
the VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is enabled to get the
cpu saved vmcs.VMX_PREEMPTION_TIMER_VALUE after VMEXIT. I
set the vmcs.VMX_PREEMPTION_TIMER_VALUE to 0x880042ad which
is the 'magic' number on this 1.7Ghz TSC machine:

preempt_test  20677.199521: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5554 a1:0x880042ad a2:0x880042ad a3:0x20462e98d9b9
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The value read back from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, for double confirmation.
  a3: The host tsc at the time point, debug only.

preempt_test  20677.199579:      kvm:kvm_exit: reason PREEMPTION_TIMER rip 0x40274d info 0 0 intr 0

preempt_test  20677.199579: kvm:kvm_vmx_debug: kvm_vmx_debug: 2, a0:0x34 a1:0x0 a2:0x87fea9b0 a3:0x20462e9a5749
  a0: The VMEXIT reason, 0x34 is preemption timer VMEXIT.
  a1: The read back vmcs.VMX_PREEMPTION_TIMER_VALUE value, here 0.
  a2: The next preemption timer value should be written to
      vmcs, calculates from the (target tsc - current tsc) >>
      7. Now the preemption timer vmexit happend only after
      ~58 microseconds elapsed, it should happen after
      ~171.79 seconds but not such soon, the issue is
      reproduced.

Another more easy way to observe this symptom w/o care the
'magic' preemption timer vlaue is use the maximum preemption
timer value 0xffffffff, below log w/ 0xffffffff is captured
from same machine:

preempt_test 20530.456589: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5551 a1:0xffffffff a2:0xffffffff a3:0x200c1971ca5d

  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The read back value from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, double confirmation.
  a3: the host tsc at the time point, debug only.

preempt_test 20530.456690:      kvm:kvm_exit: reason VMCALL rip 0x4131a0 info 0 0 intr 0
  The preempt_test checks preemption timer state every
  100us, this VMEXIT is expected behavior.

preempt_test 20530.456691:     kvm:kvm_entry: vcpu 0, rip 0x4131a3

preempt_test 20530.456691: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77ff82cc a1:0xfffe900b a2:0xfffe900b a3:0x200c19746d45
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT. The difference value
      shouldn't be such huge number 0xffffffff - 0x77ff82cc
      = 0x88007D33 when just ~100us elapsed from previous
      VMENTRY, the issue is reproduced.

Use 0x88000000 as preemption timer value to verify this
preempt_test tool, the preemption timer VMEXIT happend after
~171.45 seconds which is expected behavior on the host
1.7Ghz TSC system:

The preemption timer VMEXIT should happen after:
2281701376 × 128 / 1700000000 = 171.79 seconds.

Attached my changes in KVM and kselftest tool for
reproducing here, based on:
https://github.com/kvm-x86/linux.git
tag:kvm-x86-next-2024.06.28

Patch 01:

From a977bf12a8cd1bbe401e68d3702c0b5aa3bf66e4 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Thu, 4 Jul 2024 09:59:55 +0800
Subject: [PATCH 1/2] KVM: x86: Introudce trace_kvm_vmx_debug()

debug only common trace.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 arch/x86/kvm/trace.h | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c   |  1 +
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..7b9eb23d71d3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -34,6 +34,34 @@ TRACE_EVENT(kvm_entry,
 		  __entry->immediate_exit ? "[immediate exit]" : "")
 );

+TRACE_EVENT(kvm_vmx_debug,
+	    TP_PROTO(unsigned long n, unsigned long a0,
+		     unsigned long a1,
+		     unsigned long a2,
+		     unsigned long a3),
+	    TP_ARGS(n, a0, a1, a2, a3),
+
+	TP_STRUCT__entry(
+		 __field(	unsigned long,	n		)
+		__field(	unsigned long,	a0		)
+		__field(	unsigned long,	a1		)
+		__field(	unsigned long,	a2		)
+		__field(	unsigned long,	a3		)
+	),
+
+	TP_fast_assign(
+		__entry->n      = n;
+		__entry->a0		= a0;
+		__entry->a1		= a1;
+		__entry->a2		= a2;
+		__entry->a3		= a3;
+	),
+
+	TP_printk("kvm_vmx_debug: %ld, a0:0x%lx a1:0x%lx a2:0x%lx a3:0x%lx",
+		  __entry->n, __entry->a0, __entry->a1, __entry->a2, __entry->a3)
+);
+
+
 /*
  * Tracepoint for hypercall.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..6d1972d6c988 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14036,6 +14036,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmx_debug);

 static int __init kvm_x86_init(void)
 {
--
2.27.0

Patch 02:

From a3bdce4f2f93810372f6068396776ac702946d16 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Wed, 3 Jul 2024 14:08:02 +0800
Subject: [PATCH 2/2] [DEBUG] preempt timer debug test

A specific kselftesting based program to allow set the VMX
preempt timer value from VM directly.

Introduce 2 hypercall 0x56780001/2, 01 to set the preempt
timer value, 02 to wait for the preemption time expired.

Usage:
Reload kvm applied this change, then:
$KRNEL_SRC_ROOT/tools/testing/selftests/kvm/x86_64/preempt_test -p 'preempt_timer_vale'

'preempt_timer_vale' is the preempt timer value in DEC format, HEX is not supported.

For example:

perf record -e "kvm:*" tools/testing/selftests/kvm/x86_64/preempt_test -p 2281718445

Above set the preempt value to 2281718445(0x880042AD) and
capture the trace, then check the kvm_vmx_debug in the trace
to know the preempt timer behavior.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 arch/x86/kvm/vmx/vmx.h                        |   5 +
 arch/x86/kvm/vmx/vmx.c                        | 113 +++++++++++++++++-
 .../selftests/kvm/x86_64/preempt_test.c       |  82 +++++++++++++
 4 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/preempt_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ad8b5d15f2bd..957509957f80 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -129,6 +129,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/amx_test
 TEST_GEN_PROGS_x86_64 += x86_64/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86_64 += x86_64/triple_fault_event_test
 TEST_GEN_PROGS_x86_64 += x86_64/recalc_apic_map_test
+TEST_GEN_PROGS_x86_64 += x86_64/preempt_test
 TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..82ea0ccc7a63 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -368,6 +368,11 @@ struct vcpu_vmx {

 	/* ve_info must be page aligned. */
 	struct vmx_ve_information *ve_info;
+
+	volatile bool debug_timer;
+	bool debug_timer_set_to_hardware;
+	u32 debug_timer_val;
+	u64 debug_timer_deadline_tsc;
 };

 struct kvm_vmx {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..73f084c29f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,9 @@ static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
+	pr_info("Set VM_EXIT_SAVE_VMX_PREEMPTION_TIMER forcedly for preempt timer debug\n");
+

 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5993,11 +5994,41 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static fastpath_t handle_fastpath_debug_timer(struct kvm_vcpu *vcpu,
+					      bool force_immediate_exit)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 tscl;
+	u32 delta;
+
+	tscl = rdtsc();
+
+	if (vmx->debug_timer_deadline_tsc > tscl)
+		delta = (u32)((vmx->debug_timer_deadline_tsc - tscl) >>
+			      cpu_preemption_timer_multi);
+	else
+		delta = 0;
+
+	trace_kvm_vmx_debug(2UL,
+			    (unsigned long)vmcs_read32(VM_EXIT_REASON),
+			    (unsigned long)vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+			    (unsigned long)delta, tscl);
+
+	vmx->debug_timer = false;
+
+	return EXIT_FASTPATH_REENTER_GUEST;
+}
+
 static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);

+	WARN_ON(vmx->debug_timer && force_immediate_exit);
+	if (vmx->debug_timer)
+		return handle_fastpath_debug_timer(vcpu,
+						   force_immediate_exit);
+
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
 	 * due to the timer expiring while it was "soft" disabled, just eat the
@@ -6096,6 +6127,60 @@ static int handle_notify(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static unsigned long vmx_debug_set_preempt_timer(struct kvm_vcpu *vcpu,
+						 unsigned long a0,
+						 unsigned long a1,
+						 unsigned long a2,
+						 unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->debug_timer = true;
+	vmx->debug_timer_set_to_hardware = false;
+	vmx->debug_timer_val = a0;
+	vmx->debug_timer_deadline_tsc = rdtsc() + (a0 << cpu_preemption_timer_multi);
+	pr_info("debug_timer = %u\n", (u32)a0);
+
+	return 0;
+}
+
+
+static unsigned long vmx_debug_get_preempt_timer_result(struct kvm_vcpu *vcpu,
+							unsigned long a0,
+							unsigned long a1,
+							unsigned long a2,
+							unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vmx->debug_timer)
+		return 1;
+	return 0;
+}
+
+static int vmx_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3;
+	unsigned long ret;
+
+	nr = kvm_rax_read(vcpu);
+	if (nr != 0x87650001 && nr != 0x87650002)
+		return kvm_emulate_hypercall(vcpu);
+
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+
+	if (nr == 0x87650001)
+		ret = vmx_debug_set_preempt_timer(vcpu, a0, a1, a2, a3);
+	else
+		ret = vmx_debug_get_preempt_timer_result(vcpu, a0, a1, a2, a3);
+
+	kvm_rax_write(vcpu, ret);
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6117,7 +6202,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVD]		      = kvm_emulate_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
-	[EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
+	[EXIT_REASON_VMCALL]                  = vmx_emulate_hypercall,
 	[EXIT_REASON_VMCLEAR]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMLAUNCH]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMPTRLD]		      = handle_vmx_instruction,
@@ -7199,6 +7284,28 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	if (force_immediate_exit) {
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (vmx->debug_timer) {
+		u32 old;
+
+		tscl = rdtsc();
+
+		if (!vmx->debug_timer_set_to_hardware) {
+			delta_tsc = vmx->debug_timer_val;
+			vmx->debug_timer_set_to_hardware = true;
+		} else {
+			if (vmx->debug_timer_deadline_tsc > tscl)
+				delta_tsc = (u32)((vmx->debug_timer_deadline_tsc - tscl)
+						  >> cpu_preemption_timer_multi);
+			else
+				delta_tsc = 0;
+		}
+
+		old = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+		trace_kvm_vmx_debug(3UL, old,
+				    vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+				    delta_tsc, tscl);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
diff --git a/tools/testing/selftests/kvm/x86_64/preempt_test.c b/tools/testing/selftests/kvm/x86_64/preempt_test.c
new file mode 100644
index 000000000000..2e58cfee61d0
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/preempt_test.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Debug the preemption timer behavior
+ */
+
+#include "test_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+uint32_t preempt_timer_val = 0x1000000;
+static void guest_code(uint64_t apic_hz, uint64_t delay_ms)
+{
+	volatile unsigned long r;
+
+	kvm_hypercall(0x87650001, preempt_timer_val, 0, 0, 0);
+	do {
+		udelay(100);
+		r = kvm_hypercall(0x87650002, 0, 0, 0, 0);
+	} while(r != 0);
+
+	GUEST_DONE();
+}
+
+static void do_test(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+static void run_test(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, preempt_timer_val);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	do_test(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "p:h")) != -1) {
+		switch (opt) {
+		case 'p':
+			preempt_timer_val = atoi(optarg);
+			break;
+		default:
+			exit(KSFT_SKIP);
+		}
+	}
+
+	printf("preempt timer value:%u(0x%x)\n",
+	       preempt_timer_val, preempt_timer_val);
+
+	run_test();
+}
--
2.27.0
Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
Posted by Reinette Chatre 1 year, 5 months ago
(a short update ...)

On 7/3/24 1:14 PM, Reinette Chatre wrote:
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
>> Forking this off to try and avoid confusion...
>>
>> On Wed, Jun 12, 2024, Reinette Chatre wrote:
> 
> ...
> 
>>> +
>>> +        freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>>> +        /* Check if measured frequency is within 1% of configured frequency. */
>>> +        GUEST_ASSERT(freq < apic_hz * 101 / 100);
>>> +        GUEST_ASSERT(freq > apic_hz * 99 / 100);
>>> +    }
>>
>> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
>> bug.  It looks like something APICv related is clobbering internal VMX timer state?
>> Or maybe there's a tearing or truncation issue?
> 
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.

I have now been able to test on SKX also and I am not yet able to reproduce. For
reference, the systems I tested on are:
SKX: https://ark.intel.com/content/www/us/en/ark/products/120507/intel-xeon-platinum-8170m-processor-35-75m-cache-2-10-ghz.html
ICX: https://ark.intel.com/content/www/us/en/ark/products/212459/intel-xeon-platinum-8360y-processor-54m-cache-2-40-ghz.html
CLX: https://ark.intel.com/content/www/us/en/ark/products/192476/intel-xeon-platinum-8260l-processor-35-75m-cache-2-40-ghz.html

Reinette