[PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes

Sebastian Ott posted 3 patches 6 months, 3 weeks ago
There is a newer version of this series
.../kvm/arm64/arch_timer_edge_cases.c         | 37 ++++++++++++-------
1 file changed, 24 insertions(+), 13 deletions(-)
[PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Posted by Sebastian Ott 6 months, 3 weeks ago
Some small fixes for arch_timer_edge_cases that I stumbled upon
while debugging failures for this selftest on ampere-one.

Changes since v1: modified patch 3 based on suggestions from Marc.

I've done some tests with this on various machines - seems to be all
good, however on ampere-one I now hit this in 10% of the runs:
==== Test Assertion Failure ====
  arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
  pid=166657 tid=166657 errno=4 - Interrupted system call
     1  0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
     2  0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
     3  0x0000ffffaedd625b: ?? ??:0
     4  0x0000ffffaedd633b: ?? ??:0
     5  0x00000000004020af: _start at ??:?
  timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)

This is not new, it was just hidden behind the other failure. I'll
try to figure out what this is about (seems to be independent of
the wait time)..


Sebastian Ott (3):
  KVM: arm64: selftests: fix help text for arch_timer_edge_cases
  KVM: arm64: selftests: fix thread migration in arch_timer_edge_cases
  KVM: arm64: selftests: arch_timer_edge_cases - determine effective counter width

 .../kvm/arm64/arch_timer_edge_cases.c         | 37 ++++++++++++-------
 1 file changed, 24 insertions(+), 13 deletions(-)


base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
-- 
2.49.0
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Posted by Zenghui Yu 6 months, 2 weeks ago
Hi Sebastian,

On 2025/5/27 22:24, Sebastian Ott wrote:
> Some small fixes for arch_timer_edge_cases that I stumbled upon
> while debugging failures for this selftest on ampere-one.
> 
> Changes since v1: modified patch 3 based on suggestions from Marc.
> 
> I've done some tests with this on various machines - seems to be all
> good, however on ampere-one I now hit this in 10% of the runs:
> ==== Test Assertion Failure ====
>   arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
>   pid=166657 tid=166657 errno=4 - Interrupted system call
>      1  0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
>      2  0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
>      3  0x0000ffffaedd625b: ?? ??:0
>      4  0x0000ffffaedd633b: ?? ??:0
>      5  0x00000000004020af: _start at ??:?
>   timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
> 
> This is not new, it was just hidden behind the other failure. I'll
> try to figure out what this is about (seems to be independent of
> the wait time)..

Not sure if you have figured it out. I can easily reproduce it on my box
and I *guess* it is that we have some random XVAL values when we enable
the timer..

test_reprogramming_timer()
{
	local_irq_disable();
	reset_timer_state(timer, DEF_CNT);

	/* Program the timer to DEF_CNT + delta_1_ms. */
	set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);

	[...]
}

set_tval_irq()
{
	timer_set_ctl(timer, ctl);

	// There is a window that we enable the timer with *random* XVAL
	// values and we may get the unexpected interrupt.. And it's
	// unlikely that KVM can be aware of TVAL's change (and
	// re-evaluate the interrupt's pending state) before hitting the
	// GUEST_ASSERT().

	timer_set_tval(timer, tval_cycles);
}

I'm not familiar with the test so I'm not 100% sure that this is the
root cause. But I hope this helps with your analysis ;-) .

Thanks,
Zenghui
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Posted by Sebastian Ott 6 months, 2 weeks ago
Hi Zenghui,

On Tue, 3 Jun 2025, Zenghui Yu wrote:
> On 2025/5/27 22:24, Sebastian Ott wrote:
>> Some small fixes for arch_timer_edge_cases that I stumbled upon
>> while debugging failures for this selftest on ampere-one.
>>
>> Changes since v1: modified patch 3 based on suggestions from Marc.
>>
>> I've done some tests with this on various machines - seems to be all
>> good, however on ampere-one I now hit this in 10% of the runs:
>> ==== Test Assertion Failure ====
>>   arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
>>   pid=166657 tid=166657 errno=4 - Interrupted system call
>>      1  0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
>>      2  0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
>>      3  0x0000ffffaedd625b: ?? ??:0
>>      4  0x0000ffffaedd633b: ?? ??:0
>>      5  0x00000000004020af: _start at ??:?
>>   timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
>>
>> This is not new, it was just hidden behind the other failure. I'll
>> try to figure out what this is about (seems to be independent of
>> the wait time)..
>
> Not sure if you have figured it out. I can easily reproduce it on my box
> and I *guess* it is that we have some random XVAL values when we enable
> the timer..

Yes, I think so, too.

> test_reprogramming_timer()
> {
> 	local_irq_disable();
> 	reset_timer_state(timer, DEF_CNT);

My first attempt was to also initialize cval here

>
> 	/* Program the timer to DEF_CNT + delta_1_ms. */
> 	set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
>
> 	[...]
> }
>
> set_tval_irq()
> {
> 	timer_set_ctl(timer, ctl);
>
> 	// There is a window that we enable the timer with *random* XVAL
> 	// values and we may get the unexpected interrupt.. And it's
> 	// unlikely that KVM can be aware of TVAL's change (and
> 	// re-evaluate the interrupt's pending state) before hitting the
> 	// GUEST_ASSERT().
>
> 	timer_set_tval(timer, tval_cycles);

Yes, I stumbled over this as well. I've always assumed that this order is
becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the 
value returned is UNKNOWN." However re-reading that part today I realized
that this only concerns register reads.

Maybe somone on cc knows why it's in that order?

I'm currently testing this with the above swapped and it's looking good,
so far.

> }
>
> I'm not familiar with the test so I'm not 100% sure that this is the
> root cause. But I hope this helps with your analysis ;-) .

It did, thanks!

Sebastian
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Posted by Marc Zyngier 6 months, 2 weeks ago
On Wed, 04 Jun 2025 21:58:48 +0100,
Sebastian Ott <sebott@redhat.com> wrote:
> 
> Hi Zenghui,
> 
> On Tue, 3 Jun 2025, Zenghui Yu wrote:
> > On 2025/5/27 22:24, Sebastian Ott wrote:
> >> Some small fixes for arch_timer_edge_cases that I stumbled upon
> >> while debugging failures for this selftest on ampere-one.
> >> 
> >> Changes since v1: modified patch 3 based on suggestions from Marc.
> >> 
> >> I've done some tests with this on various machines - seems to be all
> >> good, however on ampere-one I now hit this in 10% of the runs:
> >> ==== Test Assertion Failure ====
> >>   arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
> >>   pid=166657 tid=166657 errno=4 - Interrupted system call
> >>      1  0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
> >>      2  0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
> >>      3  0x0000ffffaedd625b: ?? ??:0
> >>      4  0x0000ffffaedd633b: ?? ??:0
> >>      5  0x00000000004020af: _start at ??:?
> >>   timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
> >> 
> >> This is not new, it was just hidden behind the other failure. I'll
> >> try to figure out what this is about (seems to be independent of
> >> the wait time)..
> > 
> > Not sure if you have figured it out. I can easily reproduce it on my box
> > and I *guess* it is that we have some random XVAL values when we enable
> > the timer..
> 
> Yes, I think so, too.
> 
> > test_reprogramming_timer()
> > {
> > 	local_irq_disable();
> > 	reset_timer_state(timer, DEF_CNT);
> 
> My first attempt was to also initialize cval here

Note that CVAL and TVAL are two views of the same thing. It should be
enough to initialise one of them (though TVAL is stupidly narrow).

> 
> > 
> > 	/* Program the timer to DEF_CNT + delta_1_ms. */
> > 	set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
> > 
> > 	[...]
> > }
> > 
> > set_tval_irq()
> > {
> > 	timer_set_ctl(timer, ctl);
> > 
> > 	// There is a window that we enable the timer with *random* XVAL
> > 	// values and we may get the unexpected interrupt.. And it's
> > 	// unlikely that KVM can be aware of TVAL's change (and
> > 	// re-evaluate the interrupt's pending state) before hitting the
> > 	// GUEST_ASSERT().
> > 
> > 	timer_set_tval(timer, tval_cycles);
> 
> Yes, I stumbled over this as well. I've always assumed that this order is
> becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0,
> the value returned is UNKNOWN." However re-reading that part today I
> realized
> that this only concerns register reads.
> 
> Maybe somone on cc knows why it's in that order?

I can't think of a valid reason. Enabling the timer without having set
the deadline first is just silly.

> I'm currently testing this with the above swapped and it's looking good,
> so far.

The swapped version (set xVAL, then enable the timer) is the only one
that makes any sense.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Posted by Sebastian Ott 6 months, 2 weeks ago
On Wed, 4 Jun 2025, Sebastian Ott wrote:
> On Tue, 3 Jun 2025, Zenghui Yu wrote:
>>  On 2025/5/27 22:24, Sebastian Ott wrote:
>>>  Some small fixes for arch_timer_edge_cases that I stumbled upon
>>>  while debugging failures for this selftest on ampere-one.
>>>
>>>  Changes since v1: modified patch 3 based on suggestions from Marc.
>>>
>>>  I've done some tests with this on various machines - seems to be all
>>>  good, however on ampere-one I now hit this in 10% of the runs:
>>>  ==== Test Assertion Failure ====
>>>    arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT +
>>>    (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
>>>    pid=166657 tid=166657 errno=4 - Interrupted system call
>>>       1  0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
>>>       2  0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
>>>       3  0x0000ffffaedd625b: ?? ??:0
>>>       4  0x0000ffffaedd633b: ?? ??:0
>>>       5  0x00000000004020af: _start at ??:?
>>>    timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
>>>
>>>  This is not new, it was just hidden behind the other failure. I'll
>>>  try to figure out what this is about (seems to be independent of
>>>  the wait time)..
>>
>>  Not sure if you have figured it out. I can easily reproduce it on my box
>>  and I *guess* it is that we have some random XVAL values when we enable
>>  the timer..
>
> Yes, I think so, too.
>
>>  test_reprogramming_timer()
>>  {
>>   local_irq_disable();
>>   reset_timer_state(timer, DEF_CNT);
>
> My first attempt was to also initialize cval here

Forgot to mention that I did this because my tests have shown
that the interrupt didn't only trigger early (like before the
reprogrammed delta) but instantly. This seemed to work but I think
the order in set_tval_irq() is the actual issue.


>
>>
>>   /* Program the timer to DEF_CNT + delta_1_ms. */
>>   set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
>>
>>  	[...]
>>  }
>>
>>  set_tval_irq()
>>  {
>>   timer_set_ctl(timer, ctl);
>>
>>   // There is a window that we enable the timer with *random* XVAL
>>   // values and we may get the unexpected interrupt.. And it's
>>   // unlikely that KVM can be aware of TVAL's change (and
>>   // re-evaluate the interrupt's pending state) before hitting the
>>   // GUEST_ASSERT().
>>
>>   timer_set_tval(timer, tval_cycles);
>
> Yes, I stumbled over this as well. I've always assumed that this order is
> becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the 
> value returned is UNKNOWN." However re-reading that part today I realized
> that this only concerns register reads.
>
> Maybe somone on cc knows why it's in that order?
>
> I'm currently testing this with the above swapped and it's looking good,
> so far.
>
>>  }
>>
>>  I'm not familiar with the test so I'm not 100% sure that this is the
>>  root cause. But I hope this helps with your analysis ;-) .
>
> It did, thanks!
>
> Sebastian
>