x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

[PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
exposed a problem with the marking of the respective vector as
pending: For quite some time Linux has been checking whether any stale
ISR or IRR bits would still be set while preparing the LAPIC for use.
This check is now triggering on the upcall vector, as the registration,
at least for APs, happens before the LAPIC is actually enabled.

In software-disabled state an LAPIC would not accept any interrupt
requests and hence no IRR bit would newly become set while in this
state. As a result it is also wrong for us to mark the upcall vector as
having a pending request when the vLAPIC is in this state.

To compensate for the "enabled" check added to the assertion logic, add
logic to (conditionally) mark the upcall vector as having a request
pending at the time the LAPIC is being software-enabled by the guest.

Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Don't one or both of the Viridian uses of vlapic_set_irq() need similar
guarding?

Is it actually necessary for hvmop_set_evtchn_upcall_vector() and
hvm_set_callback_via() to call hvm_assert_evtchn_irq() when
evtchn_upcall_pending is false?

--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -321,9 +321,10 @@ void hvm_assert_evtchn_irq(struct vcpu *
 
     if ( v->arch.hvm.evtchn_upcall_vector != 0 )
     {
-        uint8_t vector = v->arch.hvm.evtchn_upcall_vector;
+        struct vlapic *vlapic = vcpu_vlapic(v);
 
-        vlapic_set_irq(vcpu_vlapic(v), vector, 0);
+        if ( vlapic_enabled(vlapic) )
+           vlapic_set_irq(vlapic, v->arch.hvm.evtchn_upcall_vector, 0);
     }
     else if ( is_hvm_pv_evtchn_domain(v->domain) )
         vcpu_kick(v);
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -829,6 +829,9 @@ void vlapic_reg_write(struct vcpu *v, un
         {
             vlapic->hw.disabled &= ~VLAPIC_SW_DISABLED;
             pt_may_unmask_irq(vlapic_domain(vlapic), &vlapic->pt);
+            if ( v->arch.hvm.evtchn_upcall_vector &&
+                 vcpu_info(v, evtchn_upcall_pending) )
+                vlapic_set_irq(vlapic, v->arch.hvm.evtchn_upcall_vector, 0);
         }
         break;

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Fri, Nov 18, 2022 at 11:31:28AM +0100, Jan Beulich wrote:
> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> exposed a problem with the marking of the respective vector as
> pending: For quite some time Linux has been checking whether any stale
> ISR or IRR bits would still be set while preparing the LAPIC for use.
> This check is now triggering on the upcall vector, as the registration,
> at least for APs, happens before the LAPIC is actually enabled.
> 
> In software-disabled state an LAPIC would not accept any interrupt
> requests and hence no IRR bit would newly become set while in this
> state. As a result it is also wrong for us to mark the upcall vector as
> having a pending request when the vLAPIC is in this state.
> 
> To compensate for the "enabled" check added to the assertion logic, add
> logic to (conditionally) mark the upcall vector as having a request
> pending at the time the LAPIC is being software-enabled by the guest.
> 
> Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Don't one or both of the Viridian uses of vlapic_set_irq() need similar
> guarding?
> 
> Is it actually necessary for hvmop_set_evtchn_upcall_vector() and
> hvm_set_callback_via() to call hvm_assert_evtchn_irq() when
> evtchn_upcall_pending is false?
> 
> --- a/xen/arch/x86/hvm/irq.c
> +++ b/xen/arch/x86/hvm/irq.c
> @@ -321,9 +321,10 @@ void hvm_assert_evtchn_irq(struct vcpu *
>  
>      if ( v->arch.hvm.evtchn_upcall_vector != 0 )
>      {
> -        uint8_t vector = v->arch.hvm.evtchn_upcall_vector;
> +        struct vlapic *vlapic = vcpu_vlapic(v);
>  
> -        vlapic_set_irq(vcpu_vlapic(v), vector, 0);
> +        if ( vlapic_enabled(vlapic) )
> +           vlapic_set_irq(vlapic, v->arch.hvm.evtchn_upcall_vector, 0);

Shouldn't the vlapic_enabled() check itself be in vlapic_set_irq()? We
certainly don't want any vectors set until the vlapic is enabled, be
it event channel upcalls or any other sources.

Maybe best to add an ASSERT in vlapic_set_irq() to be sure the lapic is
enabled, as other callers already check this before trying to inject?

Also, and not strictly related to your change, isn't this possibly
racy, as by the time you evaluate the return of vlapic_enabled() it is
already stale, as there's no lock to protect it from changing?

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 18.11.2022 15:26, Roger Pau Monné wrote:
> On Fri, Nov 18, 2022 at 11:31:28AM +0100, Jan Beulich wrote:
>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>> exposed a problem with the marking of the respective vector as
>> pending: For quite some time Linux has been checking whether any stale
>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>> This check is now triggering on the upcall vector, as the registration,
>> at least for APs, happens before the LAPIC is actually enabled.
>>
>> In software-disabled state an LAPIC would not accept any interrupt
>> requests and hence no IRR bit would newly become set while in this
>> state. As a result it is also wrong for us to mark the upcall vector as
>> having a pending request when the vLAPIC is in this state.
>>
>> To compensate for the "enabled" check added to the assertion logic, add
>> logic to (conditionally) mark the upcall vector as having a request
>> pending at the time the LAPIC is being software-enabled by the guest.
>>
>> Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Don't one or both of the Viridian uses of vlapic_set_irq() need similar
>> guarding?
>>
>> Is it actually necessary for hvmop_set_evtchn_upcall_vector() and
>> hvm_set_callback_via() to call hvm_assert_evtchn_irq() when
>> evtchn_upcall_pending is false?
>>
>> --- a/xen/arch/x86/hvm/irq.c
>> +++ b/xen/arch/x86/hvm/irq.c
>> @@ -321,9 +321,10 @@ void hvm_assert_evtchn_irq(struct vcpu *
>>  
>>      if ( v->arch.hvm.evtchn_upcall_vector != 0 )
>>      {
>> -        uint8_t vector = v->arch.hvm.evtchn_upcall_vector;
>> +        struct vlapic *vlapic = vcpu_vlapic(v);
>>  
>> -        vlapic_set_irq(vcpu_vlapic(v), vector, 0);
>> +        if ( vlapic_enabled(vlapic) )
>> +           vlapic_set_irq(vlapic, v->arch.hvm.evtchn_upcall_vector, 0);
> 
> Shouldn't the vlapic_enabled() check itself be in vlapic_set_irq()? We
> certainly don't want any vectors set until the vlapic is enabled, be
> it event channel upcalls or any other sources.

In principle yes, and I did consider doing so, but for several callers
(potentially used frequently) this would be redundant with other
checking they do already (first and foremost callers using
vlapic_lowest_prio()). However, looking again I think vioapic_deliver()
and vmsi_deliver() violate this as well in their dest_Fixed handling.
(In both cases I'm actually inclined to also remove the odd *_inj_irq()
helper functions.)

> Maybe best to add an ASSERT in vlapic_set_irq() to be sure the lapic is
> enabled, as other callers already check this before trying to inject?

Perhaps, yes (once we've fixed paths where the check is presently
missing).

> Also, and not strictly related to your change, isn't this possibly
> racy, as by the time you evaluate the return of vlapic_enabled() it is
> already stale, as there's no lock to protect it from changing?

Wouldn't this simply match a signal arriving to a physical LAPIC just
the moment before it is enabled?

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 21.11.2022 09:33, Jan Beulich wrote:
> On 18.11.2022 15:26, Roger Pau Monné wrote:
>> Maybe best to add an ASSERT in vlapic_set_irq() to be sure the lapic is
>> enabled, as other callers already check this before trying to inject?
> 
> Perhaps, yes (once we've fixed paths where the check is presently
> missing).

Actually - no, such an ASSERT() would then be racy against the vLAPIC
being disabled elsewhere at this very moment. It would at best be valid
when done on the vCPU in question. The SDM also provides for this: "The
reception of any interrupt or transmission of any IPIs that are in
progress when the local APIC is disabled are completed before the local
APIC enters the software-disabled state." We don't follow this to the
letter, but you get the point.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Mon, Nov 21, 2022 at 09:33:53AM +0100, Jan Beulich wrote:
> On 18.11.2022 15:26, Roger Pau Monné wrote:
> > On Fri, Nov 18, 2022 at 11:31:28AM +0100, Jan Beulich wrote:
> >> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> >> exposed a problem with the marking of the respective vector as
> >> pending: For quite some time Linux has been checking whether any stale
> >> ISR or IRR bits would still be set while preparing the LAPIC for use.
> >> This check is now triggering on the upcall vector, as the registration,
> >> at least for APs, happens before the LAPIC is actually enabled.
> >>
> >> In software-disabled state an LAPIC would not accept any interrupt
> >> requests and hence no IRR bit would newly become set while in this
> >> state. As a result it is also wrong for us to mark the upcall vector as
> >> having a pending request when the vLAPIC is in this state.
> >>
> >> To compensate for the "enabled" check added to the assertion logic, add
> >> logic to (conditionally) mark the upcall vector as having a request
> >> pending at the time the LAPIC is being software-enabled by the guest.
> >>
> >> Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> Don't one or both of the Viridian uses of vlapic_set_irq() need similar
> >> guarding?
> >>
> >> Is it actually necessary for hvmop_set_evtchn_upcall_vector() and
> >> hvm_set_callback_via() to call hvm_assert_evtchn_irq() when
> >> evtchn_upcall_pending is false?
> >>
> >> --- a/xen/arch/x86/hvm/irq.c
> >> +++ b/xen/arch/x86/hvm/irq.c
> >> @@ -321,9 +321,10 @@ void hvm_assert_evtchn_irq(struct vcpu *
> >>  
> >>      if ( v->arch.hvm.evtchn_upcall_vector != 0 )
> >>      {
> >> -        uint8_t vector = v->arch.hvm.evtchn_upcall_vector;
> >> +        struct vlapic *vlapic = vcpu_vlapic(v);
> >>  
> >> -        vlapic_set_irq(vcpu_vlapic(v), vector, 0);
> >> +        if ( vlapic_enabled(vlapic) )
> >> +           vlapic_set_irq(vlapic, v->arch.hvm.evtchn_upcall_vector, 0);
> > 
> > Shouldn't the vlapic_enabled() check itself be in vlapic_set_irq()? We
> > certainly don't want any vectors set until the vlapic is enabled, be
> > it event channel upcalls or any other sources.
> 
> In principle yes, and I did consider doing so, but for several callers
> (potentially used frequently) this would be redundant with other
> checking they do already (first and foremost callers using
> vlapic_lowest_prio()). However, looking again I think vioapic_deliver()
> and vmsi_deliver() violate this as well in their dest_Fixed handling.
> (In both cases I'm actually inclined to also remove the odd *_inj_irq()
> helper functions.)
> 
> > Maybe best to add an ASSERT in vlapic_set_irq() to be sure the lapic is
> > enabled, as other callers already check this before trying to inject?
> 
> Perhaps, yes (once we've fixed paths where the check is presently
> missing).

Another option would be to unconditionally return 0 for IRR and ISR
reads when the LAPIC is disabled, that would avoid having to force the
event channel injection when the LAPIC is enabled, but there could be
more than just the event channel vector queued in that way which would
be against the spec.

> > Also, and not strictly related to your change, isn't this possibly
> > racy, as by the time you evaluate the return of vlapic_enabled() it is
> > already stale, as there's no lock to protect it from changing?
> 
> Wouldn't this simply match a signal arriving to a physical LAPIC just
> the moment before it is enabled?

Hm, yes, any guest trying to play this kind of games with the APIC is
free to keep the pieces.

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 21.11.2022 11:53, Roger Pau Monné wrote:
> On Mon, Nov 21, 2022 at 09:33:53AM +0100, Jan Beulich wrote:
>> On 18.11.2022 15:26, Roger Pau Monné wrote:
>>> Maybe best to add an ASSERT in vlapic_set_irq() to be sure the lapic is
>>> enabled, as other callers already check this before trying to inject?
>>
>> Perhaps, yes (once we've fixed paths where the check is presently
>> missing).
> 
> Another option would be to unconditionally return 0 for IRR and ISR
> reads when the LAPIC is disabled, that would avoid having to force the
> event channel injection when the LAPIC is enabled, but there could be
> more than just the event channel vector queued in that way which would
> be against the spec.

Pre-existing set IRR bits remain set when moving into disabled mode. If
we faked zeros for reads, we'd violate the spec and undermine Linux'es
checking of the bits.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Andrew Cooper 1 year, 8 months ago

On 18/11/2022 10:31, Jan Beulich wrote:
> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> exposed a problem with the marking of the respective vector as
> pending: For quite some time Linux has been checking whether any stale
> ISR or IRR bits would still be set while preparing the LAPIC for use.
> This check is now triggering on the upcall vector, as the registration,
> at least for APs, happens before the LAPIC is actually enabled.
>
> In software-disabled state an LAPIC would not accept any interrupt
> requests and hence no IRR bit would newly become set while in this
> state. As a result it is also wrong for us to mark the upcall vector as
> having a pending request when the vLAPIC is in this state.

I agree with this.

> To compensate for the "enabled" check added to the assertion logic, add
> logic to (conditionally) mark the upcall vector as having a request
> pending at the time the LAPIC is being software-enabled by the guest.

But this, I don't think is appropriate.

The point of raising on enable is allegedly to work around setup race
conditions.  I'm unconvinced by this reasoning, but it is what it is,
and the stated behaviour is to raise there and then.

If a guest enables evtchn while the LAPIC is disabled, then the
interrupt is lost.  Like every other interrupt in an x86 system.

I don't think there is any credible way a guest kernel author can expect
the weird evtchn edgecase to wait for an arbitrary point in the future,
and it's a corner case that I think is worth not keeping.

~Andrew

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 18.11.2022 13:33, Andrew Cooper wrote:
> On 18/11/2022 10:31, Jan Beulich wrote:
>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>> exposed a problem with the marking of the respective vector as
>> pending: For quite some time Linux has been checking whether any stale
>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>> This check is now triggering on the upcall vector, as the registration,
>> at least for APs, happens before the LAPIC is actually enabled.
>>
>> In software-disabled state an LAPIC would not accept any interrupt
>> requests and hence no IRR bit would newly become set while in this
>> state. As a result it is also wrong for us to mark the upcall vector as
>> having a pending request when the vLAPIC is in this state.
> 
> I agree with this.
> 
>> To compensate for the "enabled" check added to the assertion logic, add
>> logic to (conditionally) mark the upcall vector as having a request
>> pending at the time the LAPIC is being software-enabled by the guest.
> 
> But this, I don't think is appropriate.
> 
> The point of raising on enable is allegedly to work around setup race
> conditions.  I'm unconvinced by this reasoning, but it is what it is,
> and the stated behaviour is to raise there and then.
> 
> If a guest enables evtchn while the LAPIC is disabled, then the
> interrupt is lost.  Like every other interrupt in an x86 system.

Edge triggered ones you mean, I suppose, but yes.

> I don't think there is any credible way a guest kernel author can expect
> the weird evtchn edgecase to wait for an arbitrary point in the future,
> and it's a corner case that I think is worth not keeping.

Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
after setting upcall vector"), referenced by the Fixes: tag? The issue is
that with evtchn_upcall_pending once set, there would never again be a
notification. So if what you say is to be the model we follow, then that
earlier change was perhaps wrong as well. Instead it should then have
been a guest change (as also implicit from your reply) to clear
evtchn_upcall_pending after vCPU info registration (there) or LAPIC
enabling (here), perhaps by way of "manually" invoking the handling of
that pending event, or by issuing a self-IPI with that vector.
Especially the LAPIC enabling case would then be yet another Xen-specific
on a guest code path which better wouldn't have to be aware of Xen.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Andrew Cooper 1 year, 8 months ago

On 18/11/2022 12:54, Jan Beulich wrote:
> On 18.11.2022 13:33, Andrew Cooper wrote:
>> On 18/11/2022 10:31, Jan Beulich wrote:
>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>> exposed a problem with the marking of the respective vector as
>>> pending: For quite some time Linux has been checking whether any stale
>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>> This check is now triggering on the upcall vector, as the registration,
>>> at least for APs, happens before the LAPIC is actually enabled.
>>>
>>> In software-disabled state an LAPIC would not accept any interrupt
>>> requests and hence no IRR bit would newly become set while in this
>>> state. As a result it is also wrong for us to mark the upcall vector as
>>> having a pending request when the vLAPIC is in this state.
>> I agree with this.
>>
>>> To compensate for the "enabled" check added to the assertion logic, add
>>> logic to (conditionally) mark the upcall vector as having a request
>>> pending at the time the LAPIC is being software-enabled by the guest.
>> But this, I don't think is appropriate.
>>
>> The point of raising on enable is allegedly to work around setup race
>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>> and the stated behaviour is to raise there and then.
>>
>> If a guest enables evtchn while the LAPIC is disabled, then the
>> interrupt is lost.  Like every other interrupt in an x86 system.
> Edge triggered ones you mean, I suppose, but yes.

For IO-APIC systems, you mostly lose line interrupts too, don't you?

The line will remain pending at the IO-APIC, but nothing in the system
will unwedge until someone polls the IO-APIC.

Either way...

>
>> I don't think there is any credible way a guest kernel author can expect
>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>> and it's a corner case that I think is worth not keeping.
> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
> after setting upcall vector"), referenced by the Fixes: tag? The issue is
> that with evtchn_upcall_pending once set, there would never again be a
> notification.

Ok, so we do need to do something.

>  So if what you say is to be the model we follow, then that
> earlier change was perhaps wrong as well. Instead it should then have
> been a guest change (as also implicit from your reply) to clear
> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
> enabling (here), perhaps by way of "manually" invoking the handling of
> that pending event, or by issuing a self-IPI with that vector.
> Especially the LAPIC enabling case would then be yet another Xen-specific
> on a guest code path which better wouldn't have to be aware of Xen. 

Without trying to prescribe how to fix this specific issue, wherever
possible we should be trying to limit the Xen-isms from non-Xen areas. 
There's a whole lot of poorly described and surprising behaviours which
have not stood the test of time.

In this case, it seems that we have yet another x86 PV-ism which hasn't
translated well x86 HVM.  Specifically, we're trying to overlay an
entirely shared-memory (and delayed return-to-guest) interrupt
controller onto one which is properly constructed to handle events in
realtime.

I even got as far as writing that maybe leaving it as-is was the best
option (principle of least surprise for Xen developers), but our
"friend" apic acceleration strikes again.

Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
because microcode may have accelerated it.

A consequence of this observation is that Xen cannot have
non-LAPIC-archtiectural behaviour in the vlapic emulation.  So I think
we need to find a solution to this problem that doesn't hook APIC_SPIV.

~Andrew

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 18.11.2022 15:27, Andrew Cooper wrote:
> On 18/11/2022 12:54, Jan Beulich wrote:
>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>> exposed a problem with the marking of the respective vector as
>>>> pending: For quite some time Linux has been checking whether any stale
>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>> This check is now triggering on the upcall vector, as the registration,
>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>
>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>> requests and hence no IRR bit would newly become set while in this
>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>> having a pending request when the vLAPIC is in this state.
>>> I agree with this.
>>>
>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>> logic to (conditionally) mark the upcall vector as having a request
>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>> But this, I don't think is appropriate.
>>>
>>> The point of raising on enable is allegedly to work around setup race
>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>> and the stated behaviour is to raise there and then.
>>>
>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>> interrupt is lost.  Like every other interrupt in an x86 system.
>> Edge triggered ones you mean, I suppose, but yes.
> 
> For IO-APIC systems, you mostly lose line interrupts too, don't you?
> 
> The line will remain pending at the IO-APIC, but nothing in the system
> will unwedge until someone polls the IO-APIC.
> 
> Either way...
> 
>>
>>> I don't think there is any credible way a guest kernel author can expect
>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>> and it's a corner case that I think is worth not keeping.
>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>> that with evtchn_upcall_pending once set, there would never again be a
>> notification.
> 
> Ok, so we do need to do something.
> 
>>  So if what you say is to be the model we follow, then that
>> earlier change was perhaps wrong as well. Instead it should then have
>> been a guest change (as also implicit from your reply) to clear
>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>> enabling (here), perhaps by way of "manually" invoking the handling of
>> that pending event, or by issuing a self-IPI with that vector.
>> Especially the LAPIC enabling case would then be yet another Xen-specific
>> on a guest code path which better wouldn't have to be aware of Xen. 
> 
> Without trying to prescribe how to fix this specific issue, wherever
> possible we should be trying to limit the Xen-isms from non-Xen areas. 
> There's a whole lot of poorly described and surprising behaviours which
> have not stood the test of time.
> 
> In this case, it seems that we have yet another x86 PV-ism which hasn't
> translated well x86 HVM.  Specifically, we're trying to overlay an
> entirely shared-memory (and delayed return-to-guest) interrupt
> controller onto one which is properly constructed to handle events in
> realtime.
> 
> 
> I even got as far as writing that maybe leaving it as-is was the best
> option (principle of least surprise for Xen developers), but our
> "friend" apic acceleration strikes again.
> 
> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
> because microcode may have accelerated it.

But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
VM exit. If we didn't, how would our internal accounting of APIC enabled
state (VLAPIC_SW_DISABLED) work? And the neighboring (to where I'm adding
the new code) pt_may_unmask_irq() call then also wouldn't occur.

I'm actually pretty sure we do too much in this case - in particular none
of the vlapic_set_reg() should be necessary. But we certainly can't get
away with doing nothing, and hence we depend on that VM exit to actually
occur. Plus simply making the vlapic_set_reg() conditional also likely
wouldn't do any good, so if anything we may want to split
vlapic_reg_write() and invoke only the "2nd half" from
vlapic_apicv_write().

Jan

> A consequence of this observation is that Xen cannot have
> non-LAPIC-archtiectural behaviour in the vlapic emulation.  So I think
> we need to find a solution to this problem that doesn't hook APIC_SPIV.
> 
> ~Andrew

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Andrew Cooper 1 year, 8 months ago

On 21/11/2022 08:56, Jan Beulich wrote:
> On 18.11.2022 15:27, Andrew Cooper wrote:
>> On 18/11/2022 12:54, Jan Beulich wrote:
>>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>>> exposed a problem with the marking of the respective vector as
>>>>> pending: For quite some time Linux has been checking whether any stale
>>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>>> This check is now triggering on the upcall vector, as the registration,
>>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>>
>>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>>> requests and hence no IRR bit would newly become set while in this
>>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>>> having a pending request when the vLAPIC is in this state.
>>>> I agree with this.
>>>>
>>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>>> logic to (conditionally) mark the upcall vector as having a request
>>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>>> But this, I don't think is appropriate.
>>>>
>>>> The point of raising on enable is allegedly to work around setup race
>>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>>> and the stated behaviour is to raise there and then.
>>>>
>>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>>> interrupt is lost.  Like every other interrupt in an x86 system.
>>> Edge triggered ones you mean, I suppose, but yes.
>> For IO-APIC systems, you mostly lose line interrupts too, don't you?
>>
>> The line will remain pending at the IO-APIC, but nothing in the system
>> will unwedge until someone polls the IO-APIC.
>>
>> Either way...
>>
>>>> I don't think there is any credible way a guest kernel author can expect
>>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>>> and it's a corner case that I think is worth not keeping.
>>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>>> that with evtchn_upcall_pending once set, there would never again be a
>>> notification.
>> Ok, so we do need to do something.
>>
>>>  So if what you say is to be the model we follow, then that
>>> earlier change was perhaps wrong as well. Instead it should then have
>>> been a guest change (as also implicit from your reply) to clear
>>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>>> enabling (here), perhaps by way of "manually" invoking the handling of
>>> that pending event, or by issuing a self-IPI with that vector.
>>> Especially the LAPIC enabling case would then be yet another Xen-specific
>>> on a guest code path which better wouldn't have to be aware of Xen. 
>> Without trying to prescribe how to fix this specific issue, wherever
>> possible we should be trying to limit the Xen-isms from non-Xen areas. 
>> There's a whole lot of poorly described and surprising behaviours which
>> have not stood the test of time.
>>
>> In this case, it seems that we have yet another x86 PV-ism which hasn't
>> translated well x86 HVM.  Specifically, we're trying to overlay an
>> entirely shared-memory (and delayed return-to-guest) interrupt
>> controller onto one which is properly constructed to handle events in
>> realtime.
>>
>>
>> I even got as far as writing that maybe leaving it as-is was the best
>> option (principle of least surprise for Xen developers), but our
>> "friend" apic acceleration strikes again.
>>
>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
>> because microcode may have accelerated it.
> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
> VM exit.

Intel isn't the only accelerated implementation, and there future
details not in the public docs.

There will be an implementation we will want to support where Xen
doesn't get a vmexit for a write to SPIV.

> If we didn't, how would our internal accounting of APIC enabled
> state (VLAPIC_SW_DISABLED) work?

It doesn't.

One of many problems on the "known errors" list from an incomplete
original attempt to get acceleration working.

There's no good reason to cache those disables in the first place (both
are both trivially available from other positions in memory), and
correctness reasons not to.

>  And the neighboring (to where I'm adding
> the new code) pt_may_unmask_irq() call then also wouldn't occur.
>
> I'm actually pretty sure we do too much in this case - in particular none
> of the vlapic_set_reg() should be necessary. But we certainly can't get
> away with doing nothing, and hence we depend on that VM exit to actually
> occur.

We must do exactly and only what real hardware does, so that the state
changes emulated by Xen are identical to those accelerated by microcode.

Other than that, I really wouldn't make any presumptions about the
existing vLAPIC logic being correct.

It is, at best, enough of an approximation to the spec for major OSes to
function, with multiple known bugs and no coherent testing.

~Andrew

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 7 months ago

On 21.11.2022 13:23, Andrew Cooper wrote:
> On 21/11/2022 08:56, Jan Beulich wrote:
>> On 18.11.2022 15:27, Andrew Cooper wrote:
>>> On 18/11/2022 12:54, Jan Beulich wrote:
>>>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>>>> exposed a problem with the marking of the respective vector as
>>>>>> pending: For quite some time Linux has been checking whether any stale
>>>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>>>> This check is now triggering on the upcall vector, as the registration,
>>>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>>>
>>>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>>>> requests and hence no IRR bit would newly become set while in this
>>>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>>>> having a pending request when the vLAPIC is in this state.
>>>>> I agree with this.
>>>>>
>>>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>>>> logic to (conditionally) mark the upcall vector as having a request
>>>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>>>> But this, I don't think is appropriate.
>>>>>
>>>>> The point of raising on enable is allegedly to work around setup race
>>>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>>>> and the stated behaviour is to raise there and then.
>>>>>
>>>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>>>> interrupt is lost.  Like every other interrupt in an x86 system.
>>>> Edge triggered ones you mean, I suppose, but yes.
>>> For IO-APIC systems, you mostly lose line interrupts too, don't you?
>>>
>>> The line will remain pending at the IO-APIC, but nothing in the system
>>> will unwedge until someone polls the IO-APIC.
>>>
>>> Either way...
>>>
>>>>> I don't think there is any credible way a guest kernel author can expect
>>>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>>>> and it's a corner case that I think is worth not keeping.
>>>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>>>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>>>> that with evtchn_upcall_pending once set, there would never again be a
>>>> notification.
>>> Ok, so we do need to do something.
>>>
>>>>  So if what you say is to be the model we follow, then that
>>>> earlier change was perhaps wrong as well. Instead it should then have
>>>> been a guest change (as also implicit from your reply) to clear
>>>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>>>> enabling (here), perhaps by way of "manually" invoking the handling of
>>>> that pending event, or by issuing a self-IPI with that vector.
>>>> Especially the LAPIC enabling case would then be yet another Xen-specific
>>>> on a guest code path which better wouldn't have to be aware of Xen. 
>>> Without trying to prescribe how to fix this specific issue, wherever
>>> possible we should be trying to limit the Xen-isms from non-Xen areas. 
>>> There's a whole lot of poorly described and surprising behaviours which
>>> have not stood the test of time.
>>>
>>> In this case, it seems that we have yet another x86 PV-ism which hasn't
>>> translated well x86 HVM.  Specifically, we're trying to overlay an
>>> entirely shared-memory (and delayed return-to-guest) interrupt
>>> controller onto one which is properly constructed to handle events in
>>> realtime.
>>>
>>>
>>> I even got as far as writing that maybe leaving it as-is was the best
>>> option (principle of least surprise for Xen developers), but our
>>> "friend" apic acceleration strikes again.
>>>
>>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
>>> because microcode may have accelerated it.
>> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
>> VM exit.
> 
> Intel isn't the only accelerated implementation, and there future
> details not in the public docs.
> 
> There will be an implementation we will want to support where Xen
> doesn't get a vmexit for a write to SPIV.

To cover for that, as just discussed, I've added

"Note however that, like for the pt_may_unmask_irq() we already have
 there, long term we may need to find a different solution. This will be
 especially relevant in case yet better LAPIC acceleration would
 eliminate notifications of guest writes to this and other registers."

to the last paragraph of the commit message.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 21.11.2022 13:23, Andrew Cooper wrote:
> On 21/11/2022 08:56, Jan Beulich wrote:
>> On 18.11.2022 15:27, Andrew Cooper wrote:
>>> On 18/11/2022 12:54, Jan Beulich wrote:
>>>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>>>> exposed a problem with the marking of the respective vector as
>>>>>> pending: For quite some time Linux has been checking whether any stale
>>>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>>>> This check is now triggering on the upcall vector, as the registration,
>>>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>>>
>>>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>>>> requests and hence no IRR bit would newly become set while in this
>>>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>>>> having a pending request when the vLAPIC is in this state.
>>>>> I agree with this.
>>>>>
>>>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>>>> logic to (conditionally) mark the upcall vector as having a request
>>>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>>>> But this, I don't think is appropriate.
>>>>>
>>>>> The point of raising on enable is allegedly to work around setup race
>>>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>>>> and the stated behaviour is to raise there and then.
>>>>>
>>>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>>>> interrupt is lost.  Like every other interrupt in an x86 system.
>>>> Edge triggered ones you mean, I suppose, but yes.
>>> For IO-APIC systems, you mostly lose line interrupts too, don't you?
>>>
>>> The line will remain pending at the IO-APIC, but nothing in the system
>>> will unwedge until someone polls the IO-APIC.
>>>
>>> Either way...
>>>
>>>>> I don't think there is any credible way a guest kernel author can expect
>>>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>>>> and it's a corner case that I think is worth not keeping.
>>>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>>>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>>>> that with evtchn_upcall_pending once set, there would never again be a
>>>> notification.
>>> Ok, so we do need to do something.
>>>
>>>>  So if what you say is to be the model we follow, then that
>>>> earlier change was perhaps wrong as well. Instead it should then have
>>>> been a guest change (as also implicit from your reply) to clear
>>>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>>>> enabling (here), perhaps by way of "manually" invoking the handling of
>>>> that pending event, or by issuing a self-IPI with that vector.
>>>> Especially the LAPIC enabling case would then be yet another Xen-specific
>>>> on a guest code path which better wouldn't have to be aware of Xen. 
>>> Without trying to prescribe how to fix this specific issue, wherever
>>> possible we should be trying to limit the Xen-isms from non-Xen areas. 
>>> There's a whole lot of poorly described and surprising behaviours which
>>> have not stood the test of time.
>>>
>>> In this case, it seems that we have yet another x86 PV-ism which hasn't
>>> translated well x86 HVM.  Specifically, we're trying to overlay an
>>> entirely shared-memory (and delayed return-to-guest) interrupt
>>> controller onto one which is properly constructed to handle events in
>>> realtime.
>>>
>>>
>>> I even got as far as writing that maybe leaving it as-is was the best
>>> option (principle of least surprise for Xen developers), but our
>>> "friend" apic acceleration strikes again.
>>>
>>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
>>> because microcode may have accelerated it.
>> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
>> VM exit.
> 
> Intel isn't the only accelerated implementation, and there future
> details not in the public docs.
> 
> There will be an implementation we will want to support where Xen
> doesn't get a vmexit for a write to SPIV.

I see.

>> If we didn't, how would our internal accounting of APIC enabled
>> state (VLAPIC_SW_DISABLED) work?
> 
> It doesn't.
> 
> One of many problems on the "known errors" list from an incomplete
> original attempt to get acceleration working.
> 
> There's no good reason to cache those disables in the first place (both
> are both trivially available from other positions in memory), and
> correctness reasons not to.
> 
>>  And the neighboring (to where I'm adding
>> the new code) pt_may_unmask_irq() call then also wouldn't occur.
>>
>> I'm actually pretty sure we do too much in this case - in particular none
>> of the vlapic_set_reg() should be necessary. But we certainly can't get
>> away with doing nothing, and hence we depend on that VM exit to actually
>> occur.
> 
> We must do exactly and only what real hardware does, so that the state
> changes emulated by Xen are identical to those accelerated by microcode.
> 
> Other than that, I really wouldn't make any presumptions about the
> existing vLAPIC logic being correct.
> 
> It is, at best, enough of an approximation to the spec for major OSes to
> function, with multiple known bugs and no coherent testing.

But can we leave resolving of the wider issue then separate, and leave
the change here as it presently is? Yes, mimic-ing the same behavior
later may be "interesting", but if we can't achieve consistent behavior
with yet more advanced acceleration, maybe we simply can't use that
(perhaps until a complete overhaul of everything involved in LAPIC
handling, possibly including a guest side indicator that they're happy
without the extra signaling, at which point that yet-more-advanced
acceleration could then be enabled for that guest).

Otherwise - do you have any suggestion as to alternative logic which I
might use in this patch?

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Mon, Nov 21, 2022 at 01:34:38PM +0100, Jan Beulich wrote:
> On 21.11.2022 13:23, Andrew Cooper wrote:
> > On 21/11/2022 08:56, Jan Beulich wrote:
> >> On 18.11.2022 15:27, Andrew Cooper wrote:
> >>> On 18/11/2022 12:54, Jan Beulich wrote:
> >>>> On 18.11.2022 13:33, Andrew Cooper wrote:
> >>>>> On 18/11/2022 10:31, Jan Beulich wrote:
> >>>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> >>>>>> exposed a problem with the marking of the respective vector as
> >>>>>> pending: For quite some time Linux has been checking whether any stale
> >>>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
> >>>>>> This check is now triggering on the upcall vector, as the registration,
> >>>>>> at least for APs, happens before the LAPIC is actually enabled.
> >>>>>>
> >>>>>> In software-disabled state an LAPIC would not accept any interrupt
> >>>>>> requests and hence no IRR bit would newly become set while in this
> >>>>>> state. As a result it is also wrong for us to mark the upcall vector as
> >>>>>> having a pending request when the vLAPIC is in this state.
> >>>>> I agree with this.
> >>>>>
> >>>>>> To compensate for the "enabled" check added to the assertion logic, add
> >>>>>> logic to (conditionally) mark the upcall vector as having a request
> >>>>>> pending at the time the LAPIC is being software-enabled by the guest.
> >>>>> But this, I don't think is appropriate.
> >>>>>
> >>>>> The point of raising on enable is allegedly to work around setup race
> >>>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
> >>>>> and the stated behaviour is to raise there and then.
> >>>>>
> >>>>> If a guest enables evtchn while the LAPIC is disabled, then the
> >>>>> interrupt is lost.  Like every other interrupt in an x86 system.
> >>>> Edge triggered ones you mean, I suppose, but yes.
> >>> For IO-APIC systems, you mostly lose line interrupts too, don't you?
> >>>
> >>> The line will remain pending at the IO-APIC, but nothing in the system
> >>> will unwedge until someone polls the IO-APIC.
> >>>
> >>> Either way...
> >>>
> >>>>> I don't think there is any credible way a guest kernel author can expect
> >>>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
> >>>>> and it's a corner case that I think is worth not keeping.
> >>>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
> >>>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
> >>>> that with evtchn_upcall_pending once set, there would never again be a
> >>>> notification.
> >>> Ok, so we do need to do something.
> >>>
> >>>>  So if what you say is to be the model we follow, then that
> >>>> earlier change was perhaps wrong as well. Instead it should then have
> >>>> been a guest change (as also implicit from your reply) to clear
> >>>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
> >>>> enabling (here), perhaps by way of "manually" invoking the handling of
> >>>> that pending event, or by issuing a self-IPI with that vector.
> >>>> Especially the LAPIC enabling case would then be yet another Xen-specific
> >>>> on a guest code path which better wouldn't have to be aware of Xen. 
> >>> Without trying to prescribe how to fix this specific issue, wherever
> >>> possible we should be trying to limit the Xen-isms from non-Xen areas. 
> >>> There's a whole lot of poorly described and surprising behaviours which
> >>> have not stood the test of time.
> >>>
> >>> In this case, it seems that we have yet another x86 PV-ism which hasn't
> >>> translated well x86 HVM.  Specifically, we're trying to overlay an
> >>> entirely shared-memory (and delayed return-to-guest) interrupt
> >>> controller onto one which is properly constructed to handle events in
> >>> realtime.
> >>>
> >>>
> >>> I even got as far as writing that maybe leaving it as-is was the best
> >>> option (principle of least surprise for Xen developers), but our
> >>> "friend" apic acceleration strikes again.
> >>>
> >>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
> >>> because microcode may have accelerated it.
> >> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
> >> VM exit.
> > 
> > Intel isn't the only accelerated implementation, and there future
> > details not in the public docs.
> > 
> > There will be an implementation we will want to support where Xen
> > doesn't get a vmexit for a write to SPIV.
> 
> I see.
> 
> >> If we didn't, how would our internal accounting of APIC enabled
> >> state (VLAPIC_SW_DISABLED) work?
> > 
> > It doesn't.
> > 
> > One of many problems on the "known errors" list from an incomplete
> > original attempt to get acceleration working.
> > 
> > There's no good reason to cache those disables in the first place (both
> > are both trivially available from other positions in memory), and
> > correctness reasons not to.
> > 
> >>  And the neighboring (to where I'm adding
> >> the new code) pt_may_unmask_irq() call then also wouldn't occur.
> >>
> >> I'm actually pretty sure we do too much in this case - in particular none
> >> of the vlapic_set_reg() should be necessary. But we certainly can't get
> >> away with doing nothing, and hence we depend on that VM exit to actually
> >> occur.
> > 
> > We must do exactly and only what real hardware does, so that the state
> > changes emulated by Xen are identical to those accelerated by microcode.
> > 
> > Other than that, I really wouldn't make any presumptions about the
> > existing vLAPIC logic being correct.
> > 
> > It is, at best, enough of an approximation to the spec for major OSes to
> > function, with multiple known bugs and no coherent testing.
> 
> But can we leave resolving of the wider issue then separate, and leave
> the change here as it presently is? Yes, mimic-ing the same behavior
> later may be "interesting", but if we can't achieve consistent behavior
> with yet more advanced acceleration, maybe we simply can't use that
> (perhaps until a complete overhaul of everything involved in LAPIC
> handling, possibly including a guest side indicator that they're happy
> without the extra signaling, at which point that yet-more-advanced
> acceleration could then be enabled for that guest).
> 
> Otherwise - do you have any suggestion as to alternative logic which I
> might use in this patch?

Maybe the underlying issue is that we allow executing
HVMOP_set_evtchn_upcall_vector against remote vCPU.  This could be
solved by only allowing HVMOP_set_evtchn_upcall_vector against the
current vCPU and with the LAPIC enabled, but I guess we are too late
for that.

Actually, what about only injecting the spurious event if the vCPU is
online, ie:

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index ae4368ec4b..4b84706579 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4105,7 +4105,8 @@ static int hvmop_set_evtchn_upcall_vector(
     printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, op.vector);
 
     v->arch.hvm.evtchn_upcall_vector = op.vector;
-    hvm_assert_evtchn_irq(v);
+    if ( is_vcpu_online(v) )
+        hvm_assert_evtchn_irq(v);
     return 0;
 }
 

I think that should fix the issue seen on Linux.  We would
additionally need to fix hvm_assert_evtchn_irq() to only set the
vector if the vLAPIC is enabled.

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 23.11.2022 13:03, Roger Pau Monné wrote:
> On Mon, Nov 21, 2022 at 01:34:38PM +0100, Jan Beulich wrote:
>> On 21.11.2022 13:23, Andrew Cooper wrote:
>>> On 21/11/2022 08:56, Jan Beulich wrote:
>>>> On 18.11.2022 15:27, Andrew Cooper wrote:
>>>>> I even got as far as writing that maybe leaving it as-is was the best
>>>>> option (principle of least surprise for Xen developers), but our
>>>>> "friend" apic acceleration strikes again.
>>>>>
>>>>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
>>>>> because microcode may have accelerated it.
>>>> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
>>>> VM exit.
>>>
>>> Intel isn't the only accelerated implementation, and there future
>>> details not in the public docs.
>>>
>>> There will be an implementation we will want to support where Xen
>>> doesn't get a vmexit for a write to SPIV.
>>
>> I see.
>>
>>>> If we didn't, how would our internal accounting of APIC enabled
>>>> state (VLAPIC_SW_DISABLED) work?
>>>
>>> It doesn't.
>>>
>>> One of many problems on the "known errors" list from an incomplete
>>> original attempt to get acceleration working.
>>>
>>> There's no good reason to cache those disables in the first place (both
>>> are both trivially available from other positions in memory), and
>>> correctness reasons not to.
>>>
>>>>  And the neighboring (to where I'm adding
>>>> the new code) pt_may_unmask_irq() call then also wouldn't occur.
>>>>
>>>> I'm actually pretty sure we do too much in this case - in particular none
>>>> of the vlapic_set_reg() should be necessary. But we certainly can't get
>>>> away with doing nothing, and hence we depend on that VM exit to actually
>>>> occur.
>>>
>>> We must do exactly and only what real hardware does, so that the state
>>> changes emulated by Xen are identical to those accelerated by microcode.
>>>
>>> Other than that, I really wouldn't make any presumptions about the
>>> existing vLAPIC logic being correct.
>>>
>>> It is, at best, enough of an approximation to the spec for major OSes to
>>> function, with multiple known bugs and no coherent testing.
>>
>> But can we leave resolving of the wider issue then separate, and leave
>> the change here as it presently is? Yes, mimic-ing the same behavior
>> later may be "interesting", but if we can't achieve consistent behavior
>> with yet more advanced acceleration, maybe we simply can't use that
>> (perhaps until a complete overhaul of everything involved in LAPIC
>> handling, possibly including a guest side indicator that they're happy
>> without the extra signaling, at which point that yet-more-advanced
>> acceleration could then be enabled for that guest).
>>
>> Otherwise - do you have any suggestion as to alternative logic which I
>> might use in this patch?
> 
> Maybe the underlying issue is that we allow executing
> HVMOP_set_evtchn_upcall_vector against remote vCPU.  This could be
> solved by only allowing HVMOP_set_evtchn_upcall_vector against the
> current vCPU and with the LAPIC enabled, but I guess we are too late
> for that.

Allowing things like this ahead of bringing a vCPU online can be
helpful for the OS side implementation, I think.

> Actually, what about only injecting the spurious event if the vCPU is
> online, ie:
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index ae4368ec4b..4b84706579 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4105,7 +4105,8 @@ static int hvmop_set_evtchn_upcall_vector(
>      printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, op.vector);
>  
>      v->arch.hvm.evtchn_upcall_vector = op.vector;
> -    hvm_assert_evtchn_irq(v);
> +    if ( is_vcpu_online(v) )
> +        hvm_assert_evtchn_irq(v);

While this would match what hvm_set_callback_via() does, see my post-
commit-message remark suggesting to key this to evtchn_upcall_pending.
Tying the call to the vCPU being online looks pretty arbitrary to me,
the more that this would continue to be
- racy with the vCPU coming online, and perhaps already being past
  VCPUOP_initialise - after all VCPUOP_up is little more than clearing
  VPF_down,
- problematic wrt evtchn_upcall_pending, once set, preventing event
  injection later on.
As you may have inferred already, I'm inclined to suggest to drop the
the is_vcpu_online() check from hvm_set_callback_via().

One related question here is whether vlapic_do_init() shouldn't have
the non-architectural side effect of clearing evtchn_upcall_pending.
While this again violates the principle of the hypervisor only ever
setting that bit, it would deal with the risk of no further event
injection once the flag is set, considering that vlapic_do_init()
clears IRR (and ISR).

Jan

> I think that should fix the issue seen on Linux.  We would
> additionally need to fix hvm_assert_evtchn_irq() to only set the
> vector if the vLAPIC is enabled.
> 
> Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
> On 23.11.2022 13:03, Roger Pau Monné wrote:
> > On Mon, Nov 21, 2022 at 01:34:38PM +0100, Jan Beulich wrote:
> >> On 21.11.2022 13:23, Andrew Cooper wrote:
> >>> On 21/11/2022 08:56, Jan Beulich wrote:
> >>>> On 18.11.2022 15:27, Andrew Cooper wrote:
> >>>>> I even got as far as writing that maybe leaving it as-is was the best
> >>>>> option (principle of least surprise for Xen developers), but our
> >>>>> "friend" apic acceleration strikes again.
> >>>>>
> >>>>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
> >>>>> because microcode may have accelerated it.
> >>>> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
> >>>> VM exit.
> >>>
> >>> Intel isn't the only accelerated implementation, and there future
> >>> details not in the public docs.
> >>>
> >>> There will be an implementation we will want to support where Xen
> >>> doesn't get a vmexit for a write to SPIV.
> >>
> >> I see.
> >>
> >>>> If we didn't, how would our internal accounting of APIC enabled
> >>>> state (VLAPIC_SW_DISABLED) work?
> >>>
> >>> It doesn't.
> >>>
> >>> One of many problems on the "known errors" list from an incomplete
> >>> original attempt to get acceleration working.
> >>>
> >>> There's no good reason to cache those disables in the first place (both
> >>> are both trivially available from other positions in memory), and
> >>> correctness reasons not to.
> >>>
> >>>>  And the neighboring (to where I'm adding
> >>>> the new code) pt_may_unmask_irq() call then also wouldn't occur.
> >>>>
> >>>> I'm actually pretty sure we do too much in this case - in particular none
> >>>> of the vlapic_set_reg() should be necessary. But we certainly can't get
> >>>> away with doing nothing, and hence we depend on that VM exit to actually
> >>>> occur.
> >>>
> >>> We must do exactly and only what real hardware does, so that the state
> >>> changes emulated by Xen are identical to those accelerated by microcode.
> >>>
> >>> Other than that, I really wouldn't make any presumptions about the
> >>> existing vLAPIC logic being correct.
> >>>
> >>> It is, at best, enough of an approximation to the spec for major OSes to
> >>> function, with multiple known bugs and no coherent testing.
> >>
> >> But can we leave resolving of the wider issue then separate, and leave
> >> the change here as it presently is? Yes, mimic-ing the same behavior
> >> later may be "interesting", but if we can't achieve consistent behavior
> >> with yet more advanced acceleration, maybe we simply can't use that
> >> (perhaps until a complete overhaul of everything involved in LAPIC
> >> handling, possibly including a guest side indicator that they're happy
> >> without the extra signaling, at which point that yet-more-advanced
> >> acceleration could then be enabled for that guest).
> >>
> >> Otherwise - do you have any suggestion as to alternative logic which I
> >> might use in this patch?
> > 
> > Maybe the underlying issue is that we allow executing
> > HVMOP_set_evtchn_upcall_vector against remote vCPU.  This could be
> > solved by only allowing HVMOP_set_evtchn_upcall_vector against the
> > current vCPU and with the LAPIC enabled, but I guess we are too late
> > for that.
> 
> Allowing things like this ahead of bringing a vCPU online can be
> helpful for the OS side implementation, I think.

I thinks it's more natural to do the vector callback setup as part of
the CPU bringup path, like any other CPU related setting that need to
be applied, but I guess that's a question of taste.

> > Actually, what about only injecting the spurious event if the vCPU is
> > online, ie:
> > 
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index ae4368ec4b..4b84706579 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -4105,7 +4105,8 @@ static int hvmop_set_evtchn_upcall_vector(
> >      printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, op.vector);
> >  
> >      v->arch.hvm.evtchn_upcall_vector = op.vector;
> > -    hvm_assert_evtchn_irq(v);
> > +    if ( is_vcpu_online(v) )
> > +        hvm_assert_evtchn_irq(v);
> 
> While this would match what hvm_set_callback_via() does, see my post-
> commit-message remark suggesting to key this to evtchn_upcall_pending.
> Tying the call to the vCPU being online looks pretty arbitrary to me,
> the more that this would continue to be
> - racy with the vCPU coming online, and perhaps already being past
>   VCPUOP_initialise - after all VCPUOP_up is little more than clearing
>   VPF_down,

If the OS attempts to setup the callback at the same time the CPU is
being brought online all bets are off I think, and the OS gets to keep
the pieces.

> - problematic wrt evtchn_upcall_pending, once set, preventing event
>   injection later on.
> As you may have inferred already, I'm inclined to suggest to drop the
> the is_vcpu_online() check from hvm_set_callback_via().
> 
> One related question here is whether vlapic_do_init() shouldn't have
> the non-architectural side effect of clearing evtchn_upcall_pending.
> While this again violates the principle of the hypervisor only ever
> setting that bit, it would deal with the risk of no further event
> injection once the flag is set, considering that vlapic_do_init()
> clears IRR (and ISR).

That would seem sensible to me, and was kind of what I was suggesting
in:

https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 24.11.2022 09:42, Roger Pau Monné wrote:
> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
>> One related question here is whether vlapic_do_init() shouldn't have
>> the non-architectural side effect of clearing evtchn_upcall_pending.
>> While this again violates the principle of the hypervisor only ever
>> setting that bit, it would deal with the risk of no further event
>> injection once the flag is set, considering that vlapic_do_init()
>> clears IRR (and ISR).
> 
> That would seem sensible to me, and was kind of what I was suggesting
> in:
> 
> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/

Oh, I didn't read it that way, as there you said "when it is enabled",
whereas here I'm suggesting to do so when it is being (re)initialized.
But, as said both above and in the reply, we'd need to be sure of the
resulting behavioral change (kind of an ABI one) being acceptable.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
> > - problematic wrt evtchn_upcall_pending, once set, preventing event
> >   injection later on.
> > As you may have inferred already, I'm inclined to suggest to drop the
> > the is_vcpu_online() check from hvm_set_callback_via().
> > 
> > One related question here is whether vlapic_do_init() shouldn't have
> > the non-architectural side effect of clearing evtchn_upcall_pending.
> > While this again violates the principle of the hypervisor only ever
> > setting that bit, it would deal with the risk of no further event
> > injection once the flag is set, considering that vlapic_do_init()
> > clears IRR (and ISR).
> 
> That would seem sensible to me, and was kind of what I was suggesting
> in:
> 
> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/

Another option would be for vcpu_mark_events_pending() to
unconditionally call hvm_assert_evtchn_irq() regardless of the state
of evtchn_upcall_pending.  This will create some spurious events.

Regards, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 24.11.2022 10:06, Roger Pau Monné wrote:
> On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
>> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
>>> - problematic wrt evtchn_upcall_pending, once set, preventing event
>>>   injection later on.
>>> As you may have inferred already, I'm inclined to suggest to drop the
>>> the is_vcpu_online() check from hvm_set_callback_via().
>>>
>>> One related question here is whether vlapic_do_init() shouldn't have
>>> the non-architectural side effect of clearing evtchn_upcall_pending.
>>> While this again violates the principle of the hypervisor only ever
>>> setting that bit, it would deal with the risk of no further event
>>> injection once the flag is set, considering that vlapic_do_init()
>>> clears IRR (and ISR).
>>
>> That would seem sensible to me, and was kind of what I was suggesting
>> in:
>>
>> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
> 
> Another option would be for vcpu_mark_events_pending() to
> unconditionally call hvm_assert_evtchn_irq() regardless of the state
> of evtchn_upcall_pending.

I think you said so before, and ...

>  This will create some spurious events.

... I continue to be afraid of s/some/many/. This event delivery is meant
to be kind of edge triggered, and I think it is a reasonable model to treat
the flag going from 0 to 1 as the "edge".

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Thu, Nov 24, 2022 at 10:11:05AM +0100, Jan Beulich wrote:
> On 24.11.2022 10:06, Roger Pau Monné wrote:
> > On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
> >> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
> >>> - problematic wrt evtchn_upcall_pending, once set, preventing event
> >>>   injection later on.
> >>> As you may have inferred already, I'm inclined to suggest to drop the
> >>> the is_vcpu_online() check from hvm_set_callback_via().
> >>>
> >>> One related question here is whether vlapic_do_init() shouldn't have
> >>> the non-architectural side effect of clearing evtchn_upcall_pending.
> >>> While this again violates the principle of the hypervisor only ever
> >>> setting that bit, it would deal with the risk of no further event
> >>> injection once the flag is set, considering that vlapic_do_init()
> >>> clears IRR (and ISR).
> >>
> >> That would seem sensible to me, and was kind of what I was suggesting
> >> in:
> >>
> >> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
> > 
> > Another option would be for vcpu_mark_events_pending() to
> > unconditionally call hvm_assert_evtchn_irq() regardless of the state
> > of evtchn_upcall_pending.
> 
> I think you said so before, and ...
> 
> >  This will create some spurious events.
> 
> ... I continue to be afraid of s/some/many/.

Not _that_ many I think, as we can only queue one pending interrupt in
IRR.

> This event delivery is meant
> to be kind of edge triggered, and I think it is a reasonable model to treat
> the flag going from 0 to 1 as the "edge".

Hm, it's a weird interrupt model I would say.  In some aspects it's
similar to level (as the line stays asserted until it's cleared), but
we don't get new interrupts injected into the APIC.

Maybe the right mode would be to treat evtchn_upcall_pending as a
level triggered line and keep injecting interrupts until the line is
deasserted (ie: evtchn_upcall_pending == 0)?

Sorry, I feel like I'm asking more questions that providing answers.

Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 24.11.2022 10:33, Roger Pau Monné wrote:
> On Thu, Nov 24, 2022 at 10:11:05AM +0100, Jan Beulich wrote:
>> On 24.11.2022 10:06, Roger Pau Monné wrote:
>>> On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
>>>> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
>>>>> - problematic wrt evtchn_upcall_pending, once set, preventing event
>>>>>   injection later on.
>>>>> As you may have inferred already, I'm inclined to suggest to drop the
>>>>> the is_vcpu_online() check from hvm_set_callback_via().
>>>>>
>>>>> One related question here is whether vlapic_do_init() shouldn't have
>>>>> the non-architectural side effect of clearing evtchn_upcall_pending.
>>>>> While this again violates the principle of the hypervisor only ever
>>>>> setting that bit, it would deal with the risk of no further event
>>>>> injection once the flag is set, considering that vlapic_do_init()
>>>>> clears IRR (and ISR).
>>>>
>>>> That would seem sensible to me, and was kind of what I was suggesting
>>>> in:
>>>>
>>>> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
>>>
>>> Another option would be for vcpu_mark_events_pending() to
>>> unconditionally call hvm_assert_evtchn_irq() regardless of the state
>>> of evtchn_upcall_pending.
>>
>> I think you said so before, and ...
>>
>>>  This will create some spurious events.
>>
>> ... I continue to be afraid of s/some/many/.
> 
> Not _that_ many I think, as we can only queue one pending interrupt in
> IRR.

We need to be careful here - the kernel treating it as "edge" (like
any other interrupt coming directly from the LAPIC), it ack-s it
before calling the handler, i.e. before evtchn_upcall_pending would
have a chance to be cleared. See Linux'es sysvec_xen_hvm_callback().

>> This event delivery is meant
>> to be kind of edge triggered, and I think it is a reasonable model to treat
>> the flag going from 0 to 1 as the "edge".
> 
> Hm, it's a weird interrupt model I would say.  In some aspects it's
> similar to level (as the line stays asserted until it's cleared), but
> we don't get new interrupts injected into the APIC.
> 
> Maybe the right mode would be to treat evtchn_upcall_pending as a
> level triggered line and keep injecting interrupts until the line is
> deasserted (ie: evtchn_upcall_pending == 0)?

As said above, and as also pointed out by Andrew on another sub-
thread when discussing the (dis)similarity with IO-APIC connected
interrupts, at the LAPIC there's no edge/level distinction anymore,
as we're not dealing with "asserted" signals there, but just with
messages sent on the bus.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Thu, Nov 24, 2022 at 12:16:13PM +0100, Jan Beulich wrote:
> On 24.11.2022 10:33, Roger Pau Monné wrote:
> > On Thu, Nov 24, 2022 at 10:11:05AM +0100, Jan Beulich wrote:
> >> On 24.11.2022 10:06, Roger Pau Monné wrote:
> >>> On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
> >>>> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
> >>>>> - problematic wrt evtchn_upcall_pending, once set, preventing event
> >>>>>   injection later on.
> >>>>> As you may have inferred already, I'm inclined to suggest to drop the
> >>>>> the is_vcpu_online() check from hvm_set_callback_via().
> >>>>>
> >>>>> One related question here is whether vlapic_do_init() shouldn't have
> >>>>> the non-architectural side effect of clearing evtchn_upcall_pending.
> >>>>> While this again violates the principle of the hypervisor only ever
> >>>>> setting that bit, it would deal with the risk of no further event
> >>>>> injection once the flag is set, considering that vlapic_do_init()
> >>>>> clears IRR (and ISR).
> >>>>
> >>>> That would seem sensible to me, and was kind of what I was suggesting
> >>>> in:
> >>>>
> >>>> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
> >>>
> >>> Another option would be for vcpu_mark_events_pending() to
> >>> unconditionally call hvm_assert_evtchn_irq() regardless of the state
> >>> of evtchn_upcall_pending.
> >>
> >> I think you said so before, and ...
> >>
> >>>  This will create some spurious events.
> >>
> >> ... I continue to be afraid of s/some/many/.
> > 
> > Not _that_ many I think, as we can only queue one pending interrupt in
> > IRR.
> 
> We need to be careful here - the kernel treating it as "edge" (like
> any other interrupt coming directly from the LAPIC), it ack-s it
> before calling the handler, i.e. before evtchn_upcall_pending would
> have a chance to be cleared. See Linux'es sysvec_xen_hvm_callback().

Hm, that's not how I handle it on FreeBSD, where the vector is acked
after calling the handler (evtchn_upcall_pending gets cleared before
the EOI).  Maybe there's some corner case I'm missing that requires
the EOI to be performed before clearing evtchn_upcall_pending?

> >> This event delivery is meant
> >> to be kind of edge triggered, and I think it is a reasonable model to treat
> >> the flag going from 0 to 1 as the "edge".
> > 
> > Hm, it's a weird interrupt model I would say.  In some aspects it's
> > similar to level (as the line stays asserted until it's cleared), but
> > we don't get new interrupts injected into the APIC.
> > 
> > Maybe the right mode would be to treat evtchn_upcall_pending as a
> > level triggered line and keep injecting interrupts until the line is
> > deasserted (ie: evtchn_upcall_pending == 0)?
> 
> As said above, and as also pointed out by Andrew on another sub-
> thread when discussing the (dis)similarity with IO-APIC connected
> interrupts, at the LAPIC there's no edge/level distinction anymore,
> as we're not dealing with "asserted" signals there, but just with
> messages sent on the bus.

Right, we could likely fake the behavior I've listed above, but not
sure it's worth it.

We should likely have some of this documented in xen.h next to the
declaration of evtchn_upcall_pending, at least to note that once
evtchn_upcall_pending is set no further vector callbacks will get
injected until the field is cleared.

Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 24.11.2022 16:12, Roger Pau Monné wrote:
> On Thu, Nov 24, 2022 at 12:16:13PM +0100, Jan Beulich wrote:
>> On 24.11.2022 10:33, Roger Pau Monné wrote:
>>> On Thu, Nov 24, 2022 at 10:11:05AM +0100, Jan Beulich wrote:
>>>> On 24.11.2022 10:06, Roger Pau Monné wrote:
>>>>> On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
>>>>>>> - problematic wrt evtchn_upcall_pending, once set, preventing event
>>>>>>>   injection later on.
>>>>>>> As you may have inferred already, I'm inclined to suggest to drop the
>>>>>>> the is_vcpu_online() check from hvm_set_callback_via().
>>>>>>>
>>>>>>> One related question here is whether vlapic_do_init() shouldn't have
>>>>>>> the non-architectural side effect of clearing evtchn_upcall_pending.
>>>>>>> While this again violates the principle of the hypervisor only ever
>>>>>>> setting that bit, it would deal with the risk of no further event
>>>>>>> injection once the flag is set, considering that vlapic_do_init()
>>>>>>> clears IRR (and ISR).
>>>>>>
>>>>>> That would seem sensible to me, and was kind of what I was suggesting
>>>>>> in:
>>>>>>
>>>>>> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
>>>>>
>>>>> Another option would be for vcpu_mark_events_pending() to
>>>>> unconditionally call hvm_assert_evtchn_irq() regardless of the state
>>>>> of evtchn_upcall_pending.
>>>>
>>>> I think you said so before, and ...
>>>>
>>>>>  This will create some spurious events.
>>>>
>>>> ... I continue to be afraid of s/some/many/.
>>>
>>> Not _that_ many I think, as we can only queue one pending interrupt in
>>> IRR.
>>
>> We need to be careful here - the kernel treating it as "edge" (like
>> any other interrupt coming directly from the LAPIC), it ack-s it
>> before calling the handler, i.e. before evtchn_upcall_pending would
>> have a chance to be cleared. See Linux'es sysvec_xen_hvm_callback().
> 
> Hm, that's not how I handle it on FreeBSD, where the vector is acked
> after calling the handler (evtchn_upcall_pending gets cleared before
> the EOI).  Maybe there's some corner case I'm missing that requires
> the EOI to be performed before clearing evtchn_upcall_pending?

I think for the purpose of the one vector doing the EOI late is okay,
but aiui the goal of doing it early for edge triggered interrupts in
general (and yet more generally as early as possible) is to unmask
lower priority vectors as well. Of course that's useful only if IRQs
as a whole are unmasked during (part of) the handling.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Fri, Nov 25, 2022 at 09:43:59AM +0100, Jan Beulich wrote:
> On 24.11.2022 16:12, Roger Pau Monné wrote:
> > On Thu, Nov 24, 2022 at 12:16:13PM +0100, Jan Beulich wrote:
> >> On 24.11.2022 10:33, Roger Pau Monné wrote:
> >>> On Thu, Nov 24, 2022 at 10:11:05AM +0100, Jan Beulich wrote:
> >>>> On 24.11.2022 10:06, Roger Pau Monné wrote:
> >>>>> On Thu, Nov 24, 2022 at 09:42:40AM +0100, Roger Pau Monné wrote:
> >>>>>> On Thu, Nov 24, 2022 at 08:59:00AM +0100, Jan Beulich wrote:
> >>>>>>> - problematic wrt evtchn_upcall_pending, once set, preventing event
> >>>>>>>   injection later on.
> >>>>>>> As you may have inferred already, I'm inclined to suggest to drop the
> >>>>>>> the is_vcpu_online() check from hvm_set_callback_via().
> >>>>>>>
> >>>>>>> One related question here is whether vlapic_do_init() shouldn't have
> >>>>>>> the non-architectural side effect of clearing evtchn_upcall_pending.
> >>>>>>> While this again violates the principle of the hypervisor only ever
> >>>>>>> setting that bit, it would deal with the risk of no further event
> >>>>>>> injection once the flag is set, considering that vlapic_do_init()
> >>>>>>> clears IRR (and ISR).
> >>>>>>
> >>>>>> That would seem sensible to me, and was kind of what I was suggesting
> >>>>>> in:
> >>>>>>
> >>>>>> https://lore.kernel.org/xen-devel/Y3eO0bMKRPYJc2yQ@Air-de-Roger/
> >>>>>
> >>>>> Another option would be for vcpu_mark_events_pending() to
> >>>>> unconditionally call hvm_assert_evtchn_irq() regardless of the state
> >>>>> of evtchn_upcall_pending.
> >>>>
> >>>> I think you said so before, and ...
> >>>>
> >>>>>  This will create some spurious events.
> >>>>
> >>>> ... I continue to be afraid of s/some/many/.
> >>>
> >>> Not _that_ many I think, as we can only queue one pending interrupt in
> >>> IRR.
> >>
> >> We need to be careful here - the kernel treating it as "edge" (like
> >> any other interrupt coming directly from the LAPIC), it ack-s it
> >> before calling the handler, i.e. before evtchn_upcall_pending would
> >> have a chance to be cleared. See Linux'es sysvec_xen_hvm_callback().
> > 
> > Hm, that's not how I handle it on FreeBSD, where the vector is acked
> > after calling the handler (evtchn_upcall_pending gets cleared before
> > the EOI).  Maybe there's some corner case I'm missing that requires
> > the EOI to be performed before clearing evtchn_upcall_pending?
> 
> I think for the purpose of the one vector doing the EOI late is okay,
> but aiui the goal of doing it early for edge triggered interrupts in
> general (and yet more generally as early as possible) is to unmask
> lower priority vectors as well.

My reasoning for doing it late was in order to avoid adding extra
latency to things like the timer handling, as the EOI will likely
trigger a vmexit.

> Of course that's useful only if IRQs
> as a whole are unmasked during (part of) the handling.

What do you mean with IRQs as a whole?  Are you referring to setting
the interrupt flag?

Thanks for the input, it's appreciated, and sorry for diverging the
conversation so much.

Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 25.11.2022 10:00, Roger Pau Monné wrote:
> On Fri, Nov 25, 2022 at 09:43:59AM +0100, Jan Beulich wrote:
>> On 24.11.2022 16:12, Roger Pau Monné wrote:
>>> On Thu, Nov 24, 2022 at 12:16:13PM +0100, Jan Beulich wrote:
>>>> We need to be careful here - the kernel treating it as "edge" (like
>>>> any other interrupt coming directly from the LAPIC), it ack-s it
>>>> before calling the handler, i.e. before evtchn_upcall_pending would
>>>> have a chance to be cleared. See Linux'es sysvec_xen_hvm_callback().
>>>
>>> Hm, that's not how I handle it on FreeBSD, where the vector is acked
>>> after calling the handler (evtchn_upcall_pending gets cleared before
>>> the EOI).  Maybe there's some corner case I'm missing that requires
>>> the EOI to be performed before clearing evtchn_upcall_pending?
>>
>> I think for the purpose of the one vector doing the EOI late is okay,
>> but aiui the goal of doing it early for edge triggered interrupts in
>> general (and yet more generally as early as possible) is to unmask
>> lower priority vectors as well.
> 
> My reasoning for doing it late was in order to avoid adding extra
> latency to things like the timer handling, as the EOI will likely
> trigger a vmexit.
> 
>> Of course that's useful only if IRQs
>> as a whole are unmasked during (part of) the handling.
> 
> What do you mean with IRQs as a whole?  Are you referring to setting
> the interrupt flag?

Yes (it being cleared).

> Thanks for the input, it's appreciated, and sorry for diverging the
> conversation so much.

I think that's quite fine, because aspects like the one still in context
are at least potentially relevant. Especially since here we're not
dealing with architectural behavior, but with an extrapolation thereof.
And views may (and apparently do) differ as to what the correct
extrapolation would be, even if just for corner aspects.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Fri, Nov 18, 2022 at 01:54:33PM +0100, Jan Beulich wrote:
> On 18.11.2022 13:33, Andrew Cooper wrote:
> > On 18/11/2022 10:31, Jan Beulich wrote:
> >> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> >> exposed a problem with the marking of the respective vector as
> >> pending: For quite some time Linux has been checking whether any stale
> >> ISR or IRR bits would still be set while preparing the LAPIC for use.
> >> This check is now triggering on the upcall vector, as the registration,
> >> at least for APs, happens before the LAPIC is actually enabled.
> >>
> >> In software-disabled state an LAPIC would not accept any interrupt
> >> requests and hence no IRR bit would newly become set while in this
> >> state. As a result it is also wrong for us to mark the upcall vector as
> >> having a pending request when the vLAPIC is in this state.
> > 
> > I agree with this.
> > 
> >> To compensate for the "enabled" check added to the assertion logic, add
> >> logic to (conditionally) mark the upcall vector as having a request
> >> pending at the time the LAPIC is being software-enabled by the guest.
> > 
> > But this, I don't think is appropriate.
> > 
> > The point of raising on enable is allegedly to work around setup race
> > conditions.  I'm unconvinced by this reasoning, but it is what it is,
> > and the stated behaviour is to raise there and then.
> > 
> > If a guest enables evtchn while the LAPIC is disabled, then the
> > interrupt is lost.  Like every other interrupt in an x86 system.
> 
> Edge triggered ones you mean, I suppose, but yes.
> 
> > I don't think there is any credible way a guest kernel author can expect
> > the weird evtchn edgecase to wait for an arbitrary point in the future,
> > and it's a corner case that I think is worth not keeping.
> 
> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
> after setting upcall vector"), referenced by the Fixes: tag? The issue is
> that with evtchn_upcall_pending once set, there would never again be a
> notification. So if what you say is to be the model we follow, then that
> earlier change was perhaps wrong as well. Instead it should then have
> been a guest change (as also implicit from your reply) to clear
> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
> enabling (here), perhaps by way of "manually" invoking the handling of
> that pending event, or by issuing a self-IPI with that vector.
> Especially the LAPIC enabling case would then be yet another Xen-specific
> on a guest code path which better wouldn't have to be aware of Xen.

Another option might be to clear evtchn_upcall_pending once the vLAPIC
is enabled, so that further setting of evtchn_upcall_pending will
inject the vector.  I'm worried however whether that could break
existing users, as this would be an interface behavior change.

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Jan Beulich 1 year, 8 months ago

On 18.11.2022 14:55, Roger Pau Monné wrote:
> On Fri, Nov 18, 2022 at 01:54:33PM +0100, Jan Beulich wrote:
>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>> exposed a problem with the marking of the respective vector as
>>>> pending: For quite some time Linux has been checking whether any stale
>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>> This check is now triggering on the upcall vector, as the registration,
>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>
>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>> requests and hence no IRR bit would newly become set while in this
>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>> having a pending request when the vLAPIC is in this state.
>>>
>>> I agree with this.
>>>
>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>> logic to (conditionally) mark the upcall vector as having a request
>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>>
>>> But this, I don't think is appropriate.
>>>
>>> The point of raising on enable is allegedly to work around setup race
>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>> and the stated behaviour is to raise there and then.
>>>
>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>> interrupt is lost.  Like every other interrupt in an x86 system.
>>
>> Edge triggered ones you mean, I suppose, but yes.
>>
>>> I don't think there is any credible way a guest kernel author can expect
>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>> and it's a corner case that I think is worth not keeping.
>>
>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>> that with evtchn_upcall_pending once set, there would never again be a
>> notification. So if what you say is to be the model we follow, then that
>> earlier change was perhaps wrong as well. Instead it should then have
>> been a guest change (as also implicit from your reply) to clear
>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>> enabling (here), perhaps by way of "manually" invoking the handling of
>> that pending event, or by issuing a self-IPI with that vector.
>> Especially the LAPIC enabling case would then be yet another Xen-specific
>> on a guest code path which better wouldn't have to be aware of Xen.
> 
> Another option might be to clear evtchn_upcall_pending once the vLAPIC
> is enabled, so that further setting of evtchn_upcall_pending will
> inject the vector.  I'm worried however whether that could break
> existing users, as this would be an interface behavior change.

You mean _Xen_ clearing the flag? No, that breaks firmly documented
behavior. Xen only ever sets this field.

Jan

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Fri, Nov 18, 2022 at 02:58:10PM +0100, Jan Beulich wrote:
> On 18.11.2022 14:55, Roger Pau Monné wrote:
> > On Fri, Nov 18, 2022 at 01:54:33PM +0100, Jan Beulich wrote:
> >> On 18.11.2022 13:33, Andrew Cooper wrote:
> >>> On 18/11/2022 10:31, Jan Beulich wrote:
> >>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> >>>> exposed a problem with the marking of the respective vector as
> >>>> pending: For quite some time Linux has been checking whether any stale
> >>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
> >>>> This check is now triggering on the upcall vector, as the registration,
> >>>> at least for APs, happens before the LAPIC is actually enabled.
> >>>>
> >>>> In software-disabled state an LAPIC would not accept any interrupt
> >>>> requests and hence no IRR bit would newly become set while in this
> >>>> state. As a result it is also wrong for us to mark the upcall vector as
> >>>> having a pending request when the vLAPIC is in this state.
> >>>
> >>> I agree with this.
> >>>
> >>>> To compensate for the "enabled" check added to the assertion logic, add
> >>>> logic to (conditionally) mark the upcall vector as having a request
> >>>> pending at the time the LAPIC is being software-enabled by the guest.
> >>>
> >>> But this, I don't think is appropriate.
> >>>
> >>> The point of raising on enable is allegedly to work around setup race
> >>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
> >>> and the stated behaviour is to raise there and then.
> >>>
> >>> If a guest enables evtchn while the LAPIC is disabled, then the
> >>> interrupt is lost.  Like every other interrupt in an x86 system.
> >>
> >> Edge triggered ones you mean, I suppose, but yes.
> >>
> >>> I don't think there is any credible way a guest kernel author can expect
> >>> the weird evtchn edgecase to wait for an arbitrary point in the future,
> >>> and it's a corner case that I think is worth not keeping.
> >>
> >> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
> >> after setting upcall vector"), referenced by the Fixes: tag? The issue is
> >> that with evtchn_upcall_pending once set, there would never again be a
> >> notification. So if what you say is to be the model we follow, then that
> >> earlier change was perhaps wrong as well. Instead it should then have
> >> been a guest change (as also implicit from your reply) to clear
> >> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
> >> enabling (here), perhaps by way of "manually" invoking the handling of
> >> that pending event, or by issuing a self-IPI with that vector.
> >> Especially the LAPIC enabling case would then be yet another Xen-specific
> >> on a guest code path which better wouldn't have to be aware of Xen.
> > 
> > Another option might be to clear evtchn_upcall_pending once the vLAPIC
> > is enabled, so that further setting of evtchn_upcall_pending will
> > inject the vector.  I'm worried however whether that could break
> > existing users, as this would be an interface behavior change.
> 
> You mean _Xen_ clearing the flag? No, that breaks firmly documented
> behavior. Xen only ever sets this field.

So the only other option would be for Xen to ignore
evtchn_upcall_pending and always inject the interrupt in
vcpu_mark_events_pending(), but that would then lead to spurious
interrupts if an event channel triggers while the pending upcall
vector is still set in the ISR and evtchn_upcall_pending has already
been cleared.

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Roger Pau Monné 1 year, 8 months ago

On Fri, Nov 18, 2022 at 12:33:00PM +0000, Andrew Cooper wrote:
> On 18/11/2022 10:31, Jan Beulich wrote:
> > Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> > exposed a problem with the marking of the respective vector as
> > pending: For quite some time Linux has been checking whether any stale
> > ISR or IRR bits would still be set while preparing the LAPIC for use.
> > This check is now triggering on the upcall vector, as the registration,
> > at least for APs, happens before the LAPIC is actually enabled.
> >
> > In software-disabled state an LAPIC would not accept any interrupt
> > requests and hence no IRR bit would newly become set while in this
> > state. As a result it is also wrong for us to mark the upcall vector as
> > having a pending request when the vLAPIC is in this state.
> 
> I agree with this.
> 
> > To compensate for the "enabled" check added to the assertion logic, add
> > logic to (conditionally) mark the upcall vector as having a request
> > pending at the time the LAPIC is being software-enabled by the guest.
> 
> But this, I don't think is appropriate.
> 
> The point of raising on enable is allegedly to work around setup race
> conditions.  I'm unconvinced by this reasoning, but it is what it is,
> and the stated behaviour is to raise there and then.
> 
> If a guest enables evtchn while the LAPIC is disabled, then the
> interrupt is lost.  Like every other interrupt in an x86 system.
> 
> I don't think there is any credible way a guest kernel author can expect
> the weird evtchn edgecase to wait for an arbitrary point in the future,
> and it's a corner case that I think is worth not keeping.

We would then need some kind of fix in order to clear
evtchn_upcall_pending, because having that set without an interrupt
pending on the vLAPIC will result in no further event channel callback
interrupts being injected (see vcpu_mark_events_pending()).

Maybe we want to change vcpu_mark_events_pending() so that it always
tries to inject the vector even if evtchn_upcall_pending is already
set by calling hvm_assert_evtchn_irq() unconditionally?

Thanks, Roger.

Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled

Posted by Juergen Gross 1 year, 8 months ago

On 18.11.22 11:31, Jan Beulich wrote:
> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
> exposed a problem with the marking of the respective vector as
> pending: For quite some time Linux has been checking whether any stale
> ISR or IRR bits would still be set while preparing the LAPIC for use.
> This check is now triggering on the upcall vector, as the registration,
> at least for APs, happens before the LAPIC is actually enabled.
> 
> In software-disabled state an LAPIC would not accept any interrupt
> requests and hence no IRR bit would newly become set while in this
> state. As a result it is also wrong for us to mark the upcall vector as
> having a pending request when the vLAPIC is in this state.
> 
> To compensate for the "enabled" check added to the assertion logic, add
> logic to (conditionally) mark the upcall vector as having a request
> pending at the time the LAPIC is being software-enabled by the guest.
> 
> Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen