[PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding

Dmytro Maluka posted 5 patches 6 days, 5 hours ago
[PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 6 days, 5 hours ago
The existing KVM mechanism for forwarding of level-triggered interrupts
using resample eventfd doesn't work quite correctly in the case of
interrupts that are handled in a Linux guest as oneshot interrupts
(IRQF_ONESHOT). Such an interrupt is acked to the device in its
threaded irq handler, i.e. later than it is acked to the interrupt
controller (EOI at the end of hardirq), not earlier. The existing KVM
code doesn't take that into account, which results in erroneous extra
interrupts in the guest caused by premature re-assert of an
unacknowledged IRQ by the host.

This patch series fixes this issue (for now on x86 only) by checking if
the interrupt is unmasked when we receive irq ack (EOI) and, in case
it's masked, postponing resamplefd notify until the guest unmasks it.
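
The postponement described above can be sketched as a small state machine.
This is an illustrative userspace model only, not the code from patch 3:
the struct and function names (resampler_model, model_ack,
model_mask_notify) are hypothetical, and a counter stands in for
signalling the resamplefd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one resampler; not the kernel's struct. */
struct resampler_model {
	bool masked;       /* last mask state reported by the mask notifier */
	bool pending;      /* an EOI arrived while the IRQ was masked */
	int  notify_count; /* times the resamplefd would have been signalled */
};

/* Stand-in for signalling the resamplefd. */
static void model_notify_resamplefd(struct resampler_model *r)
{
	r->notify_count++;
}

/* Guest EOI: notify right away only if the guest has the IRQ unmasked. */
static void model_ack(struct resampler_model *r)
{
	if (r->masked)
		r->pending = true;
	else
		model_notify_resamplefd(r);
}

/* Guest changed the mask bit: deliver a postponed notify on unmask. */
static void model_mask_notify(struct resampler_model *r, bool masked)
{
	r->masked = masked;
	if (!masked && r->pending) {
		r->pending = false;
		model_notify_resamplefd(r);
	}
	/* Invariant: a postponed notify can only be outstanding while masked. */
	assert(!r->pending || r->masked);
}
```

In this model, an EOI that arrives while the IRQ is masked produces no
notification; the notification is delivered exactly once when the guest
unmasks.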

Patches 1 and 2 extend the existing support for irq mask notifiers in
KVM, which is a prerequisite needed for KVM irqfd to use mask notifiers
to know when an interrupt is masked or unmasked.

Patch 3 implements the actual fix: postponing resamplefd notify in irqfd
until the irq is unmasked.

Patches 4 and 5 just do some optional renaming for consistency, as we
are now using irq mask notifiers in irqfd along with irq ack notifiers.

Please see individual patches for more details.

v2:
  - Fixed compilation failure on non-x86: mask_notifier_list moved from
    x86 "struct kvm_arch" to generic "struct kvm".
  - kvm_fire_mask_notifiers() also moved from x86 to generic code, even
    though it is not called on other architectures for now.
  - Implemented kvm_register_and_fire_irq_mask_notifier() instead of
    kvm_irq_is_masked() to fix a potential race when reading the
    initial IRQ mask state.
  - Renamed for clarity:
      - irqfd_resampler_mask() -> irqfd_resampler_mask_notify()
      - kvm_irq_has_notifier() -> kvm_irq_has_ack_notifier()
      - resampler->notifier -> resampler->ack_notifier
  - Reorganized code in irqfd_resampler_ack() and
    irqfd_resampler_mask_notify() to make it easier to follow.
  - Don't follow unwanted "return type on separate line" style for
    irqfd_resampler_mask_notify().
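
The race mentioned for kvm_irq_is_masked() is the usual read-then-register
problem: if the mask state is read first and the notifier is registered
afterwards, a mask change in between is missed. Below is a minimal sketch
of the register-and-fire idea, assuming a single lock that serializes both
registration and mask updates. All names are hypothetical, and a pthread
mutex stands in for whatever locking the real patch uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical notifier: remembers the last mask state it was told about. */
struct mask_notifier_model {
	void (*func)(struct mask_notifier_model *n, bool masked);
	struct mask_notifier_model *next;
	bool last_seen_masked;
};

/* Hypothetical per-IRQ state guarded by one lock. */
struct irq_model {
	pthread_mutex_t lock;
	bool masked;
	struct mask_notifier_model *notifiers;
};

static void model_store_mask_state(struct mask_notifier_model *n, bool masked)
{
	n->last_seen_masked = masked;
}

/*
 * Register the notifier and report the current mask state to it in one
 * critical section, so no mask change can slip in between the two steps.
 */
static void model_register_and_fire(struct irq_model *irq,
				    struct mask_notifier_model *n)
{
	assert(n->func != NULL);
	pthread_mutex_lock(&irq->lock);
	n->next = irq->notifiers;
	irq->notifiers = n;
	n->func(n, irq->masked);
	pthread_mutex_unlock(&irq->lock);
}

/* Mask updates take the same lock and fire all registered notifiers. */
static void model_set_masked(struct irq_model *irq, bool masked)
{
	struct mask_notifier_model *n;

	pthread_mutex_lock(&irq->lock);
	irq->masked = masked;
	for (n = irq->notifiers; n != NULL; n = n->next)
		n->func(n, masked);
	pthread_mutex_unlock(&irq->lock);
}
```

Because both paths hold the same lock, the notifier's view of the mask
state is correct from the moment it is registered.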

Dmytro Maluka (5):
  KVM: x86: Move irq mask notifiers from x86 to generic KVM
  KVM: x86: Add kvm_register_and_fire_irq_mask_notifier()
  KVM: irqfd: Postpone resamplefd notify for oneshot interrupts
  KVM: irqfd: Rename resampler->notifier
  KVM: Rename kvm_irq_has_notifier()

 arch/x86/include/asm/kvm_host.h |  17 +---
 arch/x86/kvm/i8259.c            |   6 ++
 arch/x86/kvm/ioapic.c           |   8 +-
 arch/x86/kvm/ioapic.h           |   1 +
 arch/x86/kvm/irq_comm.c         |  74 +++++++++++------
 arch/x86/kvm/x86.c              |   1 -
 include/linux/kvm_host.h        |  21 ++++-
 include/linux/kvm_irqfd.h       |  16 +++-
 virt/kvm/eventfd.c              | 136 ++++++++++++++++++++++++++++----
 virt/kvm/kvm_main.c             |   1 +
 10 files changed, 221 insertions(+), 60 deletions(-)

-- 
2.37.1.559.g78731f0fdb-goog
RE: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dong, Eddie 3 days, 1 hour ago
> 
> The existing KVM mechanism for forwarding of level-triggered interrupts using
> resample eventfd doesn't work quite correctly in the case of interrupts that are
> handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an
> interrupt is acked to the device in its threaded irq handler, i.e. later than it is
> acked to the interrupt controller (EOI at the end of hardirq), not earlier. The
> existing KVM code doesn't take that into account, which results in erroneous
> extra interrupts in the guest caused by premature re-assert of an
> unacknowledged IRQ by the host.

Interesting...  How does it behave on the native side?


Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 2 days, 17 hours ago
On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>
>> The existing KVM mechanism for forwarding of level-triggered interrupts using
>> resample eventfd doesn't work quite correctly in the case of interrupts that are
>> handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an
>> interrupt is acked to the device in its threaded irq handler, i.e. later than it is
>> acked to the interrupt controller (EOI at the end of hardirq), not earlier. The
>> existing KVM code doesn't take that into account, which results in erroneous
>> extra interrupts in the guest caused by premature re-assert of an
>> unacknowledged IRQ by the host.
> 
> Interesting...  How it behaviors in native side? 

In native it behaves correctly, since Linux masks such a oneshot
interrupt at the beginning of hardirq, so that the EOI at the end of
hardirq doesn't result in its immediate re-assert, and then unmasks it
later, after its threaded irq handler completes.

In handle_fasteoi_irq():

	if (desc->istate & IRQS_ONESHOT)
		mask_irq(desc);

	handle_irq_event(desc);

	cond_unmask_eoi_irq(desc, chip);


and later in unmask_threaded_irq():

	unmask_irq(desc);

I also mentioned that in patch #3 description:
"Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we don't check that the interrupt is still masked in the guest at the
moment of EOI. Resamplefd is notified regardless, so vfio prematurely
unmasks the host physical IRQ, thus a new (unwanted) physical interrupt
is generated in the host and queued for injection to the guest."
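
To make the native sequence concrete, here is a toy model of a
level-triggered line, assuming a controller that delivers whenever the
line is asserted, unmasked, and not in service. All names are hypothetical
and the model ignores everything else a real IOAPIC does. The variant
without masking approximates what happens when nothing holds off
re-assertion between the EOI and the device ack.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one level-triggered interrupt line. */
struct line_model {
	bool device_asserting; /* device holds the line until it is acked */
	bool masked;           /* mask bit in the interrupt controller */
	bool in_service;       /* delivered but not yet EOI'd */
	int  delivered;        /* interrupts delivered to the CPU */
};

/* Deliver if the line is asserted, unmasked, and not already in service. */
static void model_evaluate(struct line_model *l)
{
	if (l->device_asserting && !l->masked && !l->in_service) {
		l->in_service = true;
		l->delivered++;
	}
}

static void model_device_raise(struct line_model *l)
{
	l->device_asserting = true;
	model_evaluate(l);
}

static void model_set_mask(struct line_model *l, bool masked)
{
	l->masked = masked;
	model_evaluate(l);
}

static void model_eoi(struct line_model *l)
{
	l->in_service = false;
	model_evaluate(l);
}

static void model_device_ack(struct line_model *l)
{
	l->device_asserting = false;
}

/* Run one oneshot interrupt cycle; return how many interrupts fired. */
static int model_oneshot_cycle(bool mask_during_hardirq)
{
	struct line_model l = { false, false, false, 0 };

	model_device_raise(&l);              /* hardirq starts */
	assert(l.delivered == 1);
	if (mask_during_hardirq)
		model_set_mask(&l, true);    /* mask at start of hardirq */
	model_eoi(&l);                       /* EOI at end of hardirq */
	model_device_ack(&l);                /* threaded handler acks device */
	if (mask_during_hardirq)
		model_set_mask(&l, false);   /* unmask after the thread */
	return l.delivered;
}
```

With masking, the cycle delivers a single interrupt. Without it, the EOI
finds the line still asserted and a second, unwanted interrupt is
delivered before the threaded handler has acked the device.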

RE: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dong, Eddie 2 days, 4 hours ago

> -----Original Message-----
> From: Dmytro Maluka <dmy@semihalf.com>
> Sent: Tuesday, August 9, 2022 12:24 AM
> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
> kvm@vger.kernel.org
> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> 
> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>
> >> The existing KVM mechanism for forwarding of level-triggered
> >> interrupts using resample eventfd doesn't work quite correctly in the
> >> case of interrupts that are handled in a Linux guest as oneshot
> >> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >> in its threaded irq handler, i.e. later than it is acked to the
> >> interrupt controller (EOI at the end of hardirq), not earlier. The
> >> existing KVM code doesn't take that into account, which results in
> >> erroneous extra interrupts in the guest caused by premature re-assert of an
> unacknowledged IRQ by the host.
> >
> > Interesting...  How it behaviors in native side?
> 
> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> immediate re-assert, and then unmasks it later, after its threaded irq handler
> completes.
> 
> In handle_fasteoi_irq():
> 
> 	if (desc->istate & IRQS_ONESHOT)
> 		mask_irq(desc);
> 
> 	handle_irq_event(desc);
> 
> 	cond_unmask_eoi_irq(desc, chip);
> 
> 
> and later in unmask_threaded_irq():
> 
> 	unmask_irq(desc);
> 
> I also mentioned that in patch #3 description:
> "Linux keeps such interrupt masked until its threaded handler finishes, to
> prevent the EOI from re-asserting an unacknowledged interrupt.

That makes sense. Can you include the full story in the cover letter too?


> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> check that the interrupt is still masked in the guest at the moment of EOI.
> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> and queued for injection to the guest."
> 

Emulation of level-triggered IRQs is a pain point ☹
As I read it, we need to emulate the "level" of the IRQ pin (connecting the device to the IRQ chip, i.e. the ioapic here).
Technically, the guest can change the polarity of the vIOAPIC, which will lead to a new virtual IRQ
even without a host-side interrupt.
The "pending" field of kvm_kernel_irqfd_resampler in patch 3 represents an event rather than an interrupt level.


Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 2 days, 1 hour ago
On 8/9/22 10:01 PM, Dong, Eddie wrote:
>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>
>>>> The existing KVM mechanism for forwarding of level-triggered
>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>> existing KVM code doesn't take that into account, which results in
>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>> unacknowledged IRQ by the host.
>>>
>>> Interesting...  How it behaviors in native side?
>>
>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>> completes.
>>
>> In handle_fasteoi_irq():
>>
>> 	if (desc->istate & IRQS_ONESHOT)
>> 		mask_irq(desc);
>>
>> 	handle_irq_event(desc);
>>
>> 	cond_unmask_eoi_irq(desc, chip);
>>
>>
>> and later in unmask_threaded_irq():
>>
>> 	unmask_irq(desc);
>>
>> I also mentioned that in patch #3 description:
>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>> prevent the EOI from re-asserting an unacknowledged interrupt.
> 
> That makes sense. Can you include the full story in cover letter too?

Ok, I will.

> 
> 
>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>> check that the interrupt is still masked in the guest at the moment of EOI.
>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>> and queued for injection to the guest."
>>
> 
> Emulation of level triggered IRQ is a pain point ☹
> I read we need to emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e. ioapic here).
> Technically, the guest can change the polarity of vIOAPIC, which will lead to a new  virtual IRQ 
> even w/o host side interrupt.  

Thanks, interesting point. Do you mean that this behavior (a new vIRQ as
a result of polarity change) may already happen with the existing KVM code?

It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC
polarity bit, in particular it doesn't handle change of the polarity by
the guest (i.e. doesn't update the virtual IRR register, and so on), so
it shouldn't result in a new interrupt.

Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there
seems to be an assumption that KVM interprets the IRQ level value as
active (asserted) vs inactive (deasserted) rather than high vs low, i.e.
the polarity doesn't matter to KVM.

So, since both sides (KVM emulating the IOAPIC, and vfio/whatever
emulating an external interrupt source) seem to operate on a level of
abstraction of "asserted" vs "de-asserted" interrupt state regardless of
the polarity, and that seems not a bug but a feature, it seems that we
don't need to emulate the IRQ level as such. Or am I missing something?

OTOH, I guess this means that the existing KVM's emulation of
level-triggered interrupts is somewhat limited (a guest may legitimately
expect an interrupt fired as a result of polarity change, and that case
is not supported by KVM). But that is rather out of scope of the oneshot
interrupts issue addressed by this patchset.

RE: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dong, Eddie 1 day, 7 hours ago
> >
> >
> >> However, with KVM + vfio (or whatever is listening on the resamplefd)
> >> we don't check that the interrupt is still masked in the guest at the moment
> of EOI.
> >> Resamplefd is notified regardless, so vfio prematurely unmasks the
> >> host physical IRQ, thus a new (unwanted) physical interrupt is
> >> generated in the host and queued for injection to the guest."
> >>
> >
> > Emulation of level triggered IRQ is a pain point ☹ I read we need to
> > emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e.
> ioapic here).
> > Technically, the guest can change the polarity of vIOAPIC, which will
> > lead to a new  virtual IRQ even w/o host side interrupt.
> 
> Thanks, interesting point. Do you mean that this behavior (a new vIRQ as a
> result of polarity change) may already happen with the existing KVM code?
> 
> It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC polarity
> bit, in particular it doesn't handle change of the polarity by the guest (i.e.
> doesn't update the virtual IRR register, and so on), so it shouldn't result in a
> new interrupt.

Correct, KVM doesn't handle polarity now. Probably because commercial OSes are
unlikely to change polarity.

> 
> Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there seems to
> be an assumption that KVM interpretes the IRQ level value as active (asserted)
> vs inactive (deasserted) rather than high vs low, i.e.

Asserted/deasserted vs. high/low is the same to me, though asserted/deasserted hints more at an event than a state.

> the polarity doesn't matter to KVM.
> 
> So, since both sides (KVM emulating the IOAPIC, and vfio/whatever emulating
> an external interrupt source) seem to operate on a level of abstraction of
> "asserted" vs "de-asserted" interrupt state regardless of the polarity, and that
> seems not a bug but a feature, it seems that we don't need to emulate the IRQ
> level as such. Or am I missing something?

Yes, I know current KVM doesn't handle it.  Whether we should support it is another story, which I cannot speak for.
Paolo and Alex are the right people 😊
The reason I mention this is that the complexity of adding a pending event vs. supporting an interrupt pin state is the same.
I am wondering if we need to revisit it or not.  Behavior closer to real hardware helps us avoid potential issues IMO, but I am fine with either choice.

> 
> OTOH, I guess this means that the existing KVM's emulation of level-triggered
> interrupts is somewhat limited (a guest may legitimately expect an interrupt
> fired as a result of polarity change, and that case is not supported by KVM). But
> that is rather out of scope of the oneshot interrupts issue addressed by this
> patchset.

Agree.
I don't know of any commercial OSes that change polarity either. But I know the Xen hypervisor uses polarity under certain conditions.
One day, we may see the issue when running Xen as an L1 hypervisor.  But this is not a current worry.


> 
> > "pending" field of kvm_kernel_irqfd_resampler in patch 3 means more an
> event rather than an interrupt level.

I know.  I am fine with either.

Thanks, Eddie

Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 1 day, 7 hours ago
On 8/10/22 7:17 PM, Dong, Eddie wrote:
>>>
>>>
>>>> However, with KVM + vfio (or whatever is listening on the resamplefd)
>>>> we don't check that the interrupt is still masked in the guest at the moment
>> of EOI.
>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the
>>>> host physical IRQ, thus a new (unwanted) physical interrupt is
>>>> generated in the host and queued for injection to the guest."
>>>>
>>>
>>> Emulation of level triggered IRQ is a pain point ☹ I read we need to
>>> emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e.
>> ioapic here).
>>> Technically, the guest can change the polarity of vIOAPIC, which will
>>> lead to a new  virtual IRQ even w/o host side interrupt.
>>
>> Thanks, interesting point. Do you mean that this behavior (a new vIRQ as a
>> result of polarity change) may already happen with the existing KVM code?
>>
>> It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC polarity
>> bit, in particular it doesn't handle change of the polarity by the guest (i.e.
>> doesn't update the virtual IRR register, and so on), so it shouldn't result in a
>> new interrupt.
> 
> Correct, KVM doesn't handle polarity now. Probably because unlikely the commercial OSes 
> will change polarity.
> 
>>
>> Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there seems to
>> be an assumption that KVM interpretes the IRQ level value as active (asserted)
>> vs inactive (deasserted) rather than high vs low, i.e.
> 
> Asserted/deasserted vs. high/low is same to me, though asserted/deasserted hints more for event rather than state.
> 
>> the polarity doesn't matter to KVM.
>>
>> So, since both sides (KVM emulating the IOAPIC, and vfio/whatever emulating
>> an external interrupt source) seem to operate on a level of abstraction of
>> "asserted" vs "de-asserted" interrupt state regardless of the polarity, and that
>> seems not a bug but a feature, it seems that we don't need to emulate the IRQ
>> level as such. Or am I missing something?
> 
> YES, I know current KVM doesn't handle it.  Whether we should support it is another story which I cannot speak for.
> Paolo and Alex are the right person 😊
> The reason I mention this is because the complexity to adding a pending event vs. supporting a interrupt pin state is same.
> I am wondering if we need to revisit it or not.  Behavior closing to real hardware helps us to avoid potential issues IMO, but I am fine to either choice.

I guess that would imply revisiting the KVM irqfd interface, since its
design is based on events rather than states, even for level-triggered
interrupts:

- trigger event (from vfio to KVM) to assert an IRQ
- resample event (from KVM to vfio) to de-assert an IRQ
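
The event-based nature of this interface comes from eventfd itself, which
carries a counter of signals rather than a line state. Below is a minimal
userspace sketch, assuming Linux and using only eventfd(2), read(2), and
write(2). The helper names are hypothetical and no KVM or vfio ioctl is
involved.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd with an initial counter of zero. */
static int model_event_open(void)
{
	int fd = eventfd(0, EFD_NONBLOCK);

	assert(fd >= 0);
	return fd;
}

/* Signal one event: adds 1 to the eventfd counter. Returns 0 on success. */
static int model_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consume pending events: returns the accumulated counter and resets it,
 * or 0 if nothing was signalled (the non-blocking read fails with EAGAIN).
 */
static uint64_t model_consume(int fd)
{
	uint64_t count = 0;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Two signals before a read are seen as a single wakeup with a count of 2,
and there is no way to express "the line is now low" other than by
sending a separate event on another eventfd, which is the role the
resamplefd plays here.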

RE: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dong, Eddie 1 day, 6 hours ago
> On 8/10/22 7:17 PM, Dong, Eddie wrote:
> >>>
> >>>
> >>>> However, with KVM + vfio (or whatever is listening on the
> >>>> resamplefd) we don't check that the interrupt is still masked in
> >>>> the guest at the moment
> >> of EOI.
> >>>> Resamplefd is notified regardless, so vfio prematurely unmasks the
> >>>> host physical IRQ, thus a new (unwanted) physical interrupt is
> >>>> generated in the host and queued for injection to the guest."
> >>>>
> >>>
> >>> Emulation of level triggered IRQ is a pain point ☹ I read we need to
> >>> emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e.
> >> ioapic here).
> >>> Technically, the guest can change the polarity of vIOAPIC, which
> >>> will lead to a new  virtual IRQ even w/o host side interrupt.
> >>
> >> Thanks, interesting point. Do you mean that this behavior (a new vIRQ
> >> as a result of polarity change) may already happen with the existing KVM
> code?
> >>
> >> It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC
> >> polarity bit, in particular it doesn't handle change of the polarity by the guest
> (i.e.
> >> doesn't update the virtual IRR register, and so on), so it shouldn't
> >> result in a new interrupt.
> >
> > Correct, KVM doesn't handle polarity now. Probably because unlikely
> > the commercial OSes will change polarity.
> >
> >>
> >> Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there
> >> seems to be an assumption that KVM interpretes the IRQ level value as
> >> active (asserted) vs inactive (deasserted) rather than high vs low, i.e.
> >
> > Asserted/deasserted vs. high/low is same to me, though
> asserted/deasserted hints more for event rather than state.
> >
> >> the polarity doesn't matter to KVM.
> >>
> >> So, since both sides (KVM emulating the IOAPIC, and vfio/whatever
> >> emulating an external interrupt source) seem to operate on a level of
> >> abstraction of "asserted" vs "de-asserted" interrupt state regardless
> >> of the polarity, and that seems not a bug but a feature, it seems
> >> that we don't need to emulate the IRQ level as such. Or am I missing
> something?
> >
> > YES, I know current KVM doesn't handle it.  Whether we should support it is
> another story which I cannot speak for.
> > Paolo and Alex are the right person 😊
> > The reason I mention this is because the complexity to adding a pending
> event vs. supporting a interrupt pin state is same.
> > I am wondering if we need to revisit it or not.  Behavior closing to real
> hardware helps us to avoid potential issues IMO, but I am fine to either choice.
> 
> I guess that would imply revisiting KVM irqfd interface, since its design is based
> rather on events than states, even for level-triggered
> interrupts:

We can read it as two different kinds of event: an IRQ fire/no-fire event, and a state-change event (for consumers to maintain an internal assert/deassert state).
If we switch from the former to the latter, do we need to change the interface?

This probably needs Paolo and Alex to give a clear direction, given that the ARM64 side seems to have a similar state concept too.

Thanks Dmytro!

Eddie

> 
> - trigger event (from vfio to KVM) to assert an IRQ
> - resample event (from KVM to vfio) to de-assert an IRQ
> 
> >
> >>
> >> OTOH, I guess this means that the existing KVM's emulation of
> >> level-triggered interrupts is somewhat limited (a guest may
> >> legitimately expect an interrupt fired as a result of polarity
> >> change, and that case is not supported by KVM). But that is rather
> >> out of scope of the oneshot interrupts issue addressed by this patchset.
> >
> > Agree.
> > I didn't know any commercial OSes change polarity either. But I know Xen
> hypervisor uses polarity under certain condition.
> > One day, we may see the issue when running Xen as a L1 hypervisor.  But this
> is not the current worry.
> >
> >
> >>
> >>> "pending" field of kvm_kernel_irqfd_resampler in patch 3 means more
> >>> an
> >> event rather than an interrupt level.
> >
> > I know.  I am fine either.
> >
> > Thanks Eddie
> >
> >>>
> >>>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Marc Zyngier 1 day, 17 hours ago
On Wed, 10 Aug 2022 00:30:29 +0100,
Dmytro Maluka <dmy@semihalf.com> wrote:
> 
> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> > 
> > 
> >> -----Original Message-----
> >> From: Dmytro Maluka <dmy@semihalf.com>
> >> Sent: Tuesday, August 9, 2022 12:24 AM
> >> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> >> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
> >> kvm@vger.kernel.org
> >> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> >> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> >> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> >> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
> >> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
> >> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
> >> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
> >> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
> >> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>
> >> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>
> >>>> The existing KVM mechanism for forwarding of level-triggered
> >>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>> existing KVM code doesn't take that into account, which results in
> >>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >> unacknowledged IRQ by the host.
> >>>
> >>> Interesting...  How it behaviors in native side?
> >>
> >> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >> completes.
> >>
> >> In handle_fasteoi_irq():
> >>
> >> 	if (desc->istate & IRQS_ONESHOT)
> >> 		mask_irq(desc);
> >>
> >> 	handle_irq_event(desc);
> >>
> >> 	cond_unmask_eoi_irq(desc, chip);
> >>
> >>
> >> and later in unmask_threaded_irq():
> >>
> >> 	unmask_irq(desc);
> >>
> >> I also mentioned that in patch #3 description:
> >> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >> prevent the EOI from re-asserting an unacknowledged interrupt.
> > 
> > That makes sense. Can you include the full story in cover letter too?
> 
> Ok, I will.
> 
> > 
> > 
> >> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >> check that the interrupt is still masked in the guest at the moment of EOI.
> >> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >> and queued for injection to the guest."

Sorry to barge in pretty late in the conversation (just been Cc'd on
this), but why shouldn't the resamplefd be notified? If there has been
an EOI, a new level must be made visible to the guest interrupt
controller, no matter what the state of the interrupt masking is.

Whether this new level is actually *presented* to a vCPU is another
matter entirely, and is arguably a problem for the interrupt
controller emulation.

For example on arm64, we expect to be able to read the pending state
of an interrupt from the guest irrespective of the masking state of
that interrupt. Any change to the interrupt flow should preserve this.

Thankfully, we don't have the polarity issue (there is no such thing
in the GIC architecture) and we only deal with pending/not-pending.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 1 day, 7 hours ago
Hi Marc,

On 8/10/22 8:51 AM, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 00:30:29 +0100,
> Dmytro Maluka <dmy@semihalf.com> wrote:
>>
>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Dmytro Maluka <dmy@semihalf.com>
>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
>>>> kvm@vger.kernel.org
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>
>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>
>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>> unacknowledged IRQ by the host.
>>>>>
>>>>> Interesting...  How it behaviors in native side?
>>>>
>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>> completes.
>>>>
>>>> In handle_fasteoi_irq():
>>>>
>>>> 	if (desc->istate & IRQS_ONESHOT)
>>>> 		mask_irq(desc);
>>>>
>>>> 	handle_irq_event(desc);
>>>>
>>>> 	cond_unmask_eoi_irq(desc, chip);
>>>>
>>>>
>>>> and later in unmask_threaded_irq():
>>>>
>>>> 	unmask_irq(desc);
>>>>
>>>> I also mentioned that in patch #3 description:
>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>
>>> That makes sense. Can you include the full story in cover letter too?
>>
>> Ok, I will.
>>
>>>
>>>
>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>> and queued for injection to the guest."
> 
> Sorry to barge in pretty late in the conversation (just been Cc'd on
> this), but why shouldn't the resamplefd be notified? If there has been
> an EOI, a new level must be made visible to the guest interrupt
> controller, no matter what the state of the interrupt masking is.
> 
> Whether this new level is actually *presented* to a vCPU is another
> matter entirely, and is arguably a problem for the interrupt
> controller emulation.
> 
> For example on arm64, we expect to be able to read the pending state
> of an interrupt from the guest irrespective of the masking state of
> that interrupt. Any change to the interrupt flow should preserve this.

I'd like to understand the problem better, so could you please give some
examples of cases where it is required/useful/desirable to read the
correct pending state of a guest interrupt?

> 
> Thankfully, we don't have the polarity issue (there is no such thing
> in the GIC architecture) and we only deal with pending/not-pending.
> 
> Thanks,
> 
> 	M.
>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Marc Zyngier 12 hours ago
On Wed, 10 Aug 2022 18:06:53 +0100,
Dmytro Maluka <dmy@semihalf.com> wrote:
> 
> Hi Marc,
> 
> On 8/10/22 8:51 AM, Marc Zyngier wrote:
> > On Wed, 10 Aug 2022 00:30:29 +0100,
> > Dmytro Maluka <dmy@semihalf.com> wrote:
> >>
> >> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Dmytro Maluka <dmy@semihalf.com>
> >>>> Sent: Tuesday, August 9, 2022 12:24 AM
> >>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> >>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
> >>>> kvm@vger.kernel.org
> >>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> >>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> >>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> >>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
> >>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
> >>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
> >>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
> >>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
> >>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>>>
> >>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>>>
> >>>>>> The existing KVM mechanism for forwarding of level-triggered
> >>>>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>>>> existing KVM code doesn't take that into account, which results in
> >>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >>>> unacknowledged IRQ by the host.
> >>>>>
> >>>>> Interesting...  How it behaviors in native side?
> >>>>
> >>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >>>> completes.
> >>>>
> >>>> In handle_fasteoi_irq():
> >>>>
> >>>> 	if (desc->istate & IRQS_ONESHOT)
> >>>> 		mask_irq(desc);
> >>>>
> >>>> 	handle_irq_event(desc);
> >>>>
> >>>> 	cond_unmask_eoi_irq(desc, chip);
> >>>>
> >>>>
> >>>> and later in unmask_threaded_irq():
> >>>>
> >>>> 	unmask_irq(desc);
> >>>>
> >>>> I also mentioned that in patch #3 description:
> >>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >>>> prevent the EOI from re-asserting an unacknowledged interrupt.
> >>>
> >>> That makes sense. Can you include the full story in cover letter too?
> >>
> >> Ok, I will.
> >>
> >>>
> >>>
> >>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >>>> check that the interrupt is still masked in the guest at the moment of EOI.
> >>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >>>> and queued for injection to the guest."
> > 
> > Sorry to barge in pretty late in the conversation (just been Cc'd on
> > this), but why shouldn't the resamplefd be notified? If there has been
> > an EOI, a new level must be made visible to the guest interrupt
> > controller, no matter what the state of the interrupt masking is.
> > 
> > Whether this new level is actually *presented* to a vCPU is another
> > matter entirely, and is arguably a problem for the interrupt
> > controller emulation.
> > 
> > For example on arm64, we expect to be able to read the pending state
> > of an interrupt from the guest irrespective of the masking state of
> > that interrupt. Any change to the interrupt flow should preserve this.
> 
> I'd like to understand the problem better, so could you please give some
> examples of cases where it is required/useful/desirable to read the
> correct pending state of a guest interrupt?

I'm not sure I understand the question. It is *always* desirable to
present the correct information to the guest.

For example, a guest could periodically poll the pending interrupt
registers and only enable interrupts that are pending. Is it a good
idea? No. Is it expected to work? Absolutely.

And yes, we go out of our way to make sure these things actually work,
because one day or another, you'll find a guest that does exactly
that.
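
A sketch of the guest behaviour described above, assuming a made-up
interrupt controller with one pending bitmap and one enable bitmap.
struct toy_ic and the toy_ic_* helpers are hypothetical and are not
GIC register names; the point is only that the pending read does not
depend on the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical controller state: one bit per interrupt line. */
struct toy_ic {
	uint32_t pending; /* lines whose level is currently asserted */
	uint32_t enabled; /* lines the guest has unmasked */
};

/* The source changes a line level; this is recorded even if masked. */
static void toy_ic_set_level(struct toy_ic *ic, unsigned int line, int level)
{
	assert(line < 32);
	if (level)
		ic->pending |= UINT32_C(1) << line;
	else
		ic->pending &= ~(UINT32_C(1) << line);
}

/* Guest-visible read of the pending state, independent of masking. */
static uint32_t toy_ic_read_pending(const struct toy_ic *ic)
{
	return ic->pending;
}

/* Only lines that are pending *and* enabled are presented to the vCPU. */
static uint32_t toy_ic_deliverable(const struct toy_ic *ic)
{
	return ic->pending & ic->enabled;
}

/* The polling guest: enable exactly the lines that are pending. */
static void toy_ic_guest_poll(struct toy_ic *ic)
{
	ic->enabled = toy_ic_read_pending(ic);
}
```

If the pending bit were hidden while the line is masked, the polling
guest in this model would never enable the line at all.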

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 11 hours ago
On 8/11/22 14:35, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 18:06:53 +0100,
> Dmytro Maluka <dmy@semihalf.com> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 8:51 AM, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>> Dmytro Maluka <dmy@semihalf.com> wrote:
>>>>
>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dmytro Maluka <dmy@semihalf.com>
>>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
>>>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
>>>>>> kvm@vger.kernel.org
>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
>>>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
>>>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
>>>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
>>>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
>>>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
>>>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
>>>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>
>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>>
>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>> unacknowledged IRQ by the host.
>>>>>>>
>>>>>>> Interesting...  How it behaviors in native side?
>>>>>>
>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>> completes.
>>>>>>
>>>>>> In handle_fasteoi_irq():
>>>>>>
>>>>>> 	if (desc->istate & IRQS_ONESHOT)
>>>>>> 		mask_irq(desc);
>>>>>>
>>>>>> 	handle_irq_event(desc);
>>>>>>
>>>>>> 	cond_unmask_eoi_irq(desc, chip);
>>>>>>
>>>>>>
>>>>>> and later in unmask_threaded_irq():
>>>>>>
>>>>>> 	unmask_irq(desc);
>>>>>>
>>>>>> I also mentioned that in patch #3 description:
>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>>
>>>>> That makes sense. Can you include the full story in cover letter too?
>>>>
>>>> Ok, I will.
>>>>
>>>>>
>>>>>
>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>> and queued for injection to the guest."
>>>
>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>> this), but why shouldn't the resamplefd be notified? If there has been
>>> an EOI, a new level must be made visible to the guest interrupt
>>> controller, no matter what the state of the interrupt masking is.
>>>
>>> Whether this new level is actually *presented* to a vCPU is another
>>> matter entirely, and is arguably a problem for the interrupt
>>> controller emulation.
>>>
>>> For example on arm64, we expect to be able to read the pending state
>>> of an interrupt from the guest irrespective of the masking state of
>>> that interrupt. Any change to the interrupt flow should preserve this.
>>
>> I'd like to understand the problem better, so could you please give some
>> examples of cases where it is required/useful/desirable to read the
>> correct pending state of a guest interrupt?
> 
> I'm not sure I understand the question. It is *always* desirable to
> present the correct information to the guest.
> 
> For example, a guest could periodically poll the pending interrupt
> registers and only enable interrupts that are pending. Is it a good
> idea? No. Is it expected to work? Absolutely.
> 
> And yes, we go out of our way to make sure these things actually work,
> because one day or another, you'll find a guest that does exactly
> that.

Ah indeed, thanks. Somehow I was thinking only about using this
information internally in KVM or perhaps presenting it to the host
userspace via some ioctl. Whereas indeed, the guest itself may well read
those registers and rely on this information.

> 
> Thanks,
> 
> 	M.
>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Eric Auger 1 day, 16 hours ago
Hi Marc,

On 8/10/22 08:51, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 00:30:29 +0100,
> Dmytro Maluka <dmy@semihalf.com> wrote:
>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>
>>>> -----Original Message-----
>>>> From: Dmytro Maluka <dmy@semihalf.com>
>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
>>>> kvm@vger.kernel.org
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>
>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>> unacknowledged IRQ by the host.
>>>>> Interesting...  How it behaviors in native side?
>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>> completes.
>>>>
>>>> In handle_fasteoi_irq():
>>>>
>>>> 	if (desc->istate & IRQS_ONESHOT)
>>>> 		mask_irq(desc);
>>>>
>>>> 	handle_irq_event(desc);
>>>>
>>>> 	cond_unmask_eoi_irq(desc, chip);
>>>>
>>>>
>>>> and later in unmask_threaded_irq():
>>>>
>>>> 	unmask_irq(desc);
>>>>
>>>> I also mentioned that in patch #3 description:
>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>> That makes sense. Can you include the full story in cover letter too?
>> Ok, I will.
>>
>>>
>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>> and queued for injection to the guest."
> Sorry to barge in pretty late in the conversation (just been Cc'd on
> this), but why shouldn't the resamplefd be notified? If there has been
yeah sorry to get you involved here ;-)
> an EOI, a new level must be made visible to the guest interrupt
> controller, no matter what the state of the interrupt masking is.
>
> Whether this new level is actually *presented* to a vCPU is another
> matter entirely, and is arguably a problem for the interrupt
> controller emulation.

FWIU on guest EOI the physical line is still asserted so the pIRQ is
immediately re-sampled by the interrupt controller (because the
resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
(although it is masked at guest level). When the guest actually unmasks
the vIRQ we do not get a chance to re-evaluate the physical line level.

When running natively, the physical line is still asserted when the EOI
is sent, but the IRQ is masked. By the time the IRQ is unmasked, the
line has been de-asserted.
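
The two sequences above can be compared with a small userspace model.
This is a simplified sketch under my own assumptions (a single latched
pending bit, no Remote IRR or in-service tracking); enum toy_flow_mode
and the toy_flow_* helpers are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_flow_mode {
	TOY_FLOW_NATIVE,          /* bare metal */
	TOY_FLOW_RESAMPLE_AT_EOI, /* resamplefd notified on every EOI */
};

struct toy_flow {
	bool line;         /* physical line from the device */
	bool guest_masked; /* mask bit in the guest's interrupt controller */
	bool pending;      /* latched in the (virtual) interrupt controller */
	int  delivered;    /* interrupts actually taken by the guest */
};

/* Sample the physical line and latch it if asserted. */
static void toy_flow_sample(struct toy_flow *s)
{
	if (s->line)
		s->pending = true;
}

/* Deliver a latched interrupt to the guest unless it is masked. */
static void toy_flow_deliver(struct toy_flow *s)
{
	if (s->pending && !s->guest_masked) {
		s->pending = false;
		s->delivered++;
	}
}

/* Run one oneshot interrupt; return how many interrupts the guest took. */
static int toy_flow_oneshot(enum toy_flow_mode mode)
{
	struct toy_flow s = { 0 };

	/* Device asserts; the first interrupt is latched and delivered. */
	s.line = true;
	toy_flow_sample(&s);
	toy_flow_deliver(&s);
	assert(s.delivered == 1);

	/* hardirq: the oneshot interrupt is masked, then EOI is sent. */
	s.guest_masked = true;
	if (mode == TOY_FLOW_RESAMPLE_AT_EOI)
		toy_flow_sample(&s); /* host IRQ unmasked, line still high */
	toy_flow_deliver(&s);    /* masked, so nothing is taken yet */

	/* Threaded handler services the device; the line drops. */
	s.line = false;

	/* Guest unmasks. Natively the live (now low) level is looked at. */
	s.guest_masked = false;
	if (mode == TOY_FLOW_NATIVE)
		toy_flow_sample(&s);
	toy_flow_deliver(&s);

	return s.delivered;
}
```

In this model the native flow yields one interrupt, while the
resample-at-EOI flow yields two, the second one being the stale latch
that is delivered at unmask time.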

Thanks

Eric
>
> For example on arm64, we expect to be able to read the pending state
> of an interrupt from the guest irrespective of the masking state of
> that interrupt. Any change to the interrupt flow should preserve this.
>
> Thankfully, we don't have the polarity issue (there is no such thing
> in the GIC architecture) and we only deal with pending/not-pending.
>
> Thanks,
>
> 	M.
>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Marc Zyngier 1 day, 11 hours ago
On Wed, 10 Aug 2022 09:12:18 +0100,
Eric Auger <eric.auger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 8/10/22 08:51, Marc Zyngier wrote:
> > On Wed, 10 Aug 2022 00:30:29 +0100,
> > Dmytro Maluka <dmy@semihalf.com> wrote:
> >> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Dmytro Maluka <dmy@semihalf.com>
> >>>> Sent: Tuesday, August 9, 2022 12:24 AM
> >>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> >>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
> >>>> kvm@vger.kernel.org
> >>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> >>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> >>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> >>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
> >>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
> >>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
> >>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
> >>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
> >>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>>>
> >>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>>> The existing KVM mechanism for forwarding of level-triggered
> >>>>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>>>> existing KVM code doesn't take that into account, which results in
> >>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >>>> unacknowledged IRQ by the host.
> >>>>> Interesting...  How it behaviors in native side?
> >>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >>>> completes.
> >>>>
> >>>> In handle_fasteoi_irq():
> >>>>
> >>>> 	if (desc->istate & IRQS_ONESHOT)
> >>>> 		mask_irq(desc);
> >>>>
> >>>> 	handle_irq_event(desc);
> >>>>
> >>>> 	cond_unmask_eoi_irq(desc, chip);
> >>>>
> >>>>
> >>>> and later in unmask_threaded_irq():
> >>>>
> >>>> 	unmask_irq(desc);
> >>>>
> >>>> I also mentioned that in patch #3 description:
> >>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >>>> prevent the EOI from re-asserting an unacknowledged interrupt.
> >>> That makes sense. Can you include the full story in cover letter too?
> >> Ok, I will.
> >>
> >>>
> >>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >>>> check that the interrupt is still masked in the guest at the moment of EOI.
> >>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >>>> and queued for injection to the guest."
> > Sorry to barge in pretty late in the conversation (just been Cc'd on
> > this), but why shouldn't the resamplefd be notified? If there has been
> yeah sorry to get you involved here ;-)

No problem!

> > an EOI, a new level must be made visible to the guest interrupt
> > controller, no matter what the state of the interrupt masking is.
> >
> > Whether this new level is actually *presented* to a vCPU is another
> > matter entirely, and is arguably a problem for the interrupt
> > controller emulation.
> 
> FWIU on guest EOI the physical line is still asserted so the pIRQ is
> immediately re-sampled by the interrupt controller (because the
> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
> (although it is masked at guest level). When the guest actually unmasks
> the vIRQ we do not get a chance to re-evaluate the physical line level.

Indeed, and maybe this is what should be fixed instead of moving the
resampling point around (I was suggesting something along these lines
in [1]).

We already do this on arm64 for the timer, and it should be easy
enough to generalise to any interrupt backed by the GIC (there is an
in-kernel API to sample the pending state). No idea how that translates
to other architectures though.
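
One possible reading of this, as a userspace sketch only: on guest
unmask, the latched virtual pending bit is refreshed from the live
physical level instead of being trusted as is. struct toy_virq and
the toy_virq_* names are hypothetical; the sampler callback stands in
for whatever in-kernel call would be used (irq_get_irqchip_state()
with IRQCHIP_STATE_PENDING looks like a candidate, but that is an
assumption, not something stated here).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "sample the pending state of the host IRQ". */
typedef bool (*toy_virq_sample_fn)(void *opaque);

struct toy_virq {
	bool masked;               /* guest-side mask bit */
	bool pending;              /* latched virtual pending state */
	toy_virq_sample_fn sample; /* reads the live physical level */
	void *opaque;              /* argument passed to the sampler */
};

/*
 * Guest unmask: refresh the latch from the live level before deciding
 * whether to inject. Returns true if an interrupt should be injected.
 */
static bool toy_virq_unmask(struct toy_virq *v)
{
	assert(v->sample);
	v->masked = false;
	v->pending = v->sample(v->opaque);
	return v->pending;
}

/* Example sampler backed by a plain bool "physical line". */
static bool toy_virq_line_sampler(void *opaque)
{
	return *(bool *)opaque;
}
```

With a stale pending latch and a line that has already dropped,
toy_virq_unmask() clears the latch and returns false, so no extra
interrupt is injected in this model.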

	M.

[1] https://lore.kernel.org/r/87mtccbie4.wl-maz@kernel.org

-- 
Without deviation from the norm, progress is not possible.
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 1 day, 7 hours ago
Hi Marc,

On 8/10/22 3:01 PM, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 09:12:18 +0100,
> Eric Auger <eric.auger@redhat.com> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 08:51, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>> Dmytro Maluka <dmy@semihalf.com> wrote:
>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dmytro Maluka <dmy@semihalf.com>
>>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
>>>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
>>>>>> kvm@vger.kernel.org
>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
>>>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
>>>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
>>>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
>>>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
>>>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
>>>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
>>>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>
>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>> unacknowledged IRQ by the host.
>>>>>>> Interesting... How does it behave on the native side?
>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>> completes.
>>>>>>
>>>>>> In handle_fasteoi_irq():
>>>>>>
>>>>>> 	if (desc->istate & IRQS_ONESHOT)
>>>>>> 		mask_irq(desc);
>>>>>>
>>>>>> 	handle_irq_event(desc);
>>>>>>
>>>>>> 	cond_unmask_eoi_irq(desc, chip);
>>>>>>
>>>>>>
>>>>>> and later in unmask_threaded_irq():
>>>>>>
>>>>>> 	unmask_irq(desc);
>>>>>>
>>>>>> I also mentioned that in patch #3 description:
>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>> That makes sense. Can you include the full story in cover letter too?
>>>> Ok, I will.
>>>>
>>>>>
>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>> and queued for injection to the guest."
>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>> this), but why shouldn't the resamplefd be notified? If there has been
>> yeah sorry to get you involved here ;-)
> 
> No problem!
> 
>>> an EOI, a new level must be made visible to the guest interrupt
>>> controller, no matter what the state of the interrupt masking is.
>>>
>>> Whether this new level is actually *presented* to a vCPU is another
>>> matter entirely, and is arguably a problem for the interrupt
>>> controller emulation.
>>
>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>> immediately re-sampled by the interrupt controller (because the
>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>> (although it is masked at guest level). When the guest actually unmasks
>> the vIRQ we do not get a chance to re-evaluate the physical line level.
> 
> Indeed, and maybe this is what should be fixed instead of moving the
> resampling point around (I was suggesting something along these lines
> in [1]).
> 
> We already do this on arm64 for the timer, and it should be easy
> enough to generalise to any interrupt backed by the GIC (there is an
> in-kernel API to sample the pending state). No idea how that translates
> to other architectures though.

Actually I'm now thinking about changing the behavior implemented in my
patchset, which is:

    1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
       that no new physical IRQ is generated, and the vIRQ is not set as
       pending.

    2. After this vIRQ is unmasked by the guest, notify resamplefd.

to the following one:

    1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
       but also remember this vIRQ as, let's call it, "pending oneshot".

    2. A new physical IRQ is immediately generated, so the vIRQ is
       properly set as pending.

    3. After the vIRQ is unmasked by the guest, check and find out that
       it is not just pending but also "pending oneshot", so don't
       deliver it to a vCPU. Instead, immediately notify resamplefd once
       again.

In other words, don't avoid extra physical interrupts in the host
(rather, use those extra interrupts for properly updating the pending
state of the vIRQ) but avoid propagating those extra interrupts to the
guest.
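
As a rough illustration of the three steps above, here is a small
user-space model of a single level-triggered vIRQ. It is only a sketch:
every name in it (virq_model, model_guest_eoi() and so on) is
hypothetical and does not exist in KVM or vfio, and it assumes that the
resamplefd listener re-fires the physical IRQ at once whenever the
device still holds the line asserted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one level-triggered vIRQ; not KVM code. */
struct virq_model {
	bool line_asserted;   /* device still holds the physical line high */
	bool masked;          /* guest has the vIRQ masked */
	bool pending;         /* vIRQ pending in the emulated irqchip */
	bool in_service;      /* delivered to a vCPU, not yet EOI'd */
	bool pending_oneshot; /* pending was set by a resample while masked */
	int resample_count;   /* resamplefd notifications sent so far */
	int delivered_count;  /* interrupts presented to a vCPU so far */
};

/* Present the vIRQ to a vCPU if it is pending, unmasked and not in service. */
static void model_try_deliver(struct virq_model *v)
{
	if (v->pending && !v->masked && !v->in_service) {
		v->in_service = true;
		v->delivered_count++;
	}
}

/* Assumed listener behaviour: the physical IRQ is unmasked and fires
 * again immediately if the device has not been acked yet. */
static void model_notify_resamplefd(struct virq_model *v)
{
	v->resample_count++;
	if (v->line_asserted)
		v->pending = true;
}

static void model_device_assert(struct virq_model *v)
{
	v->line_asserted = true;
	v->pending = true;
	model_try_deliver(v);
}

/* The guest's threaded handler acks the device, which drops the line. */
static void model_device_ack(struct virq_model *v)
{
	v->line_asserted = false;
}

static void model_guest_mask(struct virq_model *v)
{
	v->masked = true;
}

/* Steps 1 and 2: notify resamplefd as usual, but remember "pending oneshot"
 * if the vIRQ is masked at the time of the vEOI. */
static void model_guest_eoi(struct virq_model *v)
{
	assert(v->in_service);
	v->in_service = false;
	v->pending = false;
	v->pending_oneshot = v->masked;
	model_notify_resamplefd(v);
	model_try_deliver(v);
}

/* Step 3: on unmask, a "pending oneshot" vIRQ is not delivered; the
 * pending state is dropped and resamplefd is notified once again. */
static void model_guest_unmask(struct virq_model *v)
{
	v->masked = false;
	if (v->pending && v->pending_oneshot) {
		v->pending = false;
		model_notify_resamplefd(v);
	}
	v->pending_oneshot = false;
	model_try_deliver(v);
}
```

In this model, a sequence of assert, mask, vEOI, device ack and unmask
ends with exactly one interrupt delivered to the guest and two
resamplefd notifications. If the device asserted the line again before
the unmask, the second notification would set the vIRQ pending and it
would be delivered normally.
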

Does this sound reasonable to you?

Your suggestion to sample the pending state of the physical IRQ sounds
interesting too. But as you said, it's yet to be checked how feasible it
would be on architectures other than arm64. Also it assumes that the IRQ
in question is a forwarded physical interrupt, while I can imagine that
KVM's resamplefd could in principle also be useful for implementing
purely emulated interrupts.

Do you see any advantages of sampling the physical IRQ pending state
over remembering the "pending oneshot" state as described above?

> 
> 	M.
> 
> [1] https://lore.kernel.org/r/87mtccbie4.wl-maz@kernel.org
>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Marc Zyngier 12 hours ago
Hi Dmytro,

On Wed, 10 Aug 2022 18:02:29 +0100,
Dmytro Maluka <dmy@semihalf.com> wrote:
> 
> Hi Marc,
> 
> On 8/10/22 3:01 PM, Marc Zyngier wrote:
> > On Wed, 10 Aug 2022 09:12:18 +0100,
> > Eric Auger <eric.auger@redhat.com> wrote:
> >>
> >> Hi Marc,
> >>
> >> On 8/10/22 08:51, Marc Zyngier wrote:
> >>> On Wed, 10 Aug 2022 00:30:29 +0100,
> >>> Dmytro Maluka <dmy@semihalf.com> wrote:
> >>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dmytro Maluka <dmy@semihalf.com>
> >>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
> >>>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> >>>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
> >>>>>> kvm@vger.kernel.org
> >>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> >>>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> >>>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> >>>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
> >>>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
> >>>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
> >>>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
> >>>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
> >>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>>>>>
> >>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>>>>> The existing KVM mechanism for forwarding of level-triggered
> >>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>>>>>> existing KVM code doesn't take that into account, which results in
> >>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >>>>>> unacknowledged IRQ by the host.
> >>>>>>> Interesting... How does it behave on the native side?
> >>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >>>>>> completes.
> >>>>>>
> >>>>>> In handle_fasteoi_irq():
> >>>>>>
> >>>>>> 	if (desc->istate & IRQS_ONESHOT)
> >>>>>> 		mask_irq(desc);
> >>>>>>
> >>>>>> 	handle_irq_event(desc);
> >>>>>>
> >>>>>> 	cond_unmask_eoi_irq(desc, chip);
> >>>>>>
> >>>>>>
> >>>>>> and later in unmask_threaded_irq():
> >>>>>>
> >>>>>> 	unmask_irq(desc);
> >>>>>>
> >>>>>> I also mentioned that in patch #3 description:
> >>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
> >>>>> That makes sense. Can you include the full story in cover letter too?
> >>>> Ok, I will.
> >>>>
> >>>>>
> >>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
> >>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >>>>>> and queued for injection to the guest."
> >>> Sorry to barge in pretty late in the conversation (just been Cc'd on
> >>> this), but why shouldn't the resamplefd be notified? If there has been
> >> yeah sorry to get you involved here ;-)
> > 
> > No problem!
> > 
> >>> an EOI, a new level must be made visible to the guest interrupt
> >>> controller, no matter what the state of the interrupt masking is.
> >>>
> >>> Whether this new level is actually *presented* to a vCPU is another
> >>> matter entirely, and is arguably a problem for the interrupt
> >>> controller emulation.
> >>
> >> FWIU on guest EOI the physical line is still asserted so the pIRQ is
> >> immediately re-sampled by the interrupt controller (because the
> >> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
> >> (although it is masked at guest level). When the guest actually unmasks
> >> the vIRQ we do not get a chance to re-evaluate the physical line level.
> > 
> > Indeed, and maybe this is what should be fixed instead of moving the
> > resampling point around (I was suggesting something along these lines
> > in [1]).
> > 
> > We already do this on arm64 for the timer, and it should be easy
> > enough to generalise to any interrupt backed by the GIC (there is an
> > in-kernel API to sample the pending state). No idea how that translates
> > to other architectures though.
> 
> Actually I'm now thinking about changing the behavior implemented in my
> patchset, which is:
> 
>     1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
>        that no new physical IRQ is generated, and the vIRQ is not set as
>        pending.
> 
>     2. After this vIRQ is unmasked by the guest, notify resamplefd.
> 
> to the following one:
> 
>     1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>        but also remember this vIRQ as, let's call it, "pending oneshot".
> 
>     2. A new physical IRQ is immediately generated, so the vIRQ is
>        properly set as pending.
> 
>     3. After the vIRQ is unmasked by the guest, check and find out that
>        it is not just pending but also "pending oneshot", so don't
>        deliver it to a vCPU. Instead, immediately notify resamplefd once
>        again.
> 
> In other words, don't avoid extra physical interrupts in the host
> (rather, use those extra interrupts for properly updating the pending
> state of the vIRQ) but avoid propagating those extra interrupts to the
> guest.
> 
> Does this sound reasonable to you?

It does. I'm a bit concerned about the extra state (more state, more
problems...), but let's see the implementation.

> Your suggestion to sample the pending state of the physical IRQ sounds
> interesting too. But as you said, it's yet to be checked how feasible it
> would be on architectures other than arm64. Also it assumes that the IRQ
> in question is a forwarded physical interrupt, while I can imagine that
> KVM's resamplefd could in principle also be useful for implementing
> purely emulated interrupts.

No, there is no requirement for this being a forwarded interrupt. The
vgic code does that for forwarded interrupts, but the core code could
do that too if the information is available (irq_get_irqchip_state()
was introduced for this exact purpose).

> Do you see any advantages of sampling the physical IRQ pending state
> over remembering the "pending oneshot" state as described above?

The advantage is to not maintain some extra state, as this is usually
a source of problems, but to get to the source (the HW pending state).

It also solves the "pending in the vgic but not pending in the HW"
problem, as reading the pending state causes an exit (the register is
emulated), and as part of the exit handling we already perform the
resample. We just need to extend this to check the HW state, and
correct the pending state if required, making sure that the emulation
will return an accurate view.
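
A minimal user-space sketch of that idea, under stated assumptions: the
types and functions below (hw_line, virq_state, model_resample() and
friends) are made up for illustration and are not vgic code, and
model_get_hw_pending() merely stands in for a query such as
irq_get_irqchip_state() without reproducing its signature. The point it
tries to show is that no extra state is kept; the virtual pending bit is
simply corrected from the HW state whenever the guest traps on the
emulated register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the physical irqchip's view of one interrupt line. */
struct hw_line {
	bool pending;
};

/* Hypothetical per-vIRQ state; hw is NULL for a purely emulated source. */
struct virq_state {
	bool pending;
	const struct hw_line *hw;
};

/* Stand-in for a HW pending-state query. Returns 0 on success, or -1
 * when no HW information is available for this interrupt. */
static int model_get_hw_pending(const struct hw_line *hw, bool *state)
{
	assert(state != NULL);
	if (hw == NULL)
		return -1;
	*state = hw->pending;
	return 0;
}

/* Resample: overwrite the virtual pending bit with the HW one.
 * Returns true if the virtual state had to be corrected. */
static bool model_resample(struct virq_state *v)
{
	bool hw_pending;

	if (model_get_hw_pending(v->hw, &hw_pending) != 0)
		return false; /* nothing to sample, keep the software state */
	if (v->pending == hw_pending)
		return false;
	v->pending = hw_pending;
	return true;
}

/* Emulated read of the pending register: the guest access exits to the
 * hypervisor, which resamples first and then reports the state. */
static bool model_read_pending_reg(struct virq_state *v)
{
	model_resample(v);
	return v->pending;
}
```

With this shape, an interrupt that is still marked pending in the
emulation after the device has been acked reads back as not pending,
while an interrupt with no HW backing keeps whatever state the software
recorded.
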

	M.

-- 
Without deviation from the norm, progress is not possible.
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Dmytro Maluka 10 hours ago
Hi Marc,

On 8/11/22 14:21, Marc Zyngier wrote:
> Hi Dmytro,
> 
> On Wed, 10 Aug 2022 18:02:29 +0100,
> Dmytro Maluka <dmy@semihalf.com> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 3:01 PM, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 09:12:18 +0100,
>>> Eric Auger <eric.auger@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 8/10/22 08:51, Marc Zyngier wrote:
>>>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>>>> Dmytro Maluka <dmy@semihalf.com> wrote:
>>>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Dmytro Maluka <dmy@semihalf.com>
>>>>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>>>>>> To: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
>>>>>>>> <seanjc@google.com>; Paolo Bonzini <pbonzini@redhat.com>;
>>>>>>>> kvm@vger.kernel.org
>>>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
>>>>>>>> Borislav Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
>>>>>>>> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
>>>>>>>> kernel@vger.kernel.org; Eric Auger <eric.auger@redhat.com>; Alex
>>>>>>>> Williamson <alex.williamson@redhat.com>; Liu, Rong L <rong.l.liu@intel.com>;
>>>>>>>> Zhenyu Wang <zhenyuw@linux.intel.com>; Tomasz Nowicki
>>>>>>>> <tn@semihalf.com>; Grzegorz Jaszczyk <jaz@semihalf.com>;
>>>>>>>> upstream@semihalf.com; Dmitry Torokhov <dtor@google.com>
>>>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>>>
>>>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>>>> unacknowledged IRQ by the host.
>>>>>>>>> Interesting... How does it behave on the native side?
>>>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>>>> completes.
>>>>>>>>
>>>>>>>> In handle_fasteoi_irq():
>>>>>>>>
>>>>>>>> 	if (desc->istate & IRQS_ONESHOT)
>>>>>>>> 		mask_irq(desc);
>>>>>>>>
>>>>>>>> 	handle_irq_event(desc);
>>>>>>>>
>>>>>>>> 	cond_unmask_eoi_irq(desc, chip);
>>>>>>>>
>>>>>>>>
>>>>>>>> and later in unmask_threaded_irq():
>>>>>>>>
>>>>>>>> 	unmask_irq(desc);
>>>>>>>>
>>>>>>>> I also mentioned that in patch #3 description:
>>>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>>>> That makes sense. Can you include the full story in cover letter too?
>>>>>> Ok, I will.
>>>>>>
>>>>>>>
>>>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>>>> and queued for injection to the guest."
>>>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>>>> this), but why shouldn't the resamplefd be notified? If there has been
>>>> yeah sorry to get you involved here ;-)
>>>
>>> No problem!
>>>
>>>>> an EOI, a new level must be made visible to the guest interrupt
>>>>> controller, no matter what the state of the interrupt masking is.
>>>>>
>>>>> Whether this new level is actually *presented* to a vCPU is another
>>>>> matter entirely, and is arguably a problem for the interrupt
>>>>> controller emulation.
>>>>
>>>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>>>> immediately re-sampled by the interrupt controller (because the
>>>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>>>> (although it is masked at guest level). When the guest actually unmasks
>>>> the vIRQ we do not get a chance to re-evaluate the physical line level.
>>>
>>> Indeed, and maybe this is what should be fixed instead of moving the
>>> resampling point around (I was suggesting something along these lines
>>> in [1]).
>>>
>>> We already do this on arm64 for the timer, and it should be easy
>>> enough to generalise to any interrupt backed by the GIC (there is an
>>> in-kernel API to sample the pending state). No idea how that translates
>>> to other architectures though.
>>
>> Actually I'm now thinking about changing the behavior implemented in my
>> patchset, which is:
>>
>>     1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
>>        that no new physical IRQ is generated, and the vIRQ is not set as
>>        pending.
>>
>>     2. After this vIRQ is unmasked by the guest, notify resamplefd.
>>
>> to the following one:
>>
>>     1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>>        but also remember this vIRQ as, let's call it, "pending oneshot".
>>
>>     2. A new physical IRQ is immediately generated, so the vIRQ is
>>        properly set as pending.
>>
>>     3. After the vIRQ is unmasked by the guest, check and find out that
>>        it is not just pending but also "pending oneshot", so don't
>>        deliver it to a vCPU. Instead, immediately notify resamplefd once
>>        again.
>>
>> In other words, don't avoid extra physical interrupts in the host
>> (rather, use those extra interrupts for properly updating the pending
>> state of the vIRQ) but avoid propagating those extra interrupts to the
>> guest.
>>
>> Does this sound reasonable to you?
> 
> It does. I'm a bit concerned about the extra state (more state, more
> problems...), but let's see the implementation.
> 
>> Your suggestion to sample the pending state of the physical IRQ sounds
>> interesting too. But as you said, it's yet to be checked how feasible it
>> would be on architectures other than arm64. Also it assumes that the IRQ
>> in question is a forwarded physical interrupt, while I can imagine that
>> KVM's resamplefd could in principle also be useful for implementing
>> purely emulated interrupts.
> 
> No, there is no requirement for this being a forwarded interrupt. The
> vgic code does that for forwarded interrupts, but the core code could
> do that too if the information is available (irq_get_irqchip_state()
> was introduced for this exact purpose).

I meant "forwarding" in a generic sense, not vgic specific. I.e. the
forwarding itself may be done generically by software, e.g. by vfio, but
the source is in any case a physical HW interrupt.

Whereas I have in mind also cases where an irqfd user injects purely
virtual interrupts, not coming from HW. I don't know of any particular
use case for that, but irqfd doesn't seem to prohibit such use cases. So I
was thinking that maybe it's better to keep it this way, i.e. not depend
on reading physical HW state in KVM. Or am I trying to be too generic here?

> 
>> Do you see any advantages of sampling the physical IRQ pending state
>> over remembering the "pending oneshot" state as described above?
> 
> The advantage is to not maintain some extra state, as this is usually
> a source of problem, but to get to the source (the HW pending state).
> 
> It also solves the "pending in the vgic but not pending in the HW"
> problem, as reading the pending state causes an exit (the register is
> emulated), and as part of the exit handling we already perform the
> resample. We just need to extend this to check the HW state, and
> correct the pending state if required, making sure that the emulation
> will return an accurate view.

BTW, it seems that besides this "pending in the guest but not in the
host" issue, we also already have an opposite issue ("pending in the
host but not in the guest"): upon the guest EOI, we unconditionally
deassert the vIRQ in irqfd_resampler_ack() before notifying resamplefd,
even if the pIRQ is still asserted. So there is a time window (before
the new pIRQ trigger event makes it to KVM) when the IRQ may be pending
in the host but not in the guest.

Am I right that this is actually an issue, and that sampling the
physical state could help with this issue too?
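
To make the window described above concrete, here is a tiny user-space
model. It is an illustration only: ack_model and the model_*()
functions are hypothetical names, and the "trigger in flight" flag is
an assumed simplification of the path a new pIRQ trigger event takes
before it reaches KVM.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the ack path for one resampled IRQ. */
struct ack_model {
	bool pirq_line;         /* physical line still asserted by the device */
	bool virq_level;        /* level currently seen by the guest irqchip */
	bool trigger_in_flight; /* new pIRQ event raised but not yet in KVM */
};

/* Models the behaviour described for the guest EOI: the vIRQ is
 * de-asserted unconditionally, then resamplefd is notified. If the line
 * is still high, a new trigger is raised but has not arrived yet. */
static void model_resampler_ack(struct ack_model *m)
{
	assert(m != 0);
	m->virq_level = false;
	if (m->pirq_line)
		m->trigger_in_flight = true;
}

/* The new trigger event finally reaches KVM and re-asserts the vIRQ. */
static void model_trigger_arrives(struct ack_model *m)
{
	if (m->trigger_in_flight) {
		m->virq_level = true;
		m->trigger_in_flight = false;
	}
}

/* True while the IRQ is pending in the host but not in the guest. */
static bool model_in_window(const struct ack_model *m)
{
	return m->pirq_line && !m->virq_level;
}
```

Between model_resampler_ack() and model_trigger_arrives() the model
reports the IRQ as pending in the host but not in the guest, which is
the inconsistency in question.
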

> 
> 	M.
>
Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Paolo Bonzini 17 hours ago
On 8/10/22 19:02, Dmytro Maluka wrote:
>      1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>         but also remember this vIRQ as, let's call it, "pending oneshot".
> 
>      2. A new physical IRQ is immediately generated, so the vIRQ is
>         properly set as pending.
> 
>      3. After the vIRQ is unmasked by the guest, check and find out that
>         it is not just pending but also "pending oneshot", so don't
>         deliver it to a vCPU. Instead, immediately notify resamplefd once
>         again.
> 
> In other words, don't avoid extra physical interrupts in the host
> (rather, use those extra interrupts for properly updating the pending
> state of the vIRQ) but avoid propagating those extra interrupts to the
> guest.
> 
> Does this sound reasonable to you?

Yeah, this makes sense and it lets the resamplefd set the "pending" 
status in the vGIC.  It still has the issue that the interrupt can 
remain pending in the guest for longer than it's pending on the host, 
but that can't be fixed?

Paolo
RE: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
Posted by Liu, Rong L 2 hours ago
Hi Paolo and Dmytro,

> -----Original Message-----
> From: Paolo Bonzini <pbonzini@redhat.com>
> Sent: Wednesday, August 10, 2022 11:48 PM
> To: Dmytro Maluka <dmy@semihalf.com>; Marc Zyngier
> <maz@kernel.org>; eric.auger@redhat.com
> Cc: Dong, Eddie <eddie.dong@intel.com>; Christopherson,, Sean
> <seanjc@google.com>; kvm@vger.kernel.org; Thomas Gleixner
> <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav
> Petkov <bp@alien8.de>; Dave Hansen <dave.hansen@linux.intel.com>;
> x86@kernel.org; H. Peter Anvin <hpa@zytor.com>; linux-
> kernel@vger.kernel.org; Alex Williamson <alex.williamson@redhat.com>;
> Liu, Rong L <rong.l.liu@intel.com>; Zhenyu Wang
> <zhenyuw@linux.intel.com>; Tomasz Nowicki <tn@semihalf.com>;
> Grzegorz Jaszczyk <jaz@semihalf.com>; upstream@semihalf.com;
> Dmitry Torokhov <dtor@google.com>
> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> 
> On 8/10/22 19:02, Dmytro Maluka wrote:
> >      1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
> >         but also remember this vIRQ as, let's call it, "pending oneshot".
> >

This is the part that always confuses me. In the x86 case, for a
level-triggered interrupt, even if it is not oneshot, there is still an
"unmask", and it happens in the same sequence as for a oneshot interrupt;
only the timing is different. So are you going to differentiate oneshot
from "normal" level-triggered interrupts or not? And is there any
situation in which vEOI happens for an unmasked vIRQ?

> >      2. A new physical IRQ is immediately generated, so the vIRQ is
> >         properly set as pending.
> >

I am not sure this is always the case. For example, a device may not raise a
new interrupt until it is told that the driver is done reading, e.g. by the
device driver writing to a register once it has finished reading the data. So
how do you handle this situation?

> >      3. After the vIRQ is unmasked by the guest, check and find out that
> >         it is not just pending but also "pending oneshot", so don't
> >         deliver it to a vCPU. Instead, immediately notify resamplefd once
> >         again.
> >

Does this also mean changing the vfio code? That seems to be the case: vfio
seems to keep its own internal "state" tracking whether the irq is enabled or
not.

Thanks,

Rong
> > In other words, don't avoid extra physical interrupts in the host
> > (rather, use those extra interrupts for properly updating the pending
> > state of the vIRQ) but avoid propagating those extra interrupts to the
> > guest.
> >
> > Does this sound reasonable to you?
> 
> Yeah, this makes sense and it lets the resamplefd set the "pending"
> status in the vGIC.  It still has the issue that the interrupt can
> remain pending in the guest for longer than it's pending on the host,
> but that can't be fixed?
> 
> Paolo