[PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update

David Woodhouse posted 1 patch 1 year, 1 month ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/e1b97564a7728a5106134bc7063a8c62f134a45e.camel@infradead.org
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
hw/intc/ioapic.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
[PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
From: David Woodhouse <dwmw@amazon.co.uk>

A Linux guest will perform IRQ migration after the IRQ has happened,
updating the RTE to point to the new destination CPU and then unmasking
the interrupt.

However, when the guest updates the RTE, ioapic_mem_write() calls
ioapic_service(), which redelivers the pending level interrupt via
kvm_set_irq(), *before* calling ioapic_update_kvm_routes() which sets
the new target CPU.

Thus, the IRQ which is supposed to go to the new target CPU is instead
misdelivered to the previous target. An example where the guest kernel
is attempting to migrate from CPU#2 to CPU#0 shows:

xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[    0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[    0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[    0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[    0.526147] irq_do_set_affinity for IRQ 11, 0
[    0.526732] ioapic_set_affinity for IRQ 11, 0
[    0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[    0.527623] ioapic_set_affinity returns 0
[    0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[    0.529571] The affinity mask was 0 and the handler is on 2
[    xenstore_watch path memory/target token FFFFFFFF92847D40

There are no other code paths in ioapic_mem_write() which need the KVM
IRQ routing table to be updated, so just shift the call from the end
of the function to happen right before the call to ioapic_service()
and thus deliver the re-enabled IRQ to the right place.

Fixes: 15eafc2e602f ("kvm: x86: add support for KVM_CAP_SPLIT_IRQCHIP")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
Alternative fixes might have been just to remove the part in
ioapic_service() which delivers the IRQ via kvm_set_irq() because
surely delivering as MSI ought to work just fine anyway in all cases?
That code lacks a comment justifying its existence.

Or maybe in the specific case shown in the above log, it would have
sufficed for ioapic_update_kvm_routes() to update the route *even*
when the IRQ is masked. It's not like it's actually going to get
triggered unless QEMU deliberately does so, anyway? But that only
works because the target CPU happens to be in the high word of the
RTE; if something in the *low* word (vector, perhaps) was changed
at the same time as the unmask, we'd still trigger with stale data.

 hw/intc/ioapic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 6364ecab1b..716ffc8bbb 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +405,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
                 s->ioredtbl[index] |= ro_bits;
                 s->irq_eoi[index] = 0;
                 ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+                ioapic_update_kvm_routes(s);
                 ioapic_service(s);
             }
         }
@@ -417,8 +418,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
         ioapic_eoi_broadcast(val);
         break;
     }
-
-    ioapic_update_kvm_routes(s);
 }
 
 static const MemoryRegionOps ioapic_io_ops = {
-- 
2.39.0


Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by Peter Xu 1 year, 1 month ago
On Sun, Mar 05, 2023 at 06:43:42PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> A Linux guest will perform IRQ migration after the IRQ has happened,
> updating the RTE to point to the new destination CPU and then unmasking
> the interrupt.
> 
> However, when the guest updates the RTE, ioapic_mem_write() calls
> ioapic_service(), which redelivers the pending level interrupt via
> kvm_set_irq(), *before* calling ioapic_update_kvm_routes() which sets
> the new target CPU.
> 
> Thus, the IRQ which is supposed to go to the new target CPU is instead
> misdelivered to the previous target. An example where the guest kernel
> is attempting to migrate from CPU#2 to CPU#0 shows:
> 
> xenstore_read tx 0 path control/platform-feature-xs_reset_watches
> ioapic_set_irq vector: 11 level: 1
> ioapic_set_remote_irr set remote irr for pin 11
> ioapic_service: trigger KVM IRQ 11
> [    0.523627] The affinity mask was 0-3 and the handler is on 2
> ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
> ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
> ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
> xenstore_reset_watches
> ioapic_set_irq vector: 11 level: 1
> ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
> [    0.524569] ioapic_ack_level IRQ 11 moveit = 1
> ioapic_eoi_broadcast EOI broadcast for vector 33
> ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
> ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
> ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
> [    0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
> [    0.526147] irq_do_set_affinity for IRQ 11, 0
> [    0.526732] ioapic_set_affinity for IRQ 11, 0
> [    0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
> ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
> ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
> ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
> ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
> [    0.527623] ioapic_set_affinity returns 0
> [    0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
> ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
> ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
> ioapic_set_remote_irr set remote irr for pin 11
> ioapic_service: trigger KVM IRQ 11
> ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
> [    0.529571] The affinity mask was 0 and the handler is on 2
> [    xenstore_watch path memory/target token FFFFFFFF92847D40
> 
> There are no other code paths in ioapic_mem_write() which need the KVM
> IRQ routing table to be updated, so just shift the call from the end
> of the function to happen right before the call to ioapic_service()
> and thus deliver the re-enabled IRQ to the right place.
> 
> Fixes: 15eafc2e602f "kvm: x86: add support for KVM_CAP_SPLIT_IRQCHIP"
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Reviewed-by: Peter Xu <peterx@redhat.com>

> ---
> Alternative fixes might have been just to remove the part in
> ioapic_service() which delivers the IRQ via kvm_set_irq() because
> surely delivering as MSI ought to work just fine anyway in all cases?
> That code lacks a comment justifying its existence.

Didn't check all details, but AFAIU there are still some different paths
being triggered, so at least it looks cleaner to keep using the path this
code is for.

E.g., I think if someone traces kvm_set_irq() in the kernel, this specific
IRQ triggered right after unmasking might seem to be missed mysteriously
(but actually it was not).

Thanks,

> 
> Or maybe in the specific case shown in the above log, it would have
> sufficed for ioapic_update_kvm_routes() to update the route *even*
> when the IRQ is masked. It's not like it's actually going to get
> triggered unless QEMU deliberately does so, anyway? But that only
> works because the target CPU happens to be in the high word of the
> RTE; if something in the *low* word (vector, perhaps) was changed
> at the same time as the unmask, we'd still trigger with stale data.

-- 
Peter Xu
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago

On 5 March 2023 22:36:18 GMT, Peter Xu <peterx@redhat.com> wrote:
>On Sun, Mar 05, 2023 at 06:43:42PM +0000, 
>> ---
>> Alternative fixes might have been just to remove the part in
>> ioapic_service() which delivers the IRQ via kvm_set_irq() because
>> surely delivering as MSI ought to work just fine anyway in all cases?
>> That code lacks a comment justifying its existence.
>
>Didn't check all details, but AFAIU there're still some different paths
>triggered so at least it looks still clean to use the path it's for.
>
>E.g., I think if someone traces kvm_set_irq() in kernel this specific irq
>triggered right after unmasking might seem to be missed mysteriously (but
>actually it was not).

Hm, not sure that's a reason we care about. The I/OAPIC is purely a device to turn line interrupts into MSIs, which these days need to be translated by IOMMU interrupt remapping device models in userspace. I don't think a user has any valid reason to expect that the kernel will even know about any GSIs with any specific numbers. Tracing on that in the kernel would be making some dodgy assumptions.
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Mon, 2023-03-06 at 06:51 +0000, David Woodhouse wrote:
> On 5 March 2023 22:36:18 GMT, Peter Xu <peterx@redhat.com> wrote:
> > On Sun, Mar 05, 2023 at 06:43:42PM +0000, 
> > > ---
> > > Alternative fixes might have been just to remove the part in
> > > ioapic_service() which delivers the IRQ via kvm_set_irq() because
> > > surely delivering as MSI ought to work just fine anyway in all cases?
> > > That code lacks a comment justifying its existence.
> > 
> > Didn't check all details, but AFAIU there're still some different paths
> > triggered so at least it looks still clean to use the path it's for.
> > 
> > E.g., I think if someone traces kvm_set_irq() in kernel this specific irq
> > triggered right after unmasking might seem to be missed mysteriously (but
> > actually it was not).
> 
> Hm, not sure that's a reason we care about. The I/OAPIC is purely a
> device to turn line interrupts into MSIs. Which these days need to be
> translated by IOMMU interrupt remapping device models in userspace. I
> don't think a user has any valid reason to expect that the kernel
> will even know about any GSIs with any specific numbers. Tracing on
> that in the kernel would be making some dodgy assumptions.

I think if we want to fix the lack of IR translation faults from the
IOMMU, we have to change this anyway.

At the very least, it can only use KVM to deliver the IRQ if there *is*
a valid translation. And if not, it needs to ask the IOMMU to
retranslate, with a 'delivering_now' flag indicating that the fault
should be raised because this isn't a preemptive translation in
advance.


Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by Peter Xu 1 year, 1 month ago
On Mon, Mar 06, 2023 at 09:25:35AM +0000, David Woodhouse wrote:
> On Mon, 2023-03-06 at 06:51 +0000, David Woodhouse wrote:
> > On 5 March 2023 22:36:18 GMT, Peter Xu <peterx@redhat.com> wrote:
> > > On Sun, Mar 05, 2023 at 06:43:42PM +0000, 
> > > > ---
> > > > Alternative fixes might have been just to remove the part in
> > > > ioapic_service() which delivers the IRQ via kvm_set_irq() because
> > > > surely delivering as MSI ought to work just fine anyway in all cases?
> > > > That code lacks a comment justifying its existence.
> > > 
> > > Didn't check all details, but AFAIU there're still some different paths
> > > triggered so at least it looks still clean to use the path it's for.
> > > 
> > > E.g., I think if someone traces kvm_set_irq() in kernel this specific irq
> > > triggered right after unmasking might seem to be missed mysteriously (but
> > > actually it was not).
> > 
> > Hm, not sure that's a reason we care about. The I/OAPIC is purely a
> > device to turn line interrupts into MSIs. Which these days need to be
> > translated by IOMMU interrupt remapping device models in userspace. I
> > don't think a user has any valid reason to expect that the kernel
> > will even know about any GSIs with any specific numbers. Tracing on
> > that in the kernel would be making some dodgy assumptions.
> 
> I think if we want to fix the lack of IR translation faults from the
> IOMMU, we have to change this anyway.
> 
> At the very least, it can only use KVM to deliver the IRQ if there *is*
> a valid translation. And if not, it needs to ask the IOMMU to
> retranslate, with a 'delivering_now' flag indicating that the fault
> should be raised because this isn't a preemptive translation in
> advance.

I see that as a separate problem from what this patch wants to (rightfully)
fix.  I also agree we'll need that if e.g. we want to support the ioapic in
kvm.

Sorry in advance, since I don't think I have the whole picture here.  Could
you remind me of the whole purpose of having this?  Is that part of Xen's
effort?

It was a long time ago, but IIRC such a mechanism wasn't implemented when we
worked on vIOMMU IR initially, because at least at that time the "keeping
kvm irq routes always updated when IRQ entries are modified" was good enough
for qemu to work with IR.

The only outlier was the in-kernel ioapic, but there was a talk from Google
showing that the in-kernel ioapic brought mostly nothing good, just more
exposure of attack surface.  That sounds reasonable to me, since fast
devices shouldn't use the ioapic at all, so IIRC the plan was that split
irqchip should always be optimal, hence no immediate concern about not
supporting IR with the in-kernel ioapic.  But I can definitely have missed
important things along the way.

Thanks,

-- 
Peter Xu
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Mon, 2023-03-06 at 11:39 -0500, Peter Xu wrote:
> On Mon, Mar 06, 2023 at 09:25:35AM +0000, David Woodhouse wrote:
> > I think if we want to fix the lack of IR translation faults from the
> > IOMMU, we have to change this anyway.
> > 
> > At the very least, it can only use KVM to deliver the IRQ if there *is*
> > a valid translation. And if not, it needs to ask the IOMMU to
> > retranslate, with a 'delivering_now' flag indicating that the fault
> > should be raised because this isn't a preemptive translation in
> > advance.
> 
> I see that as a separate problem of what this patch wants to (rightfully)
> fix. 
> 

Agreed. Which is why I didn't just rip the kvm_set_irq() out and call
it fixed, and put that discussion below the ^--- cutoff.

>  I also agree we'll need that if e.g. we want to support ioapic in
> kvm.
>
> Sorry in advance since I don't think I have the whole picture here.  Could
> you remind me the whole purpose of having such?  Is that for part of Xen's
> effort?
> 
> It was a long time ago, but IIRC such mechanism wasn't implemented when we
> worked on vIOMMU IR initially, because at least at that time the "keeping
> kvm irq routes always updated when IRQ entries modified" was good enough
> for qemu to work with IR.
> 
> The only outlier was in-kernel ioapic, but there was a talk from Google
> showing that in kernel ioapic brought mostly nothing good but more exposure
> on attack surface.  It does sound reasonable since fast devices shouldn't
> use ioapic at all to me, so IIRC the plan was split irqchip should be
> always optimal hence no immediate concern of not supporting IR with
> in-kernel ioapic.  But I definitely can miss important things along the
> way.

Indeed, I don't think we care about the in-kernel I/OAPIC. I don't
think we care about the kernel knowing about e.g. "GSI #11" at all. We
can just deliver it as MSI (for the I/OAPIC) or using KVM_INTERRUPT and
the interrupt window as we do for the PIC. Which is why I'd happily rip
that out and let it be delivered via the APIC intercept at 0xfeexxxxx.

The existing code which just keeps IRQ routes updated when they're
valid is kind of OK, and well-behaved guests can function. But it isn't
*right* in the case where they aren't valid.

What *ought* to happen is that the IOMMU should raise a fault at the
moment the interrupt occurs, if the translation isn't valid. And we
don't have that at all.

As for why I care? I don't really *need* it, as I have everything I
need for Xen PIRQ support already merged in 
https://gitlab.com/qemu-project/qemu/-/commit/6096cf7877

So while the thread at
https://lore.kernel.org/qemu-devel/aaef9961d210ac1873153bf3cf01d984708744fc.camel@infradead.org/
was partly driven by expecting to need this for Xen PIRQ support
(because in $DAYJOB I did those things in the other order and the PIRQ
support ended up just being a trivial different translator like the
IOMMU's IR)... I'd still quite like to fix it up in QEMU anyway, just
for correctness and fidelity in the faulting cases.

We can do more efficient invalidation too, rather than blowing away the
entire routing table every time. Just disconnect the IRQFD for the
specific interrupts that get invalidated, and let them get fixed up
again next time they occur.
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by Peter Xu 1 year, 1 month ago
On Mon, Mar 06, 2023 at 05:28:24PM +0000, David Woodhouse wrote:
> Indeed, I don't think we care about the in-kernel I/OAPIC. I don't
> think we care about the kernel knowing about e.g. "GSI #11" at all. We
> can just deliver it as MSI (for the I/OAPIC) or using KVM_INTERRUPT and
> the interrupt window as we do for the PIC. Which is why I'd happily rip
> that out and let it be delivered via the APIC intercept at 0xfeexxxxx.
> 
> The existing code which just keeps IRQ routes updated when they're
> valid is kind of OK, and well-behaved guests can function. But it isn't
> *right* in the case where they aren't valid.
> 
> What *ought* to happen is that the IOMMU should raise a fault at the
> moment the interrupt occurs, if the translation isn't valid. And we
> don't have that at all.

Right, that's definitely not ideal as an emulator.

> 
> As for why I care? I don't really *need* it, as I have everything I
> need for Xen PIRQ support already merged in 
> https://gitlab.com/qemu-project/qemu/-/commit/6096cf7877
> 
> So while the thread at
> https://lore.kernel.org/qemu-devel/aaef9961d210ac1873153bf3cf01d984708744fc.camel@infradead.org/
> was partly driven by expecting to need this for Xen PIRQ support
> (because in $DAYJOB I did those things in the other order and the PIRQ
> support ended up just being a trivial different translator like the
> IOMMU's IR)... I'd still quite like to fix it up in QEMU anyway, just
> for correctness and fidelity in the faulting cases.
> 
> We can do more efficient invalidation too, rather than blowing away the
> entire routing table every time. Just disconnect the IRQFD for the
> specific interrupts that get invalidated, and let them get fixed up
> again next time they occur.

I'm curious whether there's anything else beside the "correctness of
emulation" reason.

I would think it nice if it existed or were trivial to have, as you said.  I
just don't know whether it's that easy; at least so far a new kernel
interface still seems to be needed, allowing a kernel IRQ to be paused until
it has been translated by QEMU via some channel we provide.

So, IMHO it's about whether the reason that "we want to have a complete
emulation of IR" can properly justify the complexity of at least the kernel
interface (I don't worry about the QEMU side a lot).  After all, even if it
completes the emulation, 99.99% of people will not use it. :(

-- 
Peter Xu
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Wed, 2023-03-08 at 18:09 -0500, Peter Xu wrote:
> On Mon, Mar 06, 2023 at 05:28:24PM +0000, David Woodhouse wrote:
> > Indeed, I don't think we care about the in-kernel I/OAPIC. I don't
> > think we care about the kernel knowing about e.g. "GSI #11" at all. We
> > can just deliver it as MSI (for the I/OAPIC) or using KVM_INTERRUPT and
> > the interrupt window as we do for the PIC. Which is why I'd happily rip
> > that out and let it be delivered via the APIC intercept at 0xfeexxxxx.
> > 
> > The existing code which just keeps IRQ routes updated when they're
> > valid is kind of OK, and well-behaved guests can function. But it isn't
> > *right* in the case where they aren't valid.
> > 
> > What *ought* to happen is that the IOMMU should raise a fault at the
> > moment the interrupt occurs, if the translation isn't valid. And we
> > don't have that at all.
> 
> Right, that's definitely not ideal as an emulator.
> 
> > 
> > As for why I care? I don't really *need* it, as I have everything I
> > need for Xen PIRQ support already merged in 
> > https://gitlab.com/qemu-project/qemu/-/commit/6096cf7877
> > 
> > So while the thread at
> > https://lore.kernel.org/qemu-devel/aaef9961d210ac1873153bf3cf01d984708744fc.camel@infradead.org/
> > was partly driven by expecting to need this for Xen PIRQ support
> > (because in $DAYJOB I did those things in the other order and the PIRQ
> > support ended up just being a trivial different translator like the
> > IOMMU's IR)... I'd still quite like to fix it up in QEMU anyway, just
> > for correctness and fidelity in the faulting cases.
> > 
> > We can do more efficient invalidation too, rather than blowing away the
> > entire routing table every time. Just disconnect the IRQFD for the
> > specific interrupts that get invalidated, and let them get fixed up
> > again next time they occur.
> 
> I'm curious whether there's anything else beside the "correctness of
> emulation" reason.

Nah, at this point it's just because it annoys me. I did this in
another VMM and it works nicely, and I'd like QEMU to do the same. ;)

> I would think it nice if it existed or trivial to have as what you said.  I
> just don't know whether it's as easy, at least so far a new kernel
> interface seems still needed, allowing a kernel irq to be paused until
> being translated by QEMU from some channel we provide.

It doesn't need a new kernel interface.

With IRQ remapping we have a userspace I/OAPIC, so how we deliver its
MSIs is entirely up to us — we absolutely don't need the kernel to have
*any* KVM_IRQ_ROUTING_IRQCHIP entries.

The only IRQs that are handled fully in the kernel are events arriving
on some eventfd which is bound as an IRQFD to some IRQ in the KVM
routing table. (Mostly MSIs coming from VFIO).

If we want to "pause" one of those, all we have to do is unbind the
IRQFD and then we can handle that eventfd in userspace instead. Which
is what we do *anyway* in the case where IRQFD support isn't available.

In
https://lore.kernel.org/kvm/20201027143944.648769-1-dwmw2@infradead.org/
I *optimised* that dance, so userspace doesn't have to stop listening
on the eventfd when the IRQFD is bound; the IRQFD code consumes the
event first. But we can live without that in older kernels.

Basically, it's just treating the IRQFD support as an *optimisation*,
and retaining the 'slow path' of handling the event in userspace and
injecting the resulting MSI.

It's just that in that slow path, as we're translating and injecting
the MSI, we *also* update the IRQ routing table and reattach the IRQFD
for the next time that interrupt fires. Like a cache.

And we stash an invalidation cookie (for Intel-IOMMU the IRTE index)
for each routing table entry, so that when asked to invalidate a
*range* of cookies (that's how the Intel IOMMU invalidate works), we
can just detach the IRQFD from those ones, letting them get handled in
userspace next time and retranslated (with associated fault report, as
appropriate).


Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by Peter Xu 1 year, 1 month ago
On Thu, Mar 09, 2023 at 09:16:08AM +0000, David Woodhouse wrote:
> On Wed, 2023-03-08 at 18:09 -0500, Peter Xu wrote:
> > On Mon, Mar 06, 2023 at 05:28:24PM +0000, David Woodhouse wrote:
> > > Indeed, I don't think we care about the in-kernel I/OAPIC. I don't
> > > think we care about the kernel knowing about e.g. "GSI #11" at all. We
> > > can just deliver it as MSI (for the I/OAPIC) or using KVM_INTERRUPT and
> > > the interrupt window as we do for the PIC. Which is why I'd happily rip
> > > that out and let it be delivered via the APIC intercept at 0xfeexxxxx.
> > > 
> > > The existing code which just keeps IRQ routes updated when they're
> > > valid is kind of OK, and well-behaved guests can function. But it isn't
> > > *right* in the case where they aren't valid.
> > > 
> > > What *ought* to happen is that the IOMMU should raise a fault at the
> > > moment the interrupt occurs, if the translation isn't valid. And we
> > > don't have that at all.
> > 
> > Right, that's definitely not ideal as an emulator.
> > 
> > > 
> > > As for why I care? I don't really *need* it, as I have everything I
> > > need for Xen PIRQ support already merged in 
> > > https://gitlab.com/qemu-project/qemu/-/commit/6096cf7877
> > > 
> > > So while the thread at
> > > https://lore.kernel.org/qemu-devel/aaef9961d210ac1873153bf3cf01d984708744fc.camel@infradead.org/
> > > was partly driven by expecting to need this for Xen PIRQ support
> > > (because in $DAYJOB I did those things in the other order and the PIRQ
> > > support ended up just being a trivial different translator like the
> > > IOMMU's IR)... I'd still quite like to fix it up in QEMU anyway, just
> > > for correctness and fidelity in the faulting cases.
> > > 
> > > We can do more efficient invalidation too, rather than blowing away the
> > > entire routing table every time. Just disconnect the IRQFD for the
> > > specific interrupts that get invalidated, and let them get fixed up
> > > again next time they occur.
> > 
> > I'm curious whether there's anything else beside the "correctness of
> > emulation" reason.
> 
> Nah, at this point it's just because it annoys me. I did this in
> another VMM and it works nicely, and I'd like QEMU to do the same. ;)
> 
> > I would think it nice if it existed or trivial to have as what you said.  I
> > just don't know whether it's as easy, at least so far a new kernel
> > interface seems still needed, allowing a kernel irq to be paused until
> > being translated by QEMU from some channel we provide.
> 
> It doesn't need a new kernel interface.
> 
> With IRQ remapping we have a userspace I/OAPIC, so how we deliver its
> MSIs is entirely up to us — we absolutely don't need the kernel to have
> *any* KVM_IRQ_ROUTING_IRQCHIP entries.

Oh so it's about split irqchip only..  Yes indeed it looks like it's
possible to not change the kernel.

> The only IRQs that are handled fully in the kernel are events arriving
> on some eventfd which is bound as an IRQFD to some IRQ in the KVM
> routing table. (Mostly MSIs coming from VFIO).
> 
> If we want to "pause" one of those, all we have to do is unbind the
> IRQFD and then we can handle that eventfd in userspace instead. Which
> is what we do *anyway* in the case where IRQFD support isn't available.
> 
> In
> https://lore.kernel.org/kvm/20201027143944.648769-1-dwmw2@infradead.org/
> I *optimised* that dance, so userspace doesn't have to stop listening
> on the eventfd when the IRQFD is bound; the IRQFD code consumes the
> event first. But we can live without that in older kernels.
> 
> Basically, it's just treating the IRQFD support as an *optimisation*,
> and retaining the 'slow path' of handling the event in userspace and
> injecting the resulting MSI.
> 
> It's just that in that slow path, as we're translating and injecting
> the MSI, we *also* update the IRQ routing table and reattach the IRQFD
> for the next time that interrupt fires. Like a cache.
> 
> And we stash an invalidation cookie (for Intel-IOMMU the IRTE index)
> for each routing table entry, so that when asked to invalidate a
> *range* of cookies (that's how the Intel IOMMU invalidate works), we
> can just detach the IRQFD from those ones, letting them get handled in
> userspace next time and retranslated (with associated fault report, as
> appropriate).

We can maintain a cookie per entry in the routing table in userspace, I
think, and ignore those when applying to KVM (perhaps we need to regenerate
another table when applying, as KVM won't recognize the cookies).  Besides
that, do we need other infrastructure to link this entry / GSI back to
whichever device registered it?  I assume we need to tear down the irqfds
where there are any, and rehook the event to a userspace handler here too.

There're four devices that can hook onto this, IIUC.  Besides IOAPIC and
VFIO, there's also ivshmem and vhost.  IIUC we'll need to change all the
four devices to implement this.

Besides the changeset (which still seems non-trivial to me, even without
yet evaluating whether it'll be worth the effort), one concern I have
right now is whether delaying the 1st IRQ would regress in any case.

I think you may have better knowledge here than me on how guests behave in
the IRQ subsystem.  For example, is there any case where IRQs can be
modified / invalidated frequently (perhaps mask / unmask?) so that a lot of
IRQs would be delivered more slowly than before?  Because after this change
the IRQ setup / cache overhead will be applied to the 1st IRQ being
triggered rather than when the IRQ was established / unmasked.

This also reminded me (just out of curiosity) of what will happen without
IR at all: say, what if a device sets up a wrong MSI (with a messed-up MSI
data register), as on bare metal?  Does QEMU properly emulate that too so
far?

-- 
Peter Xu


Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Thu, 2023-03-09 at 11:55 -0500, Peter Xu wrote:
> 
> There're four devices that can hook onto this, IIUC.  Besides IOAPIC and
> VFIO, there's also ivshmem and vhost.  IIUC we'll need to change all the
> four devices to implement this.

If you grep for kvm_irqchip_add_irqfd_notifier() there are more than that.

There's a bunch of largely duplicated code, with different code paths
for kvm_irqfds_enabled() and other variants. In code that I don't think
should even have to *know* about KVM, should it? 

I think I'd like to provide a generic set of helpers which just allow
callers to register a virtual IRQ and then trigger it manually and/or
attach an irqfd (and a resamplefd, qv) to it.

This new helper code can then cope with listening in userspace on that
fd if/when it needs to, and can even work for the non-KVM case. The
actual devices get a *lot* simpler.

It'll *broadly* look like the existing kvm_irqchip_* functions but be a
lot simpler to use.
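To make the shape concrete, here is a minimal sketch of what such a helper might look like. All the names (VirtIRQ, virq_register(), virq_attach_irqfd()) are invented for illustration and are not existing QEMU API:

```c
#include <stdbool.h>

/* Hypothetical per-IRQ state the helper would manage on behalf of a
 * device; none of these names exist in QEMU today. */
typedef struct VirtIRQ {
    int irqfd;        /* eventfd the device signals, -1 if none   */
    int resamplefd;   /* re-arm fd for level-triggered IRQs       */
    bool kvm_routed;  /* true while KVM delivers the fd directly  */
} VirtIRQ;

/* Register a virtual IRQ; it starts out delivered from userspace. */
static void virq_register(VirtIRQ *v)
{
    v->irqfd = -1;
    v->resamplefd = -1;
    v->kvm_routed = false;
}

/* Attach an irqfd (and optional resamplefd).  With KVM available the
 * helper would bind the fd into the routing table; without it, the
 * helper listens on the fd itself, so the device code is identical
 * either way. */
static void virq_attach_irqfd(VirtIRQ *v, int fd, int resamplefd,
                              bool kvm_available)
{
    v->irqfd = fd;
    v->resamplefd = resamplefd;
    v->kvm_routed = kvm_available && fd >= 0;
}
```

The point of the shape is that the device only ever sees virq_* calls; whether KVM consumes the eventfd or userspace polls it is the helper's business.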



Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by Peter Xu 1 year, 1 month ago
On Fri, Mar 10, 2023 at 05:52:57PM +0000, David Woodhouse wrote:
> On Thu, 2023-03-09 at 11:55 -0500, Peter Xu wrote:
> > 
> > There are four devices that can hook onto this, IIUC.  Besides IOAPIC and
> > VFIO, there's also ivshmem and vhost.  IIUC we'll need to change all
> > four devices to implement this.
> 
> If you grep for kvm_irqchip_add_irqfd_notifier() there are more than that.

Looks right to me.  I assume they're all line-based IRQs routed later to the
IOAPIC, so they're the real devices consuming the IOAPIC entries.

> There's a bunch of largely duplicated code, with different code paths
> for kvm_irqfds_enabled() and other variants. In code that I don't think
> should even have to *know* about KVM, should it? 
> 
> I think I'd like to provide a generic set of helpers which just allow
> callers to register a virtual IRQ and then trigger it manually and/or
> attach an irqfd (and a resamplefd, qv) to it.
> 
> This new helper code can then cope with listening in userspace on that
> fd if/when it needs to, and can even work for the non-KVM case. The
> actual devices get a *lot* simpler.
> 
> It'll *broadly* look like the existing kvm_irqchip_* functions but be a
> lot simpler to use.

IIUC what's missing is the reverse chain of notifications from e.g. IRTE to
the device, either via MSIs or via some pins of IOAPIC.

I don't think I have very good knowledge of the whole IRQ path yet, so I
can't really say anything useful, but what you said looks like a good
thing to have.  If it can clean things up besides achieving the goal of
fault IRQ reporting, it could be even more worthwhile.

Thanks,

-- 
Peter Xu


Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Fri, 2023-03-10 at 15:13 -0500, Peter Xu wrote:
> 
> > 
> > It'll *broadly* look like the existing kvm_irqchip_* functions but be a
> > lot simpler to use.
> 
> IIUC what's missing is the reverse chain of notifications from e.g. IRTE to
> the device, either via MSIs or via some pins of IOAPIC.

For the translation we don't need a reverse path.

When a range of 'cookies' is invalidated, the generic irqroute code
just kicks the irqfd out of the routing table.

That means the device code gets notified by its *own* eventfd. It tries
to say "deliver <this> virq", gets a "nope, it's invalid", and then
reprovides the original MSI information to be (re)translated (and to
raise a fault if it's invalid now).

Looks a bit like this...

	if (timer.gsi == static_cast<uint32_t>(-1))
		return -1;

	if (irqchip_try_deliver_gsi(timer.gsi) >= 0)
		return 0;

	struct kvm_msi msi = read_timer_msi_reg(timer);
	if (irqchip_update_msi(timer.gsi, &msi, deliver_now = true) < 0)
		errno = 0;


Now, for the interrupt *acknowledge* (kvm_resample_fd_notify) we
absolutely need a reverse path, and we need that anyway regardless of
KVM routing — VFIO and the Xen event channel GSI can both be massively
cleaned up if we have that wired up from the EOI/intack. But that's
*mostly* a separate issue, except that of course we want to wire that
resamplefd up in the routing table *too* and make things consistent in
userspace vs. KVM.

> I don't think I have very good knowledge of the whole IRQ path yet, so I
> can't really say anything useful, but what you said looks like a good
> thing to have.  If it can clean things up besides achieving the goal of
> fault IRQ reporting, it could be even more worthwhile.

Yeah, that's definitely the intent.

When I first started playing with this in order to implement interrupt
remapping in my other VMM, I carefully drew out state diagrams in
pencil, tracking which eventfd was connected to userspace/KVM in
different states across first-trigger/masking/unmasking/live-update,
with a large 'WTF!!!' on one of the state transitions from masked
to unmasked, when the IRQ happened anyway.

For QEMU I haven't even been able to write it out, because the logic is
split across different devices and handled differently in each. I think
it can be really simple.
Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Posted by David Woodhouse 1 year, 1 month ago
On Thu, 2023-03-09 at 11:55 -0500, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 09:16:08AM +0000, David Woodhouse wrote:
> > The only IRQs that are handled fully in the kernel are events arriving
> > on some eventfd which is bound as an IRQFD to some IRQ in the KVM
> > routing table. (Mostly MSIs coming from VFIO).
> > 
> > If we want to "pause" one of those, all we have to do is unbind the
> > IRQFD and then we can handle that eventfd in userspace instead. Which
> > is what we do *anyway* in the case where IRQFD support isn't available.
> > 
> > In
> > https://lore.kernel.org/kvm/20201027143944.648769-1-dwmw2@infradead.org/
> > I *optimised* that dance, so userspace doesn't have to stop listening
> > on the eventfd when the IRQFD is bound; the IRQFD code consumes the
> > event first. But we can live without that in older kernels.
> > 
> > Basically, it's just treating the IRQFD support as an *optimisation*,
> > and retaining the 'slow path' of handling the event in userspace and
> > injecting the resulting MSI.
> > 
> > It's just that in that slow path, as we're translating and injecting
> > the MSI, we *also* update the IRQ routing table and reattach the IRQFD
> > for the next time that interrupt fires. Like a cache.
> > 
> > And we stash an invalidation cookie (for Intel-IOMMU the IRTE index)
> > for each routing table entry, so that when asked to invalidate a
> > *range* of cookies (that's how the Intel IOMMU invalidate works), we
> > can just detach the IRQFD from those ones, letting them get handled in
> > userspace next time and retranslated (with associated fault report, as
> > appropriate).
> 
> We can maintain a cookie per entry in the routing table in userspace, I
> think, and ignore those when applying to KVM (perhaps need to regenerate
> another table when applying?  As KVM won't recognize the cookies). Besides
> that, do we need other infrastructure to link this entry / GSI back to
> the device that registered it?  Since I assume we need to tear down irqfds
> if there are any, and rehook the event to a userspace handler here too.
> 

KVM doesn't need to recognise the cookies. Only the userspace code
which handles the core IRQ routing table and the IRQFD assignment.

I *have* just abused some S390-based fields for the cookie and a
'stale' bit...

/* 64-bit cookie for IOMMU to use for invalidation choices */
#define ire_ir_cookie(ire) ((ire)->u.adapter.ind_offset)

/* Flags, to indicate a stale entry that needs retranslating */
#define ire_user_flags(ire) ((ire)->u.adapter.summary_offset)
#define IRE_USER_FLAG_STALE		1

... but it could be done in a separate data structure just as well.

Given a range of cookies to invalidate, the core code just detaches the
IRQFD for any entry in the KVM IRQ routing table that has a cookie
within that range.
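As a sketch of that invalidation walk (the entry layout and field names here are invented for illustration, not the actual QEMU or KVM structures):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a routing-table entry carrying the IOMMU
 * invalidation cookie and the stale flag described above. */
typedef struct RouteEntry {
    uint64_t cookie;      /* e.g. the Intel-IOMMU IRTE index       */
    bool stale;           /* needs retranslating on next delivery  */
    bool irqfd_bound;     /* eventfd currently attached to KVM     */
} RouteEntry;

/* Detach the IRQFD from every entry whose cookie falls in
 * [start, end), so the next trigger is handled in userspace and
 * retranslated (raising a fault if it is now invalid).  Returns the
 * number of entries invalidated. */
static size_t invalidate_cookie_range(RouteEntry *table, size_t n,
                                      uint64_t start, uint64_t end)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (table[i].cookie >= start && table[i].cookie < end) {
            table[i].stale = true;
            table[i].irqfd_bound = false;
            hits++;
        }
    }
    return hits;
}
```

Entries outside the range keep their IRQFD attached, so the fast path is undisturbed for unrelated interrupts.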

> There are four devices that can hook onto this, IIUC.  Besides IOAPIC and
> VFIO, there's also ivshmem and vhost.  IIUC we'll need to change all
> four devices to implement this.
> 
> Besides the changeset (which still seems non-trivial to me, without
> yet evaluating whether it'll be worth the effort), one concern I have
> right now is whether delaying the first IRQ would regress anything.

It's fine. In QEMU you don't *have* to delay the first IRQ; you *can*
prepopulate the cache at the moment the guest programs a device's MSI
table, for example.

In a certain other implementation, we don't prepopulate so that first
IRQ does get handled in userspace every time, because we want to keep
track of whether a given MSI has *ever* fired or not. And there's been
absolutely no issue with that latency.

> I think you may have better knowledge here than me on how guests behave in
> the IRQ subsystem.  For example, is there any case where IRQs can be modified /
> invalidated frequently (perhaps mask / unmask?)

Mask/unmask shouldn't invalidate the cache. 

>  so there could be a lot of IRQs delivered more slowly than before? Because
> after this change the IRQ setup / cache overhead will be applied to
> the first IRQ being triggered rather than when the IRQ was established /
> unmasked.

We've launched... a lot... of guests with this model and not seen any
issues :)

I'll knock up a prototype in QEMU and we can reason about it far more
coherently. I think it ends up actually being a simplification and
leading to easier-to-understand code.

> This also reminds me (just out of curiosity) of what would happen
> without IR at all: say, what if a device sets up a wrong MSI (with a messed-up
> MSI data register) on bare metal?  Then, does QEMU properly emulate that
> too?

For the most part, an MSI isn't special; it's *just* a memory write
transaction. Traditionally, we ask the device to write to some address
in the 0xFEExxxxx range, which happens not to be actual memory, but is
the APIC.

But if you want to just pin a data structure in memory, and give the
device the address of some 32-bit field in that data structure, then
*poll* for completion to see when the device wrote there... that's just
fine.

That would *generally* work in QEMU too, since we mostly raise MSI from
devices by doing that actual stl_le_phys() call.
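A toy model of that view (stl_le_phys() is the real QEMU call; the plain buffer here just stands in for guest physical memory):

```c
#include <stdint.h>

/* An MSI is nothing more than a 32-bit little-endian store to the
 * target address; model it against a flat buffer. */
static void msi_store_le32(uint8_t *mem, uint64_t addr, uint32_t val)
{
    mem[addr + 0] = val & 0xff;
    mem[addr + 1] = (val >> 8) & 0xff;
    mem[addr + 2] = (val >> 16) & 0xff;
    mem[addr + 3] = (val >> 24) & 0xff;
}

/* A guest that pointed the device at a pinned data structure would
 * simply poll the field with a load like this until the value appears. */
static uint32_t msi_load_le32(const uint8_t *mem, uint64_t addr)
{
    return (uint32_t)mem[addr] |
           ((uint32_t)mem[addr + 1] << 8) |
           ((uint32_t)mem[addr + 2] << 16) |
           ((uint32_t)mem[addr + 3] << 24);
}
```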

The caveat is the "special" KVM way of encoding APIC IDs > 255, by
putting the high bits into the high bits of the 64-bit MSI target
address. That means that if handled as an actual memory transaction, it
would *miss* the APIC at 0x00000000FEExxxxx and really go to memory.

Which is why in some (post-translation) cases we have to treat it
"specially" as an MSI instead of just a memory write. Which I think is
actually the reason for that explicit kvm_set_irq() in ioapic_service()
which I mentioned at the start of this thread.

You'll note that when I added basically the same special case to
pci_msi_trigger() in commit 6096cf7877 I felt it warranted at least 5
lines of comment to explain itself :)
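A rough sketch of the check that decides between the two paths; the helper name is invented here, and this only approximates the real logic in pci_msi_trigger() and friends:

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_DEFAULT_ADDRESS 0xfee00000ull

/* A device write must take the MSI path rather than the plain memory
 * path if it lands in the 0xFEExxxxx APIC window, or if KVM has
 * encoded an APIC ID > 255 in the high 32 bits of the address, in
 * which case a literal memory transaction would miss the APIC at
 * 0x00000000FEExxxxx and really go to memory. */
static bool write_needs_msi_path(uint64_t addr)
{
    if (addr >> 32) {
        return true;    /* extended destination ID in the high bits */
    }
    return (addr & 0xfff00000ull) == APIC_DEFAULT_ADDRESS;
}
```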