[PATCH v2 05/16] xen/riscv: introduce tracking of pending vCPU interrupts, part 2

Oleksii Kurochko posted 16 patches 2 weeks, 3 days ago
Posted by Oleksii Kurochko 2 weeks, 3 days ago
This patch is based on Linux kernel 6.16.0.

Add the consumer side (vcpu_flush_interrupts()) of the lockless pending
interrupt tracking introduced in part 1 (for producers). According to the
design, only one consumer is possible, and it is the vCPU itself.
vcpu_flush_interrupts() is expected to be run before the hypervisor returns
control to the guest (guests cannot be run yet due to missing
functionality).

Producers may set bits in irqs_pending_mask without a lock. Clearing bits in
irqs_pending_mask is performed only by the consumer via xchg() (with acquire &
release semantics). The consumer must not write to irqs_pending and must not
act on bits that are not set in the mask. Otherwise, extra synchronization
would have to be provided.
The worst that can happen with this approach is that a new pending bit is
set in the irqs_pending bitmap while the hvip variable is being updated in
vcpu_flush_interrupts(). That is not a problem: the new pending bit is not
lost and is simply processed during the next flush.

It is possible for a guest to have a bit pending in the hardware register
without it being marked pending in the irqs_pending bitmap, because:
  According to the RISC-V ISA specification:
    Bits hip.VSSIP and hie.VSSIE are the interrupt-pending and
    interrupt-enable bits for VS-level software interrupts. VSSIP in hip
    is an alias (writable) of the same bit in hvip.
  Additionally:
    When bit 2 of hideleg is zero, vsip.SSIP and vsie.SSIE are read-only
    zeros. Else, vsip.SSIP and vsie.SSIE are aliases of hip.VSSIP and
    hie.VSSIE.
This means the guest may modify vsip.SSIP, which implicitly updates
hip.VSSIP, and writing 1 to the bit would also trigger an interrupt, since
according to the RISC-V spec:
  These conditions for an interrupt trap to occur must be evaluated in a
  bounded amount of time from when an interrupt becomes, or ceases to be,
  pending in sip, and must also be evaluated immediately following the
  execution of an SRET instruction or an explicit write to a CSR on which
  these interrupt trap conditions expressly depend (including sip, sie and
  sstatus).
This means that IRQ_VS_SOFT must be synchronized separately, which is done
in vcpu_sync_interrupts(). Note also that IRQ_PMU_OVF would need to be
synced for a similar reason as IRQ_VS_SOFT, but it is not synced for now as
the PMU isn't supported yet.

For the remaining VS-level interrupt types (IRQ_VS_TIMER and
IRQ_VS_EXT), the specification states they cannot be modified by the guest
and are read-only:
  Bits hip.VSEIP and hie.VSEIE are the interrupt-pending and interrupt-enable
  bits for VS-level external interrupts. VSEIP is read-only in hip, and is
  the logical-OR of these interrupt sources:
    • bit VSEIP of hvip;
    • the bit of hgeip selected by hstatus.VGEIN; and
    • any other platform-specific external interrupt signal directed to
      VS-level.
  Bits hip.VSTIP and hie.VSTIE are the interrupt-pending and interrupt-enable
  bits for VS-level timer interrupts. VSTIP is read-only in hip, and is the
  logical-OR of hvip.VSTIP and any other platform-specific timer interrupt
  signal directed to VS-level.
Thus, for these interrupt types, it is sufficient to use vcpu_set_interrupt()
and vcpu_unset_interrupt(), and flush them during the call of
vcpu_flush_interrupts().

As the AIA specification introduces the hviph register, which will need to
be updated once guest-related AIA code is added, vcpu_update_hvip() is
introduced instead of open-coding the CSR write in vcpu_flush_interrupts().

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - New patch.
---
 xen/arch/riscv/domain.c             | 65 +++++++++++++++++++++++++++++
 xen/arch/riscv/include/asm/domain.h |  3 ++
 2 files changed, 68 insertions(+)

diff --git a/xen/arch/riscv/domain.c b/xen/arch/riscv/domain.c
index 3777888f34ea..c078d595df9c 100644
--- a/xen/arch/riscv/domain.c
+++ b/xen/arch/riscv/domain.c
@@ -171,3 +171,68 @@ int vcpu_unset_interrupt(struct vcpu *v, unsigned int irq)
 
     return 0;
 }
+
+static void vcpu_update_hvip(struct vcpu *v)
+{
+    csr_write(CSR_HVIP, v->arch.hvip);
+}
+
+void vcpu_flush_interrupts(struct vcpu *v)
+{
+    register_t *hvip = &v->arch.hvip;
+
+    unsigned long mask, val;
+
+    if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
+    {
+        mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
+        val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
+
+        *hvip &= ~mask;
+        *hvip |= val;
+    }
+
+    /*
+     * Flush AIA high interrupts.
+     *
+     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
+     * now.
+     */
+#ifdef CONFIG_RISCV_32
+#   error "Update hviph"
+#endif
+
+    vcpu_update_hvip(v);
+}
+
+void vcpu_sync_interrupts(struct vcpu *v)
+{
+    unsigned long hvip;
+
+    /* Read current HVIP and VSIE CSRs */
+    v->arch.vsie = csr_read(CSR_VSIE);
+
+    /* Sync-up HVIP.VSSIP bit changes does by Guest */
+    hvip = csr_read(CSR_HVIP);
+    if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
+    {
+        if ( !test_and_set_bit(IRQ_VS_SOFT,
+                               &v->arch.irqs_pending_mask) )
+        {
+            if ( hvip & BIT(IRQ_VS_SOFT, UL) )
+                set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
+            else
+                clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
+        }
+    }
+
+    /*
+     * Sync-up AIA high interrupts.
+     *
+     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
+     * now.
+     */
+#ifdef CONFIG_RISCV_32
+#   error "Update vsieh"
+#endif
+}
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index b8178447c68f..fa083094b43e 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -136,6 +136,9 @@ static inline void arch_vcpu_block(struct vcpu *v) {}
 int vcpu_set_interrupt(struct vcpu *v, unsigned int irq);
 int vcpu_unset_interrupt(struct vcpu *v, unsigned int irq);
 
+void vcpu_flush_interrupts(struct vcpu *v);
+void vcpu_sync_interrupts(struct vcpu *v);
+
 #endif /* ASM__RISCV__DOMAIN_H */
 
 /*
-- 
2.52.0
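
The lockless scheme described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct vcpu_model and the model_*() helpers are hypothetical names, C11 atomics stand in for Xen's set_bit()/xchg() primitives, and the producer ordering (pending bit first, then mask bit) follows the description of part 1 rather than its actual code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-vCPU state; not Xen's struct vcpu. */
struct vcpu_model {
    _Atomic unsigned long irqs_pending;      /* requested level per IRQ */
    _Atomic unsigned long irqs_pending_mask; /* bits changed since last flush */
    unsigned long hvip;                      /* shadow of the HVIP CSR */
};

/* Producer: publish the new level first, then mark the bit as changed. */
static void model_set_interrupt(struct vcpu_model *v, unsigned int irq)
{
    atomic_fetch_or_explicit(&v->irqs_pending, 1UL << irq,
                             memory_order_relaxed);
    /* Release: the irqs_pending update is visible before the mask bit. */
    atomic_fetch_or_explicit(&v->irqs_pending_mask, 1UL << irq,
                             memory_order_release);
}

/* Producer: same ordering, but the requested level is "not pending". */
static void model_unset_interrupt(struct vcpu_model *v, unsigned int irq)
{
    atomic_fetch_and_explicit(&v->irqs_pending, ~(1UL << irq),
                              memory_order_relaxed);
    atomic_fetch_or_explicit(&v->irqs_pending_mask, 1UL << irq,
                             memory_order_release);
}

/* Consumer: the only place where mask bits are cleared. */
static void model_flush_interrupts(struct vcpu_model *v)
{
    if ( atomic_load_explicit(&v->irqs_pending_mask, memory_order_relaxed) )
    {
        unsigned long mask = atomic_exchange_explicit(&v->irqs_pending_mask,
                                                      0UL,
                                                      memory_order_acq_rel);
        unsigned long val = atomic_load_explicit(&v->irqs_pending,
                                                 memory_order_relaxed) & mask;

        /* Only bits named in the mask are touched in the shadow. */
        v->hvip = (v->hvip & ~mask) | val;
    }
}
```

A producer that races with the flush sets its mask bit either before the exchange, in which case it is consumed now, or after it, in which case the bit stays in irqs_pending_mask and is picked up by the next flush.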


Re: [PATCH v2 05/16] xen/riscv: introduce tracking of pending vCPU interrupts, part 2
Posted by Jan Beulich 1 week, 3 days ago
On 22.01.2026 17:47, Oleksii Kurochko wrote:
> This patch is based on Linux kernel 6.16.0.
> 
> Add the consumer side (vcpu_flush_interrupts()) of the lockless pending
> interrupt tracking introduced in part 1 (for producers). According, to the
> design only one consumer is possible, and it is vCPU itself.
> vcpu_flush_interrupts() is expected to be ran (as guests aren't ran now due
> to the lack of functionality) before the hypervisor returns control to the
> guest.
> 
> Producers may set bits in irqs_pending_mask without a lock. Clearing bits in
> irqs_pending_mask is performed only by the consumer via xchg() (with aquire &
> release semantics). The consumer must not write to irqs_pending and must not
> act on bits that are not set in the mask. Otherwise, extra synchronization
> should be provided.
> The worst thing which could happen with such approach is that a new pending
> bit will be set to irqs_pending bitmap during update of hvip variable in
> vcpu_flush_interrupt() but it isn't problem as the new pending bit won't
> be lost and just be proceded during the next flush.
> 
> It is possible a guest could have pending bit not result in the hardware
> register without to be marked pending in irq_pending bitmap as:
>   According to the RISC-V ISA specification:
>     Bits hip.VSSIP and hie.VSSIE are the interrupt-pending and
>     interrupt-enable  bits for VS-level software interrupts. VSSIP in hip
>     is an alias (writable) of the same bit in hvip.
>   Additionally:
>     When bit 2 of hideleg is zero, vsip.SSIP and vsie.SSIE are read-only
>     zeros. Else, vsip.SSIP and vsie.SSIE are aliases of hip.VSSIP and
>     hie.VSSIE.
> This means the guest may modify vsip.SSIP, which implicitly updates
> hip.VSSIP and the bit being writable with 1 would also trigger an interrupt
> as according to the RISC-V spec:
>   These conditions for an interrupt trap to occur must be evaluated in a
>   bounded   amount of time from when an interrupt becomes, or ceases to be,
>   pending in sip,  and must also be evaluated immediately following the
>   execution of an SRET  instruction or an explicit write to a CSR on which
>   these interrupt trap conditions expressly depend (including sip, sie and
>   sstatus).
> What means that IRQ_VS_SOFT must be synchronized separately, what is done
> in vcpu_sync_interrupts().

And this function is going to be used from where? Exit from guest into the
hypervisor? Whereas vcpu_flush_interrupts() is to be called ahead of
re-entering the guest?

I ask because vcpu_sync_interrupts() very much looks like a producer to me,
yet the patch here supposedly is the consumer side.

> --- a/xen/arch/riscv/domain.c
> +++ b/xen/arch/riscv/domain.c
> @@ -171,3 +171,68 @@ int vcpu_unset_interrupt(struct vcpu *v, unsigned int irq)
>  
>      return 0;
>  }
> +
> +static void vcpu_update_hvip(struct vcpu *v)

Pointer-to-const?

> +{
> +    csr_write(CSR_HVIP, v->arch.hvip);
> +}
> +
> +void vcpu_flush_interrupts(struct vcpu *v)
> +{
> +    register_t *hvip = &v->arch.hvip;
> +
> +    unsigned long mask, val;

These are used ...

> +    if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
> +    {
> +        mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
> +        val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
> +
> +        *hvip &= ~mask;
> +        *hvip |= val;

... solely in this more narrow scope.

> +    }
> +
> +    /*
> +     * Flush AIA high interrupts.
> +     *
> +     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
> +     * now.
> +     */
> +#ifdef CONFIG_RISCV_32
> +#   error "Update hviph"
> +#endif
> +
> +    vcpu_update_hvip(v);

Why would bits for which the mask bit wasn't set be written here?

> +void vcpu_sync_interrupts(struct vcpu *v)
> +{
> +    unsigned long hvip;
> +
> +    /* Read current HVIP and VSIE CSRs */
> +    v->arch.vsie = csr_read(CSR_VSIE);
> +
> +    /* Sync-up HVIP.VSSIP bit changes does by Guest */

Nit: s/does/done/ ?

> +    hvip = csr_read(CSR_HVIP);
> +    if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
> +    {
> +        if ( !test_and_set_bit(IRQ_VS_SOFT,
> +                               &v->arch.irqs_pending_mask) )

Why two separate, nested if()s?

> +        {
> +            if ( hvip & BIT(IRQ_VS_SOFT, UL) )
> +                set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
> +            else
> +                clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
> +        }

In the previous patch you set forth strict ordering rules, with a barrier in
the middle. All of this is violated here. 

> +    }
> +
> +    /*
> +     * Sync-up AIA high interrupts.
> +     *
> +     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
> +     * now.
> +     */
> +#ifdef CONFIG_RISCV_32
> +#   error "Update vsieh"
> +#endif

Here you mean the register or the struct vcpu field? It may be helpful to
disambiguate; assuming it's the latter, simply spell out v->arch.vsieh?
(Same then for the similar code in vcpu_flush_interrupts().)

Jan
Re: [PATCH v2 05/16] xen/riscv: introduce tracking of pending vCPU interrupts, part 2
Posted by Oleksii Kurochko 6 days, 17 hours ago
On 1/29/26 6:01 PM, Jan Beulich wrote:
> On 22.01.2026 17:47, Oleksii Kurochko wrote:
>> This patch is based on Linux kernel 6.16.0.
>>
>> Add the consumer side (vcpu_flush_interrupts()) of the lockless pending
>> interrupt tracking introduced in part 1 (for producers). According, to the
>> design only one consumer is possible, and it is vCPU itself.
>> vcpu_flush_interrupts() is expected to be ran (as guests aren't ran now due
>> to the lack of functionality) before the hypervisor returns control to the
>> guest.
>>
>> Producers may set bits in irqs_pending_mask without a lock. Clearing bits in
>> irqs_pending_mask is performed only by the consumer via xchg() (with aquire &
>> release semantics). The consumer must not write to irqs_pending and must not
>> act on bits that are not set in the mask. Otherwise, extra synchronization
>> should be provided.
>> The worst thing which could happen with such approach is that a new pending
>> bit will be set to irqs_pending bitmap during update of hvip variable in
>> vcpu_flush_interrupt() but it isn't problem as the new pending bit won't
>> be lost and just be proceded during the next flush.
>>
>> It is possible a guest could have pending bit not result in the hardware
>> register without to be marked pending in irq_pending bitmap as:
>>    According to the RISC-V ISA specification:
>>      Bits hip.VSSIP and hie.VSSIE are the interrupt-pending and
>>      interrupt-enable  bits for VS-level software interrupts. VSSIP in hip
>>      is an alias (writable) of the same bit in hvip.
>>    Additionally:
>>      When bit 2 of hideleg is zero, vsip.SSIP and vsie.SSIE are read-only
>>      zeros. Else, vsip.SSIP and vsie.SSIE are aliases of hip.VSSIP and
>>      hie.VSSIE.
>> This means the guest may modify vsip.SSIP, which implicitly updates
>> hip.VSSIP and the bit being writable with 1 would also trigger an interrupt
>> as according to the RISC-V spec:
>>    These conditions for an interrupt trap to occur must be evaluated in a
>>    bounded   amount of time from when an interrupt becomes, or ceases to be,
>>    pending in sip,  and must also be evaluated immediately following the
>>    execution of an SRET  instruction or an explicit write to a CSR on which
>>    these interrupt trap conditions expressly depend (including sip, sie and
>>    sstatus).
>> What means that IRQ_VS_SOFT must be synchronized separately, what is done
>> in vcpu_sync_interrupts().
> And this function is going to be used from where? Exit from guest into the
> hypervisor? Whereas vcpu_flush_interrupt() is to be called ahead of re-
> entering the guest?

Both of them are called before returning control to a guest (I forgot to
mention that in the commit message), at the end of do_trap():

static void check_for_pcpu_work(void)
{
     ...

     vcpu_flush_interrupts(current);

     vcpu_sync_interrupts(current);
}

void do_trap(struct cpu_user_regs *cpu_regs)
{
     ...
     if ( cpu_regs->hstatus & HSTATUS_SPV )
         check_for_pcpu_work();
}

>
> I ask because vcpu_sync_interrupts() very much looks like a producer to me,
> yet the patch here supposedly is the consumer side.

Yes, vcpu_sync_interrupts() belongs on the producer side; I'll move it to the
previous patch.

>
>> --- a/xen/arch/riscv/domain.c
>> +++ b/xen/arch/riscv/domain.c
>> @@ -171,3 +171,68 @@ int vcpu_unset_interrupt(struct vcpu *v, unsigned int irq)
>>   
>>       return 0;
>>   }
>> +
>> +static void vcpu_update_hvip(struct vcpu *v)
> Pointer-to-const?
>
>> +{
>> +    csr_write(CSR_HVIP, v->arch.hvip);
>> +}
>> +
>> +void vcpu_flush_interrupts(struct vcpu *v)
>> +{
>> +    register_t *hvip = &v->arch.hvip;
>> +
>> +    unsigned long mask, val;
> These are used ...
>
>> +    if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>> +    {
>> +        mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>> +        val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>> +
>> +        *hvip &= ~mask;
>> +        *hvip |= val;
> ... solely in this more narrow scope.

I'll declare them inside the if().

>
>> +    }
>> +
>> +    /*
>> +     * Flush AIA high interrupts.
>> +     *
>> +     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
>> +     * now.
>> +     */
>> +#ifdef CONFIG_RISCV_32
>> +#   error "Update hviph"
>> +#endif
>> +
>> +    vcpu_update_hvip(v);
> Why would bits for which the mask bit wasn't set be written here?

Internally, this function uses only v->arch.hvip, which is updated above
according to the mask.
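
To make the point concrete, here is a minimal sketch of the shadow update as a pure function. merge_hvip() is a hypothetical helper, not part of the patch; it mirrors the two assignments to *hvip in vcpu_flush_interrupts().

```c
#include <assert.h>

/*
 * Hypothetical helper: compute the new shadow HVIP value.
 * Bits outside 'mask' keep their previous shadow value; bits inside
 * 'mask' take the level recorded in 'pending'.
 */
static unsigned long merge_hvip(unsigned long hvip, unsigned long mask,
                                unsigned long pending)
{
    unsigned long val = pending & mask;

    hvip &= ~mask;
    hvip |= val;

    return hvip;
}
```

For example, with the shadow holding bit 10 (IRQ_VS_EXT), a mask of bit 6 (IRQ_VS_TIMER) and pending bits 6 and 2, the result has bits 10 and 6 set. Bit 2 is ignored because it is not in the mask. The full shadow is then written to the CSR, so unmasked bits are rewritten with the values the shadow already held.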


>
>> +void vcpu_sync_interrupts(struct vcpu *v)
>> +{
>> +    unsigned long hvip;
>> +
>> +    /* Read current HVIP and VSIE CSRs */
>> +    v->arch.vsie = csr_read(CSR_VSIE);
>> +
>> +    /* Sync-up HVIP.VSSIP bit changes does by Guest */
> Nit: s/does/done/ ?
>
>> +    hvip = csr_read(CSR_HVIP);
>> +    if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>> +    {
>> +        if ( !test_and_set_bit(IRQ_VS_SOFT,
>> +                               &v->arch.irqs_pending_mask) )
> Why two separate, nested if()s?

Do you mean that it could be:
   if ( !test_and_set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending_mask) &&
        (hvip & BIT(IRQ_VS_SOFT, UL)) )
?

>
>> +        {
>> +            if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>> +                set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>> +            else
>> +                clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>> +        }
> In the previous patch you set forth strict ordering rules, with a barrier in
> the middle. All of this is violated here.

It still respects the rule that the producer (vcpu_sync_interrupts(), which
should be in the producer path) never clears the mask and only writes to
irqs_pending if it is the one that flipped the corresponding mask bit from 0
to 1.

Considering that the consumer cannot be called concurrently in this case
(since vcpu_flush_interrupts() and vcpu_sync_interrupts() are only invoked
sequentially in check_for_pcpu_work(), as mentioned above), nothing can
clear a bit in the mask in between. Therefore, I think it is acceptable to
slightly bend the rule that the irqs_pending bit must be written first,
followed by updating the corresponding bit in irqs_pending_mask,
specifically for vcpu_sync_interrupts(). This should be explained in the
comment above the function or in the commit message.

>
>> +    }
>> +
>> +    /*
>> +     * Sync-up AIA high interrupts.
>> +     *
>> +     * It is necessary to do only for CONFIG_RISCV_32 which isn't supported
>> +     * now.
>> +     */
>> +#ifdef CONFIG_RISCV_32
>> +#   error "Update vsieh"
>> +#endif
> Here you mean the register or the struct vcpu field? It may be helpful to
> disambiguate; assuming it's the latter, simply spell out v->arch.vsieh?
> (Same then for the similar code in vcpu_flush_interrupts().)

Agree, it would be better.

Thanks.

~ Oleksii
Re: [PATCH v2 05/16] xen/riscv: introduce tracking of pending vCPU interrupts, part 2
Posted by Jan Beulich 6 days, 13 hours ago
On 02.02.2026 11:50, Oleksii Kurochko wrote:
> On 1/29/26 6:01 PM, Jan Beulich wrote:
>> On 22.01.2026 17:47, Oleksii Kurochko wrote:
>>> This patch is based on Linux kernel 6.16.0.
>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>> +{
>>> +    unsigned long hvip;
>>> +
>>> +    /* Read current HVIP and VSIE CSRs */
>>> +    v->arch.vsie = csr_read(CSR_VSIE);
>>> +
>>> +    /* Sync-up HVIP.VSSIP bit changes does by Guest */
>> Nit: s/does/done/ ?
>>
>>> +    hvip = csr_read(CSR_HVIP);
>>> +    if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>> +    {
>>> +        if ( !test_and_set_bit(IRQ_VS_SOFT,
>>> +                               &v->arch.irqs_pending_mask) )
>> Why two separate, nested if()s?
> 
> Do you mean that it could be:
>    if ( !test_and_set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending_mask) && (hvip & BIT(IRQ_VS_SOFT, UL))
> ?

That's combining with the if() ...

>>> +        {
>>> +            if ( hvip & BIT(IRQ_VS_SOFT, UL) )

... down here, which - ...

>>> +                set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>> +            else

... having an "else" - can't be folded like this, I think. I meant the two
if()s immediately ahead of my remark.

>>> +                clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>> +        }
>> In the previous patch you set forth strict ordering rules, with a barrier in
>> the middle. All of this is violated here.
> 
> It still respects the rule that the producer (vcpu_sync_interrupts() which
> should be in the producer path) never clears the mask and only writes to
> irqs_pending if it is the one that flipped the corresponding mask bit from 0
> to 1.
> 
> Considering that the consumer cannot be called concurrently in this case
> (since vcpu_flush_interrupts() and vcpu_sync_interrupts() are only invoked
> sequentially in check_for_pcpu_work(), as mentioned above), nothing can
> clear a bit in the mask in between. Therefore, I think it is acceptable to
> slightly bend (and it should be explained in the comment above the
> function or in the commit message) the rule that the irqs_pending bit must
> be written first, followed by updating the corresponding bit in
> irqs_pending_mask specifically for vcpu_sync_interrupts().

With suitable commenting - yes.

Jan
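
The two outcomes of this sub-thread (folding the outer pair of if()s into one condition, and documenting the relaxed ordering) could look roughly like the following. This is a sketch only, not the next revision of the patch: struct sync_model, BIT_UL() and model_test_and_set_bit() are hypothetical, non-atomic stand-ins, and the hardware HVIP value is passed in as a parameter instead of being read with csr_read().

```c
#include <assert.h>
#include <stdbool.h>

#define IRQ_VS_SOFT 2
#define BIT_UL(n)   (1UL << (n))

/* Hypothetical stand-in for the relevant struct vcpu fields. */
struct sync_model {
    unsigned long hvip;              /* shadow of the HVIP CSR */
    unsigned long irqs_pending;
    unsigned long irqs_pending_mask;
};

/* Non-atomic stand-in for test_and_set_bit(); returns the old bit value. */
static bool model_test_and_set_bit(unsigned int nr, unsigned long *addr)
{
    bool old = *addr & BIT_UL(nr);

    *addr |= BIT_UL(nr);

    return old;
}

static void model_sync_interrupts(struct sync_model *v, unsigned long hw_hvip)
{
    /*
     * Unlike other producers, the mask bit is set before irqs_pending is
     * written. This is assumed to be safe only because flush and sync are
     * invoked back to back for the current vCPU, so the consumer cannot
     * observe or clear the mask bit in between.
     */
    if ( ((v->hvip ^ hw_hvip) & BIT_UL(IRQ_VS_SOFT)) &&
         !model_test_and_set_bit(IRQ_VS_SOFT, &v->irqs_pending_mask) )
    {
        if ( hw_hvip & BIT_UL(IRQ_VS_SOFT) )
            v->irqs_pending |= BIT_UL(IRQ_VS_SOFT);
        else
            v->irqs_pending &= ~BIT_UL(IRQ_VS_SOFT);
    }
}
```

Because && short-circuits, the mask bit is only touched when the guest actually changed VSSIP, which matches the behaviour of the nested form.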