This patch is based on Linux kernel 6.16.0.
Introduce a lockless mechanism for tracking pending vCPU interrupts using
atomic bit operations. The design follows a multi-producer, single-consumer
model where the consumer is the vCPU itself.
Two bitmaps are added:
- irqs_pending — represents interrupts currently pending
- irqs_pending_mask — represents bits that have changed in irqs_pending
Introduce vcpu_(un)set_interrupt() to mark an interrupt in irqs_pending{_mask}
bitmap(s) to notify vCPU that it has or no an interrupt.
Other parts (such as vcpu_has_interrupts(), vcpu_flush_interrupts() and
vcpu_sync_interrupts()) of a lockless mechanism for tracking pending vCPU
interuupts are going to be introduced in a separate patch.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
xen/arch/riscv/domain.c | 47 +++++++++++++++++++++
xen/arch/riscv/include/asm/domain.h | 19 +++++++++
xen/arch/riscv/include/asm/riscv_encoding.h | 1 +
3 files changed, 67 insertions(+)
diff --git a/xen/arch/riscv/domain.c b/xen/arch/riscv/domain.c
index 164ab14a5209..8a010ae5b47e 100644
--- a/xen/arch/riscv/domain.c
+++ b/xen/arch/riscv/domain.c
@@ -5,9 +5,11 @@
#include <xen/sched.h>
#include <xen/smp.h>
+#include <asm/bitops.h>
#include <asm/cpufeature.h>
#include <asm/csr.h>
#include <asm/riscv_encoding.h>
+#include <asm/system.h>
#include <asm/vtimer.h>
static void vcpu_csr_init(struct vcpu *v)
@@ -100,6 +102,9 @@ int arch_vcpu_create(struct vcpu *v)
if ( is_idle_vcpu(v) )
return rc;
+ bitmap_zero(v->arch.irqs_pending, RISCV_VCPU_NR_IRQS);
+ bitmap_zero(v->arch.irqs_pending_mask, RISCV_VCPU_NR_IRQS);
+
if ( (rc = vcpu_vtimer_init(v)) )
goto fail;
@@ -135,3 +140,45 @@ void vcpu_kick(struct vcpu *v)
smp_send_event_check_mask(cpumask_of(v->processor));
}
}
+
+int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq)
+{
+ /*
+ * We only allow VS-mode software, timer, and external
+ * interrupts when irq is one of the local interrupts
+ * defined by RISC-V privilege specification.
+ */
+ if ( irq < IRQ_LOCAL_MAX &&
+ irq != IRQ_VS_SOFT &&
+ irq != IRQ_VS_TIMER &&
+ irq != IRQ_VS_EXT )
+ return -EINVAL;
+
+ set_bit(irq, v->arch.irqs_pending);
+ smp_mb__before_atomic();
+ set_bit(irq, v->arch.irqs_pending_mask);
+
+ vcpu_kick(v);
+
+ return 0;
+}
+
+int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq)
+{
+ /*
+ * We only allow VS-mode software, timer, external
+ * interrupts when irq is one of the local interrupts
+ * defined by RISC-V privilege specification.
+ */
+ if ( irq < IRQ_LOCAL_MAX &&
+ irq != IRQ_VS_SOFT &&
+ irq != IRQ_VS_TIMER &&
+ irq != IRQ_VS_EXT )
+ return -EINVAL;
+
+ clear_bit(irq, v->arch.irqs_pending);
+ smp_mb__before_atomic();
+ set_bit(irq, v->arch.irqs_pending_mask);
+
+ return 0;
+}
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index be7ddaff30e7..a7538e0dc966 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -85,6 +85,22 @@ struct arch_vcpu
register_t vstval;
register_t vsatp;
register_t vsepc;
+
+ /*
+ * VCPU interrupts
+ *
+ * We have a lockless approach for tracking pending VCPU interrupts
+ * implemented using atomic bitops. The irqs_pending bitmap represent
+ * pending interrupts whereas irqs_pending_mask represent bits changed
+ * in irqs_pending. Our approach is modeled around multiple producer
+ * and single consumer problem where the consumer is the VCPU itself.
+ *
+ * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
+ * on RV32 host.
+ */
+#define RISCV_VCPU_NR_IRQS 64
+ DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
+ DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
} __cacheline_aligned;
struct paging_domain {
@@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu *v,
static inline void arch_vcpu_block(struct vcpu *v) {}
+int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);
+int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);
+
#endif /* ASM__RISCV__DOMAIN_H */
/*
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index dd15731a86fa..32d25f2d3e94 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -91,6 +91,7 @@
#define IRQ_M_EXT 11
#define IRQ_S_GEXT 12
#define IRQ_PMU_OVF 13
+#define IRQ_LOCAL_MAX (IRQ_PMU_OVF + 1)
#define MIP_SSIP (_UL(1) << IRQ_S_SOFT)
#define MIP_VSSIP (_UL(1) << IRQ_VS_SOFT)
--
2.52.0
On 24.12.2025 18:03, Oleksii Kurochko wrote:
> This patch is based on Linux kernel 6.16.0.
>
> Introduce a lockless mechanism for tracking pending vCPU interrupts using
> atomic bit operations. The design follows a multi-producer, single-consumer
> model where the consumer is the vCPU itself.
>
> Two bitmaps are added:
> - irqs_pending — represents interrupts currently pending
> - irqs_pending_mask — represents bits that have changed in irqs_pending
>
> Introduce vcpu_(un)set_interrupt() to mark an interrupt in irqs_pending{_mask}
> bitmap(s) to notify vCPU that it has or no an interrupt.
It's not becoming clear how these are going to be used. It's also not clear
to me whether you really need to record these in software: Aren't there
(virtual) registers where they would be more naturally tracked, much like
hardware would do?
Furthermore, since you're dealing with two bitmaps, there's no full
atomicity here anyway. The bitmaps are each dealt with atomically, but
the overall update isn't atomic. Whether that's going to be okay can only
be told when also seeing the producer side.
> --- a/xen/arch/riscv/domain.c
> +++ b/xen/arch/riscv/domain.c
> @@ -5,9 +5,11 @@
> #include <xen/sched.h>
> #include <xen/smp.h>
>
> +#include <asm/bitops.h>
> #include <asm/cpufeature.h>
> #include <asm/csr.h>
> #include <asm/riscv_encoding.h>
> +#include <asm/system.h>
> #include <asm/vtimer.h>
>
> static void vcpu_csr_init(struct vcpu *v)
> @@ -100,6 +102,9 @@ int arch_vcpu_create(struct vcpu *v)
> if ( is_idle_vcpu(v) )
> return rc;
>
> + bitmap_zero(v->arch.irqs_pending, RISCV_VCPU_NR_IRQS);
> + bitmap_zero(v->arch.irqs_pending_mask, RISCV_VCPU_NR_IRQS);
This is pointless, as struct vcpu starts out all zero.
> @@ -135,3 +140,45 @@ void vcpu_kick(struct vcpu *v)
> smp_send_event_check_mask(cpumask_of(v->processor));
> }
> }
> +
> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq)
> +{
> + /*
> + * We only allow VS-mode software, timer, and external
> + * interrupts when irq is one of the local interrupts
> + * defined by RISC-V privilege specification.
> + */
> + if ( irq < IRQ_LOCAL_MAX &&
What use is this? In particular this allows an incoming irq with a huge
number to ...
> + irq != IRQ_VS_SOFT &&
> + irq != IRQ_VS_TIMER &&
> + irq != IRQ_VS_EXT )
> + return -EINVAL;
> +
> + set_bit(irq, v->arch.irqs_pending);
> + smp_mb__before_atomic();
> + set_bit(irq, v->arch.irqs_pending_mask);
... overrun both bitmaps.
> --- a/xen/arch/riscv/include/asm/domain.h
> +++ b/xen/arch/riscv/include/asm/domain.h
> @@ -85,6 +85,22 @@ struct arch_vcpu
> register_t vstval;
> register_t vsatp;
> register_t vsepc;
> +
> + /*
> + * VCPU interrupts
> + *
> + * We have a lockless approach for tracking pending VCPU interrupts
> + * implemented using atomic bitops. The irqs_pending bitmap represent
> + * pending interrupts whereas irqs_pending_mask represent bits changed
> + * in irqs_pending.
And hence a set immediately followed by an unset is then indistinguishable
from just an unset (or the other way around). This may not be a problem, but
if it isn't, I think this needs explaining. Much like it is unclear why the
"changed" state needs tracking in the first place.
> Our approach is modeled around multiple producer
> + * and single consumer problem where the consumer is the VCPU itself.
> + *
> + * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
> + * on RV32 host.
> + */
> +#define RISCV_VCPU_NR_IRQS 64
> + DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
> + DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
> } __cacheline_aligned;
>
> struct paging_domain {
> @@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu *v,
>
> static inline void arch_vcpu_block(struct vcpu *v) {}
>
> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);
> +int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);
Why the const-s?
> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -91,6 +91,7 @@
> #define IRQ_M_EXT 11
> #define IRQ_S_GEXT 12
> #define IRQ_PMU_OVF 13
> +#define IRQ_LOCAL_MAX (IRQ_PMU_OVF + 1)
MAX together with "+ 1" looks wrong. What is 14 (which, when MAX is 14,
must be a valid interrupt)? Or if 14 isn't a valid interrupt, please use
NR or NUM.
Also, nit: Padding doesn't match with the earlier #define-s (even if in the
quoted text it appears otherwise).
Jan
On 1/7/26 5:28 PM, Jan Beulich wrote: >> + >> + /* >> + * VCPU interrupts >> + * >> + * We have a lockless approach for tracking pending VCPU interrupts >> + * implemented using atomic bitops. The irqs_pending bitmap represent >> + * pending interrupts whereas irqs_pending_mask represent bits changed >> + * in irqs_pending. > And hence a set immediately followed by an unset is then indistinguishable > from just an unset (or the other way around). This may not be a problem, but > if it isn't, I think this needs explaining. I am still not sure that this is actually a problem, or what kind of explanation is needed. |unset| is called only when the guest makes such a request, and the guest will make that request only after it has received an interrupt that was previously set in the|irq_pending| bitmap and then flushed to the hardware HVIP. If an interrupt is simply set and then unset without ever being flushed to the hardware HVIP, it seems there would be no issue, since it would not affect the guest. However, the question of why this happened at all would still remain. Do I miss some corner cases which should be taken into account? Should I still have to add some extra explanation to the comment or commit message? ~ Oleksii
On 16.01.2026 15:25, Oleksii Kurochko wrote: > > On 1/7/26 5:28 PM, Jan Beulich wrote: >>> + >>> + /* >>> + * VCPU interrupts >>> + * >>> + * We have a lockless approach for tracking pending VCPU interrupts >>> + * implemented using atomic bitops. The irqs_pending bitmap represent >>> + * pending interrupts whereas irqs_pending_mask represent bits changed >>> + * in irqs_pending. >> And hence a set immediately followed by an unset is then indistinguishable >> from just an unset (or the other way around). This may not be a problem, but >> if it isn't, I think this needs explaining. > > I am still not sure that this is actually a problem, or what kind of explanation > is needed. > |unset| is called only when the guest makes such a request, and the guest will > make that request only after it has received an interrupt that was previously > set in the|irq_pending| bitmap and then flushed to the hardware HVIP. > > If an interrupt is simply set and then unset without ever being flushed to the > hardware HVIP, it seems there would be no issue, since it would not affect the > guest. However, the question of why this happened at all would still remain. > > Do I miss some corner cases which should be taken into account? > Should I still have to add some extra explanation to the comment or commit > message? Perhaps the problem is that really I can't see the full picture yet, for the series not putting everything in place that is related. Jan
On 1/7/26 5:28 PM, Jan Beulich wrote:
> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>> This patch is based on Linux kernel 6.16.0.
>>
>> Introduce a lockless mechanism for tracking pending vCPU interrupts using
>> atomic bit operations. The design follows a multi-producer, single-consumer
>> model where the consumer is the vCPU itself.
>>
>> Two bitmaps are added:
>> - irqs_pending — represents interrupts currently pending
>> - irqs_pending_mask — represents bits that have changed in irqs_pending
>>
>> Introduce vcpu_(un)set_interrupt() to mark an interrupt in irqs_pending{_mask}
>> bitmap(s) to notify vCPU that it has or no an interrupt.
> It's not becoming clear how these are going to be used. It's also not clear
> to me whether you really need to record these in software: Aren't there
> (virtual) registers where they would be more naturally tracked, much like
> hardware would do?
Guest (virtual) registers are not used to inject interrupts on RISC-V; for that
purpose, the HVIP register is provided. Even without considering HVIP, using guest
(virtual) registers has a downside: if a bit in hideleg is zero, the corresponding
bit in VSIP is read-only zero. During a context_switch(), when CSRs are saved,
this means we would not obtain correct values, since some VSIP bits may read as
zero during csr_read().
In fact, this is one of the reasons why we want to track interrupts to be
injected separately. For example, a vtimer may expire while the vCPU is running
on a different pCPU, so we update vCPU->hvip while the vCPU is active elsewhere.
When the vCPU is later switched in during a context_switch(), we would lose the
fact that vCPU->hvip.vtimer was set to 1, because the CSR save function will do:
vCPU->hvip = csr_read(CSR_HVIP);
and the pending interrupt state would be overwritten.
>
> Furthermore, since you're dealing with two bitmaps, there's no full
> atomicity here anyway. The bitmaps are each dealt with atomically, but
> the overall update isn't atomic. Whether that's going to be okay can only
> be told when also seeing the producer side.
You're correct that the two-bitmap update isn't fully atomic, but this design
is intentional. Here [1], other is the part 2 of introduction of pending vCPU interrupts
and as it requires more stuff to introduce (for example, [2]) I decided not to
introduce it now with some stubs and introduce it when all will be ready for it.
If a producer is interrupted between updating the two bitmaps the worst case is:
vCPU might process stale state for one cycle, this is resolved on the next flush when
the mask indicates the bit changed. No interrupt is permanently lost or spuriously
generated.
[1] https://gitlab.com/xen-project/people/olkur/xen/-/commit/31022d515789a032fd994f9ca90965db089dbbd5
void vcpu_flush_interrupts(struct vcpu *v)
{
register_t *hvip = &v->arch.hvip;
unsigned long mask, val;
if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
{
mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
*hvip &= ~mask;
*hvip |= val;
}
/* Flush AIA high interrupts */
vcpu_aia_flush_interrupts(v);
vcpu_update_hvip(v);
}
void vcpu_sync_interrupts(struct vcpu *v)
{
unsigned long hvip;
/* Read current HVIP and VSIE CSRs */
v->arch.vsie = csr_read(CSR_VSIE);
/* Sync-up HVIP.VSSIP bit changes does by Guest */
hvip = csr_read(CSR_HVIP);
if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
{
if ( hvip & BIT(IRQ_VS_SOFT, UL) )
{
if ( !test_and_set_bit(IRQ_VS_SOFT,
&v->arch.irqs_pending_mask) )
set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
}
else
{
if ( !test_and_set_bit(IRQ_VS_SOFT,
&v->arch.irqs_pending_mask) )
clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
}
}
/* Sync-up AIA high interrupts */
vcpu_aia_sync_interrupts(v);
/* Sync-up timer CSRs */
vtimer_sync(v);
}
[2] https://gitlab.com/xen-project/people/olkur/xen/-/commit/1c06b8b1d1eadfe009a4d6b1a1902fac64d080e9
>
>> --- a/xen/arch/riscv/domain.c
>> +++ b/xen/arch/riscv/domain.c
>> @@ -5,9 +5,11 @@
>> #include <xen/sched.h>
>> #include <xen/smp.h>
>>
>> +#include <asm/bitops.h>
>> #include <asm/cpufeature.h>
>> #include <asm/csr.h>
>> #include <asm/riscv_encoding.h>
>> +#include <asm/system.h>
>> #include <asm/vtimer.h>
>>
>> static void vcpu_csr_init(struct vcpu *v)
>> @@ -100,6 +102,9 @@ int arch_vcpu_create(struct vcpu *v)
>> if ( is_idle_vcpu(v) )
>> return rc;
>>
>> + bitmap_zero(v->arch.irqs_pending, RISCV_VCPU_NR_IRQS);
>> + bitmap_zero(v->arch.irqs_pending_mask, RISCV_VCPU_NR_IRQS);
> This is pointless, as struct vcpu starts out all zero.
>
>> @@ -135,3 +140,45 @@ void vcpu_kick(struct vcpu *v)
>> smp_send_event_check_mask(cpumask_of(v->processor));
>> }
>> }
>> +
>> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq)
>> +{
>> + /*
>> + * We only allow VS-mode software, timer, and external
>> + * interrupts when irq is one of the local interrupts
>> + * defined by RISC-V privilege specification.
>> + */
>> + if ( irq < IRQ_LOCAL_MAX &&
> What use is this? In particular this allows an incoming irq with a huge
> number to ...
>
>> + irq != IRQ_VS_SOFT &&
>> + irq != IRQ_VS_TIMER &&
>> + irq != IRQ_VS_EXT )
>> + return -EINVAL;
>> +
>> + set_bit(irq, v->arch.irqs_pending);
>> + smp_mb__before_atomic();
>> + set_bit(irq, v->arch.irqs_pending_mask);
> ... overrun both bitmaps.
Agree, it would be better just to drop "irq < IRQ_LOCAL_MAX &&".
>
>> --- a/xen/arch/riscv/include/asm/domain.h
>> +++ b/xen/arch/riscv/include/asm/domain.h
>> @@ -85,6 +85,22 @@ struct arch_vcpu
>> register_t vstval;
>> register_t vsatp;
>> register_t vsepc;
>> +
>> + /*
>> + * VCPU interrupts
>> + *
>> + * We have a lockless approach for tracking pending VCPU interrupts
>> + * implemented using atomic bitops. The irqs_pending bitmap represent
>> + * pending interrupts whereas irqs_pending_mask represent bits changed
>> + * in irqs_pending.
> And hence a set immediately followed by an unset is then indistinguishable
> from just an unset (or the other way around).
I think it is distinguishable with the combination of irqs_pending_mask.
> This may not be a problem, but
> if it isn't, I think this needs explaining. Much like it is unclear why the
> "changed" state needs tracking in the first place.
It is needed to track which bits are changed, irqs_pending only represents
the current state of pending interrupts.CPU might want to react to changes
rather than the absolute state.
Example:
- If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
to inject it into the VCPU.
- If CPU 0 sets and then clears the bit before CPU 1 reads it,
irqs_pending alone shows 0, the transition is lost.
By maintaining irqs_pending_mask, you can detect “this bit changed
recently,” even if the final state is 0.
Also, having irqs_pending_mask allows to flush interrupts without lock:
if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
{
mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
*hvip &= ~mask;
*hvip |= val;
}
Without it I assume that we should have spinlcok around access to irqs_pending.
>
>> Our approach is modeled around multiple producer
>> + * and single consumer problem where the consumer is the VCPU itself.
>> + *
>> + * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
>> + * on RV32 host.
>> + */
>> +#define RISCV_VCPU_NR_IRQS 64
>> + DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
>> + DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
>> } __cacheline_aligned;
>>
>> struct paging_domain {
>> @@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu *v,
>>
>> static inline void arch_vcpu_block(struct vcpu *v) {}
>>
>> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);
>> +int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);
> Why the const-s?
As irq number isn't going to be changed inside these functions.
>
>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>> @@ -91,6 +91,7 @@
>> #define IRQ_M_EXT 11
>> #define IRQ_S_GEXT 12
>> #define IRQ_PMU_OVF 13
>> +#define IRQ_LOCAL_MAX (IRQ_PMU_OVF + 1)
> MAX together with "+ 1" looks wrong. What is 14 (which, when MAX is 14,
> must be a valid interrupt)? Or if 14 isn't a valid interrupt, please use
> NR or NUM.
I didn’t fully understand your idea. Are you suggesting having|IRQ_LOCAL_NR|?
That sounds unclear, as it’s not obvious what it would represent.
Using|MAX_HART| seems better, since it represents the maximum number allowed
for a local interrupt. Any IRQ below that value is considered local, while
values above it are implementation-specific interrupts.
> Also, nit: Padding doesn't match with the earlier #define-s (even if in the
> quoted text it appears otherwise).
Thanks.
~ Oleksii
On 13.01.2026 13:51, Oleksii Kurochko wrote:
> On 1/7/26 5:28 PM, Jan Beulich wrote:
>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/domain.h
>>> +++ b/xen/arch/riscv/include/asm/domain.h
>>> @@ -85,6 +85,22 @@ struct arch_vcpu
>>> register_t vstval;
>>> register_t vsatp;
>>> register_t vsepc;
>>> +
>>> + /*
>>> + * VCPU interrupts
>>> + *
>>> + * We have a lockless approach for tracking pending VCPU interrupts
>>> + * implemented using atomic bitops. The irqs_pending bitmap represent
>>> + * pending interrupts whereas irqs_pending_mask represent bits changed
>>> + * in irqs_pending.
>> And hence a set immediately followed by an unset is then indistinguishable
>> from just an unset (or the other way around).
>
> I think it is distinguishable with the combination of irqs_pending_mask.
No. The set mask bit tells you that there was a change. But irqs_pending[]
records only the most recent set / clear.
>> This may not be a problem, but
>> if it isn't, I think this needs explaining. Much like it is unclear why the
>> "changed" state needs tracking in the first place.
>
> It is needed to track which bits are changed, irqs_pending only represents
> the current state of pending interrupts.CPU might want to react to changes
> rather than the absolute state.
>
> Example:
> - If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
> to inject it into the VCPU.
> - If CPU 0 sets and then clears the bit before CPU 1 reads it,
> irqs_pending alone shows 0, the transition is lost.
The fact there was any number of transitions is recorded in _mask[], yes,
but "the transition" was still lost if we consider the "set" in your
example in isolation. And it's not quite clear to me what's interesting
about a 0 -> 0 transition. (On x86, such a lost 0 -> 1 transition, i.e.
one followed directly by a 1 -> 0 one, would result in a "spurious
interrupt": There would be an indication that there was a lost interrupt
without there being a way to know which one it was.)
> By maintaining irqs_pending_mask, you can detect “this bit changed
> recently,” even if the final state is 0.
>
> Also, having irqs_pending_mask allows to flush interrupts without lock:
> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
> {
> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>
> *hvip &= ~mask;
> *hvip |= val;
> }
> Without it I assume that we should have spinlcok around access to irqs_pending.
Ah yes, this would indeed be a benefit. Just that it's not quite clear to
me:
*hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
wouldn't require a lock either. What may be confusing me is that you put
things as if it was normal to see 1 -> 0 transitions from (virtual)
hardware, when I (with my x86 background) would expect 1 -> 0 transitions
to only occur due to software actions (End Of Interrupt), unless - see
above - something malfunctioned and an interrupt was lost. That (the 1 ->
0 transitions) could be (guest) writes to SVIP, for example.
Talking of which - do you really mean HVIP in the code you provided, not
VSVIP? So far I my understanding was that HVIP would be recording the
interrupts the hypervisor itself has pending (and needs to service).
>>> Our approach is modeled around multiple producer
>>> + * and single consumer problem where the consumer is the VCPU itself.
>>> + *
>>> + * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
>>> + * on RV32 host.
>>> + */
>>> +#define RISCV_VCPU_NR_IRQS 64
>>> + DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
>>> + DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
>>> } __cacheline_aligned;
>>>
>>> struct paging_domain {
>>> @@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu *v,
>>>
>>> static inline void arch_vcpu_block(struct vcpu *v) {}
>>>
>>> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);
>>> +int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);
>> Why the const-s?
>
> As irq number isn't going to be changed inside these functions.
You realize though that we don't normally use const like this? This
use of qualifiers is meaningless to callers, and of limited meaning to
the function definition itself. There can be exceptions of course, when
it is important to clarify that a parameter must not change throughout
the function.
>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>> @@ -91,6 +91,7 @@
>>> #define IRQ_M_EXT 11
>>> #define IRQ_S_GEXT 12
>>> #define IRQ_PMU_OVF 13
>>> +#define IRQ_LOCAL_MAX (IRQ_PMU_OVF + 1)
>> MAX together with "+ 1" looks wrong. What is 14 (which, when MAX is 14,
>> must be a valid interrupt)? Or if 14 isn't a valid interrupt, please use
>> NR or NUM.
>
> I didn’t fully understand your idea. Are you suggesting having|IRQ_LOCAL_NR|?
> That sounds unclear, as it’s not obvious what it would represent.
> Using|MAX_HART| seems better, since it represents the maximum number allowed
> for a local interrupt. Any IRQ below that value is considered local, while
> values above it are implementation-specific interrupts.
Not quite. If you say "max", anything below _or equal_ that value is
valid / covered. When you say "num", anything below that value is
valid / covered. That is, "max" is inclusive for the upper bound of
the range, while "num" is exclusive. Hence my question whether 14 is
a valid local interrupt.
Jan
On 1/13/26 2:54 PM, Jan Beulich wrote:
> On 13.01.2026 13:51, Oleksii Kurochko wrote:
>> On 1/7/26 5:28 PM, Jan Beulich wrote:
>>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/include/asm/domain.h
>>>> +++ b/xen/arch/riscv/include/asm/domain.h
>>>> @@ -85,6 +85,22 @@ struct arch_vcpu
>>>> register_t vstval;
>>>> register_t vsatp;
>>>> register_t vsepc;
>>>> +
>>>> + /*
>>>> + * VCPU interrupts
>>>> + *
>>>> + * We have a lockless approach for tracking pending VCPU interrupts
>>>> + * implemented using atomic bitops. The irqs_pending bitmap represent
>>>> + * pending interrupts whereas irqs_pending_mask represent bits changed
>>>> + * in irqs_pending.
>>> And hence a set immediately followed by an unset is then indistinguishable
>>> from just an unset (or the other way around).
>> I think it is distinguishable with the combination of irqs_pending_mask.
> No. The set mask bit tells you that there was a change. But irqs_pending[]
> records only the most recent set / clear.
>
>>> This may not be a problem, but
>>> if it isn't, I think this needs explaining. Much like it is unclear why the
>>> "changed" state needs tracking in the first place.
>> It is needed to track which bits are changed, irqs_pending only represents
>> the current state of pending interrupts.CPU might want to react to changes
>> rather than the absolute state.
>>
>> Example:
>> - If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
>> to inject it into the VCPU.
>> - If CPU 0 sets and then clears the bit before CPU 1 reads it,
>> irqs_pending alone shows 0, the transition is lost.
> The fact there was any number of transitions is recorded in _mask[], yes,
> but "the transition" was still lost if we consider the "set" in your
> example in isolation. And it's not quite clear to me what's interesting
> about a 0 -> 0 transition. (On x86, such a lost 0 -> 1 transition, i.e.
> one followed directly by a 1 -> 0 one, would result in a "spurious
> interrupt": There would be an indication that there was a lost interrupt
> without there being a way to know which one it was.)
IIUC, in this reply you are talking about when the contents written to the
irq_pending and irqs_pending_mask bitmaps are flushed to the hardware
registers.
Originally, I understood your question to be about the case where
vcpu_set_interrupt() is called and then vcpu_unset_interrupt() is called.
I am trying to understand whether such a scenario is possible.
Let’s take the vtimer as an example. vcpu_set_interrupt(t->v, IRQ_VS_TIMER)
is not called again until vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) and
set_timer() are called in vtimer_set_timer().
The opposite situation is not possible: it cannot happen that
vcpu_set_interrupt(t->v, IRQ_VS_TIMER) is called and then immediately
vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) is called, because
vcpu_unset_interrupt() and set_timer() are only invoked when the guest
has handled the timer interrupt and requested a new one.
So if no interrupt flush is happening, the vcpu_set_interrupt() →
vcpu_unset_interrupt() sequence will only update the irq_pending and
irqs_pending_mask bitmaps, without touching the hardware registers,
so no spurious interrupt will occur. And if an interrupt flush does
happen, it is not possible to have a 1 -> 0 transition due to the call
sequence I mentioned in the last two paragraphs above.
>
>> By maintaining irqs_pending_mask, you can detect “this bit changed
>> recently,” even if the final state is 0.
>>
>> Also, having irqs_pending_mask allows to flush interrupts without lock:
>> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>> {
>> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>>
>> *hvip &= ~mask;
>> *hvip |= val;
>> }
>> Without it I assume that we should have spinlcok around access to irqs_pending.
> Ah yes, this would indeed be a benefit. Just that it's not quite clear to
> me:
>
> *hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
>
> wouldn't require a lock either
Because vCPU's hvip (which is stored on the stack) can't be changed concurrently
and it's almost the one place in the code where vCPU->hvip is changed. Another
place it is save_csrs() during context switch but it can't be called in parallel
with the vcpu_sync_interrupts() (look below).
> . What may be confusing me is that you put
> things as if it was normal to see 1 -> 0 transitions from (virtual)
> hardware, when I (with my x86 background) would expect 1 -> 0 transitions
> to only occur due to software actions (End Of Interrupt), unless - see
> above - something malfunctioned and an interrupt was lost. That (the 1 ->
> 0 transitions) could be (guest) writes to SVIP, for example.
>
> Talking of which - do you really mean HVIP in the code you provided, not
> VSVIP? So far I my understanding was that HVIP would be recording the
> interrupts the hypervisor itself has pending (and needs to service).
HVIP is correct to use here, HVIP is used to indicate virtual interrupts
intended for VS-mode. And I think you confused HVIP with the HIP register
which supplements the standard supervisor-level SIP register to indicate
pending virtual supervisor (VS-level) interrupts and hypervisor-specific
interrupts.
If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
to SVIP, for example." then the correspondent HVIP (and HIP as usually
they are aliasis of HVIP) bits will be updated. And that is why we need
vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
+void vcpu_sync_interrupts(struct vcpu *v)
+{
+ unsigned long hvip;
+
+ /* Read current HVIP and VSIE CSRs */
+ v->arch.vsie = csr_read(CSR_VSIE);
+
+ /* Sync-up HVIP.VSSIP bit changes does by Guest */
+ hvip = csr_read(CSR_HVIP);
+ if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
+ {
+ if ( hvip & BIT(IRQ_VS_SOFT, UL) )
+ {
+ if ( !test_and_set_bit(IRQ_VS_SOFT,
+ &v->arch.irqs_pending_mask) )
+ set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
+ }
+ else
+ {
+ if ( !test_and_set_bit(IRQ_VS_SOFT,
+ &v->arch.irqs_pending_mask) )
+ clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
+ }
+ }
+
+ /* Sync-up AIA high interrupts */
+ vcpu_aia_sync_interrupts(v);
+
+ /* Sync-up timer CSRs */
+ vtimer_sync(v);
+}
>
>>>> Our approach is modeled around multiple producer
>>>> + * and single consumer problem where the consumer is the VCPU itself.
>>>> + *
>>>> + * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
>>>> + * on RV32 host.
>>>> + */
>>>> +#define RISCV_VCPU_NR_IRQS 64
>>>> + DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
>>>> + DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
>>>> } __cacheline_aligned;
>>>>
>>>> struct paging_domain {
>>>> @@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu *v,
>>>>
>>>> static inline void arch_vcpu_block(struct vcpu *v) {}
>>>>
>>>> +int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);
>>>> +int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);
>>> Why the const-s?
>> As irq number isn't going to be changed inside these functions.
> You realize though that we don't normally use const like this? This
> use of qualifiers is meaningless to callers, and of limited meaning to
> the function definition itself. There can be exceptions of course, when
> it is important to clarify that a parameter must not change throughout
> the function.
>
>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> @@ -91,6 +91,7 @@
>>>> #define IRQ_M_EXT 11
>>>> #define IRQ_S_GEXT 12
>>>> #define IRQ_PMU_OVF 13
>>>> +#define IRQ_LOCAL_MAX (IRQ_PMU_OVF + 1)
>>> MAX together with "+ 1" looks wrong. What is 14 (which, when MAX is 14,
>>> must be a valid interrupt)? Or if 14 isn't a valid interrupt, please use
>>> NR or NUM.
>> I didn’t fully understand your idea. Are you suggesting having|IRQ_LOCAL_NR|?
>> That sounds unclear, as it’s not obvious what it would represent.
>> Using|MAX_HART| seems better, since it represents the maximum number allowed
>> for a local interrupt. Any IRQ below that value is considered local, while
>> values above it are implementation-specific interrupts.
> Not quite. If you say "max", anything below _or equal_ that value is
> valid / covered. When you say "num", anything below that value is
> valid / covered. That is, "max" is inclusive for the upper bound of
> the range, while "num" is exclusive. Hence my question whether 14 is
> a valid local interrupt.
14 is architecturally classified as a local interrupt, but its specific
function is currently reserved.
Intention was to cover standard portion (bits 15:0) of sip for which bits
15 and 14 are 0 as they are reserved, so it seems like NUM could be used here.
~ Oleksii
On 14.01.2026 16:39, Oleksii Kurochko wrote:
> On 1/13/26 2:54 PM, Jan Beulich wrote:
>> On 13.01.2026 13:51, Oleksii Kurochko wrote:
>>> On 1/7/26 5:28 PM, Jan Beulich wrote:
>>>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/include/asm/domain.h
>>>>> +++ b/xen/arch/riscv/include/asm/domain.h
>>>>> @@ -85,6 +85,22 @@ struct arch_vcpu
>>>>> register_t vstval;
>>>>> register_t vsatp;
>>>>> register_t vsepc;
>>>>> +
>>>>> + /*
>>>>> + * VCPU interrupts
>>>>> + *
>>>>> + * We have a lockless approach for tracking pending VCPU interrupts
>>>>> + * implemented using atomic bitops. The irqs_pending bitmap represent
>>>>> + * pending interrupts whereas irqs_pending_mask represent bits changed
>>>>> + * in irqs_pending.
>>>> And hence a set immediately followed by an unset is then indistinguishable
>>>> from just an unset (or the other way around).
>>> I think it is distinguishable with the combination of irqs_pending_mask.
>> No. The set mask bit tells you that there was a change. But irqs_pending[]
>> records only the most recent set / clear.
>>
>>>> This may not be a problem, but
>>>> if it isn't, I think this needs explaining. Much like it is unclear why the
>>>> "changed" state needs tracking in the first place.
>>> It is needed to track which bits are changed, irqs_pending only represents
>>> the current state of pending interrupts.CPU might want to react to changes
>>> rather than the absolute state.
>>>
>>> Example:
>>> - If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
>>> to inject it into the VCPU.
>>> - If CPU 0 sets and then clears the bit before CPU 1 reads it,
>>> irqs_pending alone shows 0, the transition is lost.
>> The fact there was any number of transitions is recorded in _mask[], yes,
>> but "the transition" was still lost if we consider the "set" in your
>> example in isolation. And it's not quite clear to me what's interesting
>> about a 0 -> 0 transition. (On x86, such a lost 0 -> 1 transition, i.e.
>> one followed directly by a 1 -> 0 one, would result in a "spurious
>> interrupt": There would be an indication that there was a lost interrupt
>> without there being a way to know which one it was.)
>
> IIUC, in this reply you are talking about when the contents written to the
> irq_pending and irqs_pending_mask bitmaps are flushed to the hardware
> registers.
>
> Originally, I understood your question to be about the case where
> vcpu_set_interrupt() is called and then vcpu_unset_interrupt() is called.
I was actually asking in more abstract terms. And I was assuming there
would be pretty direct ways for the guest to have vcpu_{,un}set_interrupt()
invoked. Looks like ...
> I am trying to understand whether such a scenario is possible.
>
> Let’s take the vtimer as an example. vcpu_set_interrupt(t->v, IRQ_VS_TIMER)
> is not called again until vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) and
> set_timer() are called in vtimer_set_timer().
>
> The opposite situation is not possible: it cannot happen that
> vcpu_set_interrupt(t->v, IRQ_VS_TIMER) is called and then immediately
> vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) is called, because
> vcpu_unset_interrupt() and set_timer() are only invoked when the guest
> has handled the timer interrupt and requested a new one.
>
> So if no interrupt flush is happening, the vcpu_set_interrupt() →
> vcpu_unset_interrupt() sequence will only update the irq_pending and
> irqs_pending_mask bitmaps, without touching the hardware registers,
> so no spurious interrupt will occur. And if an interrupt flush does
> happen, it is not possible to have a 1 -> 0 transition due to the call
> sequence I mentioned in the last two paragraphs above.
... that wasn't a correct assumption. (Partly attributed to the patch
series leaving out a number of relevant things, which makes it hard to
guess what else is left out.)
>>> By maintaining irqs_pending_mask, you can detect “this bit changed
>>> recently,” even if the final state is 0.
>>>
>>> Also, having irqs_pending_mask allows to flush interrupts without lock:
>>> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>>> {
>>> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>>> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>>>
>>> *hvip &= ~mask;
>>> *hvip |= val;
>>> }
>>> Without it I assume that we should have spinlcok around access to irqs_pending.
>> Ah yes, this would indeed be a benefit. Just that it's not quite clear to
>> me:
>>
>> *hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
>>
>> wouldn't require a lock either
>
> Because vCPU's hvip (which is stored on the stack) can't be changed concurrently
> and it's almost the one place in the code where vCPU->hvip is changed. Another
> place it is save_csrs() during context switch but it can't be called in parallel
> with the vcpu_sync_interrupts() (look below).
>
>> . What may be confusing me is that you put
>> things as if it was normal to see 1 -> 0 transitions from (virtual)
>> hardware, when I (with my x86 background) would expect 1 -> 0 transitions
>> to only occur due to software actions (End Of Interrupt), unless - see
>> above - something malfunctioned and an interrupt was lost. That (the 1 ->
>> 0 transitions) could be (guest) writes to SVIP, for example.
>>
>> Talking of which - do you really mean HVIP in the code you provided, not
>> VSVIP? So far I my understanding was that HVIP would be recording the
>> interrupts the hypervisor itself has pending (and needs to service).
>
> HVIP is correct to use here, HVIP is used to indicate virtual interrupts
> intended for VS-mode. And I think you confused HVIP with the HIP register
> which supplements the standard supervisor-level SIP register to indicate
> pending virtual supervisor (VS-level) interrupts and hypervisor-specific
> interrupts.
>
> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
> to SVIP, for example." then the correspondent HVIP (and HIP as usually
> they are aliasis of HVIP) bits will be updated. And that is why we need
> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
> +void vcpu_sync_interrupts(struct vcpu *v)
> +{
> + unsigned long hvip;
> +
> + /* Read current HVIP and VSIE CSRs */
> + v->arch.vsie = csr_read(CSR_VSIE);
> +
> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
> + hvip = csr_read(CSR_HVIP);
> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
> + {
> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
> + {
> + if ( !test_and_set_bit(IRQ_VS_SOFT,
> + &v->arch.irqs_pending_mask) )
> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
> + }
> + else
> + {
> + if ( !test_and_set_bit(IRQ_VS_SOFT,
> + &v->arch.irqs_pending_mask) )
> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
> + }
> + }
I fear I don't understand this at all. Why would the guest having set a
pending bit not result in the IRQ to be marked pending? You can't know
whether that guest write happened before or after you last touched
.irqs_pending{,mask}[]? Yet that pair of bit arrays is supposed to be
tracking the most recent update (according to how I understood earlier
explanations of yours).
As an aside - the !test_and_set_bit() can be pulled out, to the outermost
if().
Jan
On 1/14/26 4:56 PM, Jan Beulich wrote:
> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>> On 1/13/26 2:54 PM, Jan Beulich wrote:
>>> On 13.01.2026 13:51, Oleksii Kurochko wrote:
>>>> On 1/7/26 5:28 PM, Jan Beulich wrote:
>>>>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/include/asm/domain.h
>>>>>> +++ b/xen/arch/riscv/include/asm/domain.h
>>>>>> @@ -85,6 +85,22 @@ struct arch_vcpu
>>>>>> register_t vstval;
>>>>>> register_t vsatp;
>>>>>> register_t vsepc;
>>>>>> +
>>>>>> + /*
>>>>>> + * VCPU interrupts
>>>>>> + *
>>>>>> + * We have a lockless approach for tracking pending VCPU interrupts
>>>>>> + * implemented using atomic bitops. The irqs_pending bitmap represent
>>>>>> + * pending interrupts whereas irqs_pending_mask represent bits changed
>>>>>> + * in irqs_pending.
>>>>> And hence a set immediately followed by an unset is then indistinguishable
>>>>> from just an unset (or the other way around).
>>>> I think it is distinguishable with the combination of irqs_pending_mask.
>>> No. The set mask bit tells you that there was a change. But irqs_pending[]
>>> records only the most recent set / clear.
>>>
>>>>> This may not be a problem, but
>>>>> if it isn't, I think this needs explaining. Much like it is unclear why the
>>>>> "changed" state needs tracking in the first place.
>>>> It is needed to track which bits are changed, irqs_pending only represents
>>>> the current state of pending interrupts.CPU might want to react to changes
>>>> rather than the absolute state.
>>>>
>>>> Example:
>>>> - If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
>>>> to inject it into the VCPU.
>>>> - If CPU 0 sets and then clears the bit before CPU 1 reads it,
>>>> irqs_pending alone shows 0, the transition is lost.
>>> The fact there was any number of transitions is recorded in _mask[], yes,
>>> but "the transition" was still lost if we consider the "set" in your
>>> example in isolation. And it's not quite clear to me what's interesting
>>> about a 0 -> 0 transition. (On x86, such a lost 0 -> 1 transition, i.e.
>>> one followed directly by a 1 -> 0 one, would result in a "spurious
>>> interrupt": There would be an indication that there was a lost interrupt
>>> without there being a way to know which one it was.)
>> IIUC, in this reply you are talking about when the contents written to the
>> irq_pending and irqs_pending_mask bitmaps are flushed to the hardware
>> registers.
>>
>> Originally, I understood your question to be about the case where
>> vcpu_set_interrupt() is called and then vcpu_unset_interrupt() is called.
> I was actually asking in more abstract terms. And I was assuming there
> would be pretty direct ways for the guest to have vcpu_{,un}set_interrupt()
> invoked. Looks like ...
>
>> I am trying to understand whether such a scenario is possible.
>>
>> Let’s take the vtimer as an example. vcpu_set_interrupt(t->v, IRQ_VS_TIMER)
>> is not called again until vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) and
>> set_timer() are called in vtimer_set_timer().
>>
>> The opposite situation is not possible: it cannot happen that
>> vcpu_set_interrupt(t->v, IRQ_VS_TIMER) is called and then immediately
>> vcpu_unset_interrupt(t->v, IRQ_VS_TIMER) is called, because
>> vcpu_unset_interrupt() and set_timer() are only invoked when the guest
>> has handled the timer interrupt and requested a new one.
>>
>> So if no interrupt flush is happening, the vcpu_set_interrupt() →
>> vcpu_unset_interrupt() sequence will only update the irq_pending and
>> irqs_pending_mask bitmaps, without touching the hardware registers,
>> so no spurious interrupt will occur. And if an interrupt flush does
>> happen, it is not possible to have a 1 -> 0 transition due to the call
>> sequence I mentioned in the last two paragraphs above.
> ... that wasn't a correct assumption. (Partly attributed to the patch
> series leaving out a number of relevant things, which makes it hard to
> guess what else is left out.)
Then it makes sense to introduce second part of pending interrupts tracking
as part of this patch series in the next version.
Or for now not introduce tracking of pending vCPU interrupts and implement
vtimer expired handler as:
csr_set(CSR_HVIP, IRQ_VS_TIMER);
vcpu->hvip = csr_read(CSR_HVIP);
>>>> By maintaining irqs_pending_mask, you can detect “this bit changed
>>>> recently,” even if the final state is 0.
>>>>
>>>> Also, having irqs_pending_mask allows to flush interrupts without lock:
>>>> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>>>> {
>>>> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>>>> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>>>>
>>>> *hvip &= ~mask;
>>>> *hvip |= val;
>>>> }
>>>> Without it I assume that we should have spinlcok around access to irqs_pending.
>>> Ah yes, this would indeed be a benefit. Just that it's not quite clear to
>>> me:
>>>
>>> *hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
>>>
>>> wouldn't require a lock either
>> Because vCPU's hvip (which is stored on the stack) can't be changed concurrently
>> and it's almost the one place in the code where vCPU->hvip is changed. Another
>> place it is save_csrs() during context switch but it can't be called in parallel
>> with the vcpu_sync_interrupts() (look below).
>>
>>> . What may be confusing me is that you put
>>> things as if it was normal to see 1 -> 0 transitions from (virtual)
>>> hardware, when I (with my x86 background) would expect 1 -> 0 transitions
>>> to only occur due to software actions (End Of Interrupt), unless - see
>>> above - something malfunctioned and an interrupt was lost. That (the 1 ->
>>> 0 transitions) could be (guest) writes to SVIP, for example.
>>>
>>> Talking of which - do you really mean HVIP in the code you provided, not
>>> VSVIP? So far I my understanding was that HVIP would be recording the
>>> interrupts the hypervisor itself has pending (and needs to service).
>> HVIP is correct to use here, HVIP is used to indicate virtual interrupts
>> intended for VS-mode. And I think you confused HVIP with the HIP register
>> which supplements the standard supervisor-level SIP register to indicate
>> pending virtual supervisor (VS-level) interrupts and hypervisor-specific
>> interrupts.
>>
>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>> they are aliasis of HVIP) bits will be updated. And that is why we need
>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>> +void vcpu_sync_interrupts(struct vcpu *v)
>> +{
>> + unsigned long hvip;
>> +
>> + /* Read current HVIP and VSIE CSRs */
>> + v->arch.vsie = csr_read(CSR_VSIE);
>> +
>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>> + hvip = csr_read(CSR_HVIP);
>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>> + {
>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>> + {
>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>> + &v->arch.irqs_pending_mask) )
>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>> + }
>> + else
>> + {
>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>> + &v->arch.irqs_pending_mask) )
>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>> + }
>> + }
> I fear I don't understand this at all. Why would the guest having set a
> pending bit not result in the IRQ to be marked pending?
Maybe it is wrong assumption but based on the spec:
Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
bits for supervisor-level software interrupts. If implemented, SSIP is
writable in sip and may also be set to 1 by a platform-specific interrupt
controller.
and:
Interprocessor interrupts are sent to other harts by implementation-specific
means, which will ultimately cause the SSIP bit to be set in the recipient
hart’s sip register.
Meaning that sending an IPI to self by writing 1 to sip.SSIP is
well-defined. The same should be true of vsip.SSIP while in VS mode.
And so in this case if SSIP handling was delegated by hypervisor to guest by
setting hedeleg[2] = 1 we won't have an interrupt in hypervsor, and so nothing
will set a pending bit in bitmap or update hvip register from hypervisor.
( All bits except SSIP in the sip register are read-only. )
> You can't know
> whether that guest write happened before or after you last touched
> .irqs_pending{,mask}[]?
Yes, I think you are right.
On the other hand, if we are in hypervisor when vcpu_sync_interrupts() is
called it means that pCPU on which vCPU is ran and for which
vcpu_sync_interrupts() is called now executes some hypervisor things, so
guest won't able to update VSIP.SSIP for this pCPU. So nothing else will
change VSIP.SSIP and so h/w HVIP won't be changed by something and it is
okay to sync .irqs_pending{,mask} with what h/w in its HVIP.
~ Oleksii
> Yet that pair of bit arrays is supposed to be
> tracking the most recent update (according to how I understood earlier
> explanations of yours).
>
> As an aside - the !test_and_set_bit() can be pulled out, to the outermost
> if().
On 15.01.2026 10:14, Oleksii Kurochko wrote:
> On 1/14/26 4:56 PM, Jan Beulich wrote:
>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>> On 1/13/26 2:54 PM, Jan Beulich wrote:
>>>> On 13.01.2026 13:51, Oleksii Kurochko wrote:
>>>>> On 1/7/26 5:28 PM, Jan Beulich wrote:
>>>>>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>>>> By maintaining irqs_pending_mask, you can detect “this bit changed
>>>>> recently,” even if the final state is 0.
>>>>>
>>>>> Also, having irqs_pending_mask allows to flush interrupts without lock:
>>>>> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>>>>> {
>>>>> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>>>>> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>>>>>
>>>>> *hvip &= ~mask;
>>>>> *hvip |= val;
>>>>> }
>>>>> Without it I assume that we should have spinlcok around access to irqs_pending.
>>>> Ah yes, this would indeed be a benefit. Just that it's not quite clear to
>>>> me:
>>>>
>>>> *hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
>>>>
>>>> wouldn't require a lock either
>>> Because vCPU's hvip (which is stored on the stack) can't be changed concurrently
>>> and it's almost the one place in the code where vCPU->hvip is changed. Another
>>> place it is save_csrs() during context switch but it can't be called in parallel
>>> with the vcpu_sync_interrupts() (look below).
>>>
>>>> . What may be confusing me is that you put
>>>> things as if it was normal to see 1 -> 0 transitions from (virtual)
>>>> hardware, when I (with my x86 background) would expect 1 -> 0 transitions
>>>> to only occur due to software actions (End Of Interrupt), unless - see
>>>> above - something malfunctioned and an interrupt was lost. That (the 1 ->
>>>> 0 transitions) could be (guest) writes to SVIP, for example.
>>>>
>>>> Talking of which - do you really mean HVIP in the code you provided, not
>>>> VSVIP? So far I my understanding was that HVIP would be recording the
>>>> interrupts the hypervisor itself has pending (and needs to service).
>>> HVIP is correct to use here, HVIP is used to indicate virtual interrupts
>>> intended for VS-mode. And I think you confused HVIP with the HIP register
>>> which supplements the standard supervisor-level SIP register to indicate
>>> pending virtual supervisor (VS-level) interrupts and hypervisor-specific
>>> interrupts.
>>>
>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>> +{
>>> + unsigned long hvip;
>>> +
>>> + /* Read current HVIP and VSIE CSRs */
>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>> +
>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>> + hvip = csr_read(CSR_HVIP);
>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>> + {
>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>> + {
>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>> + &v->arch.irqs_pending_mask) )
>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>> + }
>>> + else
>>> + {
>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>> + &v->arch.irqs_pending_mask) )
>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>> + }
>>> + }
>> I fear I don't understand this at all. Why would the guest having set a
>> pending bit not result in the IRQ to be marked pending?
>
> Maybe it is wrong assumption but based on the spec:
> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
> bits for supervisor-level software interrupts. If implemented, SSIP is
> writable in sip and may also be set to 1 by a platform-specific interrupt
> controller.
> and:
> Interprocessor interrupts are sent to other harts by implementation-specific
> means, which will ultimately cause the SSIP bit to be set in the recipient
> hart’s sip register.
>
> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
> well-defined. The same should be true of vsip.SSIP while in VS mode.
I can't read that out of the text above. To the contrary, "will ultimately cause
the SSIP bit to be set" suggests to me that the bit is not to be set by writing
the CSR. Things still may work like this for self-IPI, but that wouldn't follow
from the quotation above.
>> You can't know
>> whether that guest write happened before or after you last touched
>> .irqs_pending{,mask}[]?
>
> Yes, I think you are right.
>
> On the other hand, if we are in hypervisor when vcpu_sync_interrupts() is
> called it means that pCPU on which vCPU is ran and for which
> vcpu_sync_interrupts() is called now executes some hypervisor things, so
> guest won't able to update VSIP.SSIP for this pCPU. So nothing else will
> change VSIP.SSIP and so h/w HVIP won't be changed by something and it is
> okay to sync .irqs_pending{,mask} with what h/w in its HVIP.
That is, vcpu_sync_interrupts() is called on every entry to the hypervisor?
Not just during context switch?
Jan
On 1/15/26 10:52 AM, Jan Beulich wrote:
> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>> On 1/13/26 2:54 PM, Jan Beulich wrote:
>>>>> On 13.01.2026 13:51, Oleksii Kurochko wrote:
>>>>>> On 1/7/26 5:28 PM, Jan Beulich wrote:
>>>>>>> On 24.12.2025 18:03, Oleksii Kurochko wrote:
>>>>>> By maintaining irqs_pending_mask, you can detect “this bit changed
>>>>>> recently,” even if the final state is 0.
>>>>>>
>>>>>> Also, having irqs_pending_mask allows to flush interrupts without lock:
>>>>>> if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
>>>>>> {
>>>>>> mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
>>>>>> val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;
>>>>>>
>>>>>> *hvip &= ~mask;
>>>>>> *hvip |= val;
>>>>>> }
>>>>>> Without it I assume that we should have spinlcok around access to irqs_pending.
>>>>> Ah yes, this would indeed be a benefit. Just that it's not quite clear to
>>>>> me:
>>>>>
>>>>> *hvip |= xchg(&v->arch.irqs_pending[0], 0UL);
>>>>>
>>>>> wouldn't require a lock either
>>>> Because vCPU's hvip (which is stored on the stack) can't be changed concurrently
>>>> and it's almost the one place in the code where vCPU->hvip is changed. Another
>>>> place it is save_csrs() during context switch but it can't be called in parallel
>>>> with the vcpu_sync_interrupts() (look below).
>>>>
>>>>> . What may be confusing me is that you put
>>>>> things as if it was normal to see 1 -> 0 transitions from (virtual)
>>>>> hardware, when I (with my x86 background) would expect 1 -> 0 transitions
>>>>> to only occur due to software actions (End Of Interrupt), unless - see
>>>>> above - something malfunctioned and an interrupt was lost. That (the 1 ->
>>>>> 0 transitions) could be (guest) writes to SVIP, for example.
>>>>>
>>>>> Talking of which - do you really mean HVIP in the code you provided, not
>>>>> VSVIP? So far I my understanding was that HVIP would be recording the
>>>>> interrupts the hypervisor itself has pending (and needs to service).
>>>> HVIP is correct to use here, HVIP is used to indicate virtual interrupts
>>>> intended for VS-mode. And I think you confused HVIP with the HIP register
>>>> which supplements the standard supervisor-level SIP register to indicate
>>>> pending virtual supervisor (VS-level) interrupts and hypervisor-specific
>>>> interrupts.
>>>>
>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>> +{
>>>> + unsigned long hvip;
>>>> +
>>>> + /* Read current HVIP and VSIE CSRs */
>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>> +
>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>> + hvip = csr_read(CSR_HVIP);
>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>> + {
>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>> + {
>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>> + &v->arch.irqs_pending_mask) )
>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>> + }
>>>> + else
>>>> + {
>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>> + &v->arch.irqs_pending_mask) )
>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>> + }
>>>> + }
>>> I fear I don't understand this at all. Why would the guest having set a
>>> pending bit not result in the IRQ to be marked pending?
>> Maybe it is wrong assumption but based on the spec:
>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>> bits for supervisor-level software interrupts. If implemented, SSIP is
>> writable in sip and may also be set to 1 by a platform-specific interrupt
>> controller.
>> and:
>> Interprocessor interrupts are sent to other harts by implementation-specific
>> means, which will ultimately cause the SSIP bit to be set in the recipient
>> hart’s sip register.
>>
>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>> well-defined. The same should be true of vsip.SSIP while in VS mode.
> I can't read that out of the text above. To the contrary, "will ultimately cause
> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
> from the quotation above.
Why not that wouldn't follow from the quotation above?
The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
or whatever other alias of the SSIP bit) and sync with what we have cached
in hypervisor.
The second quotation tells that if another CPU send IPI to CPUx then CPUx.SIP will
have SSIP bit set to 1 and again hypervisor won't know that without explicit
reading of HVIP (or VSSIP, or whatever other alias of the SSIP bit).
>
>>> You can't know
>>> whether that guest write happened before or after you last touched
>>> .irqs_pending{,mask}[]?
>> Yes, I think you are right.
>>
>> On the other hand, if we are in hypervisor when vcpu_sync_interrupts() is
>> called it means that pCPU on which vCPU is ran and for which
>> vcpu_sync_interrupts() is called now executes some hypervisor things, so
>> guest won't able to update VSIP.SSIP for this pCPU. So nothing else will
>> change VSIP.SSIP and so h/w HVIP won't be changed by something and it is
>> okay to sync .irqs_pending{,mask} with what h/w in its HVIP.
> That is, vcpu_sync_interrupts() is called on every entry to the hypervisor?
> Not just during context switch?
It is called each time before exit from the hypervisor to a guest.
~ Oleksii
On 15.01.2026 11:55, Oleksii Kurochko wrote:
> On 1/15/26 10:52 AM, Jan Beulich wrote:
>> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>>> +{
>>>>> + unsigned long hvip;
>>>>> +
>>>>> + /* Read current HVIP and VSIE CSRs */
>>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>>> +
>>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>>> + hvip = csr_read(CSR_HVIP);
>>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>>> + {
>>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>>> + {
>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>> + &v->arch.irqs_pending_mask) )
>>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>> + }
>>>>> + else
>>>>> + {
>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>> + &v->arch.irqs_pending_mask) )
>>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>> + }
>>>>> + }
>>>> I fear I don't understand this at all. Why would the guest having set a
>>>> pending bit not result in the IRQ to be marked pending?
>>> Maybe it is wrong assumption but based on the spec:
>>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>>> bits for supervisor-level software interrupts. If implemented, SSIP is
>>> writable in sip and may also be set to 1 by a platform-specific interrupt
>>> controller.
>>> and:
>>> Interprocessor interrupts are sent to other harts by implementation-specific
>>> means, which will ultimately cause the SSIP bit to be set in the recipient
>>> hart’s sip register.
>>>
>>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>>> well-defined. The same should be true of vsip.SSIP while in VS mode.
>> I can't read that out of the text above. To the contrary, "will ultimately cause
>> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
>> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
>> from the quotation above.
>
> Why not that wouldn't follow from the quotation above?
>
> The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
> what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
> or whatever other alias of the SSIP bit) and sync with what we have cached
> in hypervisor.
The bit being writable doesn't imply that it being written with 1 would also
trigger an interruption. If that's indeed the behavior, it surely is being
said elsewhere.
Jan
On 1/15/26 11:59 AM, Jan Beulich wrote:
> On 15.01.2026 11:55, Oleksii Kurochko wrote:
>> On 1/15/26 10:52 AM, Jan Beulich wrote:
>>> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>>>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>>>> +{
>>>>>> + unsigned long hvip;
>>>>>> +
>>>>>> + /* Read current HVIP and VSIE CSRs */
>>>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>>>> +
>>>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>>>> + hvip = csr_read(CSR_HVIP);
>>>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>>>> + {
>>>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>>>> + {
>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>> + }
>>>>>> + else
>>>>>> + {
>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>> + }
>>>>>> + }
>>>>> I fear I don't understand this at all. Why would the guest having set a
>>>>> pending bit not result in the IRQ to be marked pending?
>>>> Maybe it is wrong assumption but based on the spec:
>>>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>>>> bits for supervisor-level software interrupts. If implemented, SSIP is
>>>> writable in sip and may also be set to 1 by a platform-specific interrupt
>>>> controller.
>>>> and:
>>>> Interprocessor interrupts are sent to other harts by implementation-specific
>>>> means, which will ultimately cause the SSIP bit to be set in the recipient
>>>> hart’s sip register.
>>>>
>>>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>>>> well-defined. The same should be true of vsip.SSIP while in VS mode.
>>> I can't read that out of the text above. To the contrary, "will ultimately cause
>>> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
>>> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
>>> from the quotation above.
>> Why not that wouldn't follow from the quotation above?
>>
>> The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
>> what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
>> or whatever other alias of the SSIP bit) and sync with what we have cached
>> in hypervisor.
> The bit being writable doesn't imply that it being written with 1 would also
> trigger an interruption. If that's indeed the behavior, it surely is being
> said elsewhere.
According to the spec it will trap to S-mode (VS-mode in our context) if both of
the following are true: (a) either the current privilege mode is S and the SIE
bit in the sstatus register is set, or the current privilege mode has less
privilege than S-mode; and (b) bit i is set in both sip and sie.
Even without a triggering an interrupt I think it we can still lose set bit in
VSSIP register (if Im not mistaken something). If we won't do a sync of cached
hvip and h/w hvip then it could lead to the issue we lost a real SSIP bit value.
For example, guest before entering hypervisor set VSSIP.SSIP to 1 what
means what means that hip.VSSIP will be also set to 1 as:
When bit 2 of hideleg is zero, vsip.SSIP and vsie.SSIE are read-only zeros.
Else, vsip.SSIP and vsie.SSIE are aliases of hip.VSSIP and hie.VSSIE.
And so hvip.SSIP will be set to 1 as:
Bits hip.VSSIP and hie.VSSIE are the interrupt-pending and interrupt-enable
bits for VS-level software interrupts. VSSIP in hip is an alias (writable)
of the same bit in hvip.
And then if we don't sync cached hvip with h/w hvip, it could lead to then
when we will put cached hvip (which has .VSSIP set to 0) overwrite h/w hvip.VSSIP
which was set to 1.
~ Oleksii
On 15.01.2026 12:46, Oleksii Kurochko wrote:
>
> On 1/15/26 11:59 AM, Jan Beulich wrote:
>> On 15.01.2026 11:55, Oleksii Kurochko wrote:
>>> On 1/15/26 10:52 AM, Jan Beulich wrote:
>>>> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>>>>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>>>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>>>>> +{
>>>>>>> + unsigned long hvip;
>>>>>>> +
>>>>>>> + /* Read current HVIP and VSIE CSRs */
>>>>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>>>>> +
>>>>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>>>>> + hvip = csr_read(CSR_HVIP);
>>>>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>>>>> + {
>>>>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>>>>> + {
>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>> + }
>>>>>>> + else
>>>>>>> + {
>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>> + }
>>>>>>> + }
>>>>>> I fear I don't understand this at all. Why would the guest having set a
>>>>>> pending bit not result in the IRQ to be marked pending?
>>>>> Maybe it is wrong assumption but based on the spec:
>>>>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>>>>> bits for supervisor-level software interrupts. If implemented, SSIP is
>>>>> writable in sip and may also be set to 1 by a platform-specific interrupt
>>>>> controller.
>>>>> and:
>>>>> Interprocessor interrupts are sent to other harts by implementation-specific
>>>>> means, which will ultimately cause the SSIP bit to be set in the recipient
>>>>> hart’s sip register.
>>>>>
>>>>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>>>>> well-defined. The same should be true of vsip.SSIP while in VS mode.
>>>> I can't read that out of the text above. To the contrary, "will ultimately cause
>>>> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
>>>> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
>>>> from the quotation above.
>>> Why not that wouldn't follow from the quotation above?
>>>
>>> The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
>>> what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
>>> or whatever other alias of the SSIP bit) and sync with what we have cached
>>> in hypervisor.
>> The bit being writable doesn't imply that it being written with 1 would also
>> trigger an interruption. If that's indeed the behavior, it surely is being
>> said elsewhere.
>
> According to the spec it will trap to S-mode (VS-mode in our context) if both of
> the following are true: (a) either the current privilege mode is S and the SIE
> bit in the sstatus register is set, or the current privilege mode has less
> privilege than S-mode; and (b) bit i is set in both sip and sie.
That's still not it. Here is the relevant quote:
"These conditions for an interrupt trap to occur must be evaluated in a bounded
amount of time from when an interrupt becomes, or ceases to be, pending in sip,
and must also be evaluated immediately following the execution of an SRET
instruction or an explicit write to a CSR on which these interrupt trap
conditions expressly depend (including sip, sie and sstatus)."
Note in particular the "explicit write to a CSR". (Sorry, I did read that before,
but I didn't memorize it. Else I wouldn't have asked the original question.)
Jan
On 1/15/26 1:09 PM, Jan Beulich wrote:
> On 15.01.2026 12:46, Oleksii Kurochko wrote:
>> On 1/15/26 11:59 AM, Jan Beulich wrote:
>>> On 15.01.2026 11:55, Oleksii Kurochko wrote:
>>>> On 1/15/26 10:52 AM, Jan Beulich wrote:
>>>>> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>>>>>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>>>>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>>>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>>>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>>>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>>>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>>>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>>>>>> +{
>>>>>>>> + unsigned long hvip;
>>>>>>>> +
>>>>>>>> + /* Read current HVIP and VSIE CSRs */
>>>>>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>>>>>> +
>>>>>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>>>>>> + hvip = csr_read(CSR_HVIP);
>>>>>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>>>>>> + {
>>>>>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>>>>>> + {
>>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>>> + }
>>>>>>>> + else
>>>>>>>> + {
>>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>>> + }
>>>>>>>> + }
>>>>>>> I fear I don't understand this at all. Why would the guest having set a
>>>>>>> pending bit not result in the IRQ to be marked pending?
>>>>>> Maybe it is wrong assumption but based on the spec:
>>>>>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>>>>>> bits for supervisor-level software interrupts. If implemented, SSIP is
>>>>>> writable in sip and may also be set to 1 by a platform-specific interrupt
>>>>>> controller.
>>>>>> and:
>>>>>> Interprocessor interrupts are sent to other harts by implementation-specific
>>>>>> means, which will ultimately cause the SSIP bit to be set in the recipient
>>>>>> hart’s sip register.
>>>>>>
>>>>>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>>>>>> well-defined. The same should be true of vsip.SSIP while in VS mode.
>>>>> I can't read that out of the text above. To the contrary, "will ultimately cause
>>>>> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
>>>>> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
>>>>> from the quotation above.
>>>> Why not that wouldn't follow from the quotation above?
>>>>
>>>> The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
>>>> what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
>>>> or whatever other alias of the SSIP bit) and sync with what we have cached
>>>> in hypervisor.
>>> The bit being writable doesn't imply that it being written with 1 would also
>>> trigger an interruption. If that's indeed the behavior, it surely is being
>>> said elsewhere.
>> According to the spec it will trap to S-mode (VS-mode in our context) if both of
>> the following are true: (a) either the current privilege mode is S and the SIE
>> bit in the sstatus register is set, or the current privilege mode has less
>> privilege than S-mode; and (b) bit i is set in both sip and sie.
> That's still not it. Here is the relevant quote:
>
> "These conditions for an interrupt trap to occur must be evaluated in a bounded
> amount of time from when an interrupt becomes, or ceases to be, pending in sip,
> and must also be evaluated immediately following the execution of an SRET
> instruction or an explicit write to a CSR on which these interrupt trap
> conditions expressly depend (including sip, sie and sstatus)."
>
> Note in particular the "explicit write to a CSR". (Sorry, I did read that before,
> but I didn't memorize it. Else I wouldn't have asked the original question.)
Guest can do:
csr_write(CSR_SIP, SSIP);
what is an explicit write to a CSR. Or it the quote it means a different CSR?
~ Oleksii
On 15.01.2026 13:25, Oleksii Kurochko wrote:
>
> On 1/15/26 1:09 PM, Jan Beulich wrote:
>> On 15.01.2026 12:46, Oleksii Kurochko wrote:
>>> On 1/15/26 11:59 AM, Jan Beulich wrote:
>>>> On 15.01.2026 11:55, Oleksii Kurochko wrote:
>>>>> On 1/15/26 10:52 AM, Jan Beulich wrote:
>>>>>> On 15.01.2026 10:14, Oleksii Kurochko wrote:
>>>>>>> On 1/14/26 4:56 PM, Jan Beulich wrote:
>>>>>>>> On 14.01.2026 16:39, Oleksii Kurochko wrote:
>>>>>>>>> If a guest will do "That (the 1 -> 0 transitions) could be (guest) writes
>>>>>>>>> to SVIP, for example." then the correspondent HVIP (and HIP as usually
>>>>>>>>> they are aliasis of HVIP) bits will be updated. And that is why we need
>>>>>>>>> vcpu_sync_interrupts() I've mentioned in one of replies and sync VSSIP:
>>>>>>>>> +void vcpu_sync_interrupts(struct vcpu *v)
>>>>>>>>> +{
>>>>>>>>> + unsigned long hvip;
>>>>>>>>> +
>>>>>>>>> + /* Read current HVIP and VSIE CSRs */
>>>>>>>>> + v->arch.vsie = csr_read(CSR_VSIE);
>>>>>>>>> +
>>>>>>>>> + /* Sync-up HVIP.VSSIP bit changes does by Guest */
>>>>>>>>> + hvip = csr_read(CSR_HVIP);
>>>>>>>>> + if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
>>>>>>>>> + {
>>>>>>>>> + if ( hvip & BIT(IRQ_VS_SOFT, UL) )
>>>>>>>>> + {
>>>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>>>> + set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>>>> + }
>>>>>>>>> + else
>>>>>>>>> + {
>>>>>>>>> + if ( !test_and_set_bit(IRQ_VS_SOFT,
>>>>>>>>> + &v->arch.irqs_pending_mask) )
>>>>>>>>> + clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
>>>>>>>>> + }
>>>>>>>>> + }
>>>>>>>> I fear I don't understand this at all. Why would the guest having set a
>>>>>>>> pending bit not result in the IRQ to be marked pending?
>>>>>>> Maybe it is wrong assumption but based on the spec:
>>>>>>> Bits sip.SSIP and sie.SSIE are the interrupt-pending and interrupt-enable
>>>>>>> bits for supervisor-level software interrupts. If implemented, SSIP is
>>>>>>> writable in sip and may also be set to 1 by a platform-specific interrupt
>>>>>>> controller.
>>>>>>> and:
>>>>>>> Interprocessor interrupts are sent to other harts by implementation-specific
>>>>>>> means, which will ultimately cause the SSIP bit to be set in the recipient
>>>>>>> hart’s sip register.
>>>>>>>
>>>>>>> Meaning that sending an IPI to self by writing 1 to sip.SSIP is
>>>>>>> well-defined. The same should be true of vsip.SSIP while in VS mode.
>>>>>> I can't read that out of the text above. To the contrary, "will ultimately cause
>>>>>> the SSIP bit to be set" suggests to me that the bit is not to be set by writing
>>>>>> the CSR. Things still may work like this for self-IPI, but that wouldn't follow
>>>>>> from the quotation above.
>>>>> Why not that wouldn't follow from the quotation above?
>>>>>
>>>>> The first quotation tells that we can do self-IPI so VSSIP.SSIP will set to 1
>>>>> what we could miss SSIP bit if won't explicitly try to read h/w HVIP (or VSSIP,
>>>>> or whatever other alias of the SSIP bit) and sync with what we have cached
>>>>> in hypervisor.
>>>> The bit being writable doesn't imply that it being written with 1 would also
>>>> trigger an interruption. If that's indeed the behavior, it surely is being
>>>> said elsewhere.
>>> According to the spec it will trap to S-mode (VS-mode in our context) if both of
>>> the following are true: (a) either the current privilege mode is S and the SIE
>>> bit in the sstatus register is set, or the current privilege mode has less
>>> privilege than S-mode; and (b) bit i is set in both sip and sie.
>> That's still not it. Here is the relevant quote:
>>
>> "These conditions for an interrupt trap to occur must be evaluated in a bounded
>> amount of time from when an interrupt becomes, or ceases to be, pending in sip,
>> and must also be evaluated immediately following the execution of an SRET
>> instruction or an explicit write to a CSR on which these interrupt trap
>> conditions expressly depend (including sip, sie and sstatus)."
>>
>> Note in particular the "explicit write to a CSR". (Sorry, I did read that before,
>> but I didn't memorize it. Else I wouldn't have asked the original question.)
>
> Guest can do:
> csr_write(CSR_SIP, SSIP);
> what is an explicit write to a CSR. Or it the quote it means a different CSR?
That's what is meant, aiui.
Jan
© 2016 - 2026 Red Hat, Inc.