[PATCH for-4.21 01/10] x86/HPET: limit channel changes

Jan Beulich posted 10 patches 2 weeks ago
[PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Jan Beulich 2 weeks ago
Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
processing") we can still observe nested invocations of
hpet_interrupt_handler(). This is, afaict, a result of previously used
channels retaining their IRQ affinity until some other CPU re-uses them.
Such nesting is increasingly problematic with higher CPU counts, as both
handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
variable. IOW already a single level of nesting may require more stack
space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
(the maximum value presently possible).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Whether this is still worthwhile with "x86/HPET: use single, global, low-
priority vector for broadcast IRQ" isn't quite clear to me.
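
To put numbers on the stack-space argument above (a standalone, purely
illustrative program, not Xen code; the constants mirror Xen's cpumask_t
sizing on an LP64 target):

#include <stdio.h>

/* Mirror Xen's cpumask_t: a bitmap of NR_CPUS bits, rounded up to longs. */
#define NR_CPUS       16383
#define BITS_PER_LONG 64
#define BITS_TO_LONGS(bits) (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

int main(void)
{
    /* 16383 bits -> 256 unsigned longs -> 2048 bytes per cpumask_t. */
    unsigned long mask = BITS_TO_LONGS(NR_CPUS) * sizeof(unsigned long);

    /*
     * handle_hpet_broadcast() and cpumask_raise_softirq() each hold one
     * cpumask_t local, so a single handler invocation needs upwards of
     * 4k of stack; one level of nesting hence already exceeds the 8k
     * stack.
     */
    printf("cpumask_t: %lu bytes; one invocation: >%lu; nested: >%lu\n",
           mask, 2 * mask, 4 * mask);
    return 0;
}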

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -442,6 +442,8 @@ static void __init hpet_fsb_cap_lookup(v
            num_hpets_used, num_chs);
 }
 
+static DEFINE_PER_CPU(struct hpet_event_channel *, lru_channel);
+
 static struct hpet_event_channel *hpet_get_channel(unsigned int cpu)
 {
     static unsigned int next_channel;
@@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
     if ( num_hpets_used >= nr_cpu_ids )
         return &hpet_events[cpu];
 
+    /*
+     * Try the least recently used channel first.  It may still have its IRQ's
+     * affinity set to the desired CPU.  This way we also limit having multiple
+     * of our IRQs raised on the same CPU, in possibly a nested manner.
+     */
+    ch = per_cpu(lru_channel, cpu);
+    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
+    {
+        ch->cpu = cpu;
+        return ch;
+    }
+
+    /* Then look for an unused channel. */
     next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
 
-    /* try unused channel first */
     for ( i = next; i < next + num_hpets_used; i++ )
     {
         ch = &hpet_events[i % num_hpets_used];
@@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
 {
     struct irq_desc *desc = irq_to_desc(ch->msi.irq);
 
+    per_cpu(lru_channel, ch->cpu) = ch;
+
     ASSERT(!local_irq_is_enabled());
     spin_lock(&desc->lock);
     hpet_msi_mask(desc);
Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Roger Pau Monné 1 week, 6 days ago
On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> processing") we can still observe nested invocations of
> hpet_interrupt_handler(). This is, afaict, a result of previously used
> channels retaining their IRQ affinity until some other CPU re-uses them.
> Such nesting is increasingly problematic with higher CPU counts, as both
> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
> variable. IOW already a single level of nesting may require more stack
> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
> (the maximum value presently possible).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Whether this is still worthwhile with "x86/HPET: use single, global, low-
> priority vector for broadcast IRQ" isn't quite clear to me.

Seeing the rest of the series, I don't think this is necessary
anymore?  Also, the comment you have here is made stale by the patch that
uses a global vector.

Thanks, Roger.
Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Jan Beulich 1 week, 6 days ago
On 17.10.2025 11:23, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
>> processing") we can still observe nested invocations of
>> hpet_interrupt_handler(). This is, afaict, a result of previously used
>> channels retaining their IRQ affinity until some other CPU re-uses them.
>> Such nesting is increasingly problematic with higher CPU counts, as both
>> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
>> variable. IOW already a single level of nesting may require more stack
>> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
>> (the maximum value presently possible).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Whether this is still worthwhile with "x86/HPET: use single, global, low-
>> priority vector for broadcast IRQ" isn't quite clear to me.
> 
> Seeing the rest of the series, I don't think this is necessary
> anymore?  Also, the comment you have here is made stale by the patch that
> uses a global vector.

Right now I'm not quite sure, hence the remark and the patch being part of
the series. If I re-work patch 3 to avoid the mask/unmask upon affinity
changes, I think the one here can indeed be dropped.

Jan

Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Roger Pau Monné 2 weeks ago
On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> processing") we can still observe nested invocations of
> hpet_interrupt_handler(). This is, afaict, a result of previously used
> channels retaining their IRQ affinity until some other CPU re-uses them.

But the underlying problem here is not so much the affinity itself,
but the fact that the channel is not stopped after firing?

> Such nesting is increasingly problematic with higher CPU counts, as both
> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
> variable. IOW already a single level of nesting may require more stack
> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
> (the maximum value presently possible).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Whether this is still worthwhile with "x86/HPET: use single, global, low-
> priority vector for broadcast IRQ" isn't quite clear to me.
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -442,6 +442,8 @@ static void __init hpet_fsb_cap_lookup(v
>             num_hpets_used, num_chs);
>  }
>  
> +static DEFINE_PER_CPU(struct hpet_event_channel *, lru_channel);
> +
>  static struct hpet_event_channel *hpet_get_channel(unsigned int cpu)
>  {
>      static unsigned int next_channel;
> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>      if ( num_hpets_used >= nr_cpu_ids )
>          return &hpet_events[cpu];
>  
> +    /*
> +     * Try the least recently used channel first.  It may still have its IRQ's
> +     * affinity set to the desired CPU.  This way we also limit having multiple
> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> +     */
> +    ch = per_cpu(lru_channel, cpu);
> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> +    {
> +        ch->cpu = cpu;
> +        return ch;
> +    }
> +
> +    /* Then look for an unused channel. */
>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>  
> -    /* try unused channel first */
>      for ( i = next; i < next + num_hpets_used; i++ )
>      {
>          ch = &hpet_events[i % num_hpets_used];
> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>  {
>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>  
> +    per_cpu(lru_channel, ch->cpu) = ch;
> +
>      ASSERT(!local_irq_is_enabled());
>      spin_lock(&desc->lock);
>      hpet_msi_mask(desc);

Maybe I'm missing the point here, but you are resetting the MSI
affinity here anyway, so there isn't much point in attempting to
re-use the same channel when Xen still unconditionally goes through the
process of setting the affinity?

Thanks, Roger.
Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Jan Beulich 2 weeks ago
On 16.10.2025 12:24, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
>> processing") we can still observe nested invocations of
>> hpet_interrupt_handler(). This is, afaict, a result of previously used
>> channels retaining their IRQ affinity until some other CPU re-uses them.
> 
> But the underlying problem here is not so much the affinity itself,
> but the fact that the channel is not stopped after firing?

(when being detached, that is) That's the main problem here, yes. A minor
benefit is to avoid the MMIO write in hpet_msi_set_affinity(). See also
below.

Further, even when mask while detaching, the issue would re-surface after
unmasking; it's just that the window then is smaller.

>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>>      if ( num_hpets_used >= nr_cpu_ids )
>>          return &hpet_events[cpu];
>>  
>> +    /*
>> +     * Try the least recently used channel first.  It may still have its IRQ's
>> +     * affinity set to the desired CPU.  This way we also limit having multiple
>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
>> +     */
>> +    ch = per_cpu(lru_channel, cpu);
>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
>> +    {
>> +        ch->cpu = cpu;
>> +        return ch;
>> +    }
>> +
>> +    /* Then look for an unused channel. */
>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>>  
>> -    /* try unused channel first */
>>      for ( i = next; i < next + num_hpets_used; i++ )
>>      {
>>          ch = &hpet_events[i % num_hpets_used];
>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>>  {
>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>>  
>> +    per_cpu(lru_channel, ch->cpu) = ch;
>> +
>>      ASSERT(!local_irq_is_enabled());
>>      spin_lock(&desc->lock);
>>      hpet_msi_mask(desc);
> 
> Maybe I'm missing the point here, but you are resetting the MSI
> affinity here anyway, so there isn't much point in attempting to
> re-use the same channel when Xen still unconditionally goes through the
> process of setting the affinity?

While still using normal IRQs, there's still a benefit: We can re-use the
same vector (as we stay on the same CPU), and hence save an IRQ
migration (the main source of nested IRQs, according to my
observations).

We could actually do even better, by avoiding the mask/unmask pair there,
which would avoid triggering the "immediate" IRQ that I (for now) see as
the only explanation of the large number of "early" IRQs that I observe
on (at least) Intel hardware. That would require doing the msg.dest32
check earlier, but otherwise looks feasible. (Actually, the unmask would
still be necessary, in case we're called with the channel already masked.)
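
To make the idea concrete, a rough sketch of how such a rework might look
(entirely hypothetical: the helpers are those used by the current
set_channel_irq_affinity(), but the placement and exact form of the dest32
comparison are my assumptions, and the tail of the real function is
elided):

static void set_channel_irq_affinity(struct hpet_event_channel *ch)
{
    struct irq_desc *desc = irq_to_desc(ch->msi.irq);

    per_cpu(lru_channel, ch->cpu) = ch;

    ASSERT(!local_irq_is_enabled());
    spin_lock(&desc->lock);

    /*
     * Hypothetical early-out: if the MSI already targets the desired
     * CPU, skip the mask/re-program step, and thus avoid triggering
     * the "immediate" IRQ mentioned above.
     */
    if ( ch->msi.msg.dest32 != cpu_physical_id(ch->cpu) )
    {
        hpet_msi_mask(desc);
        hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
    }

    /* Unmask unconditionally, in case the channel came in masked. */
    hpet_msi_unmask(desc);

    spin_unlock(&desc->lock);
}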

Jan

Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Roger Pau Monné 2 weeks ago
On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
> On 16.10.2025 12:24, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> >> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> >> processing") we can still observe nested invocations of
> >> hpet_interrupt_handler(). This is, afaict, a result of previously used
> >> channels retaining their IRQ affinity until some other CPU re-uses them.
> > 
> > But the underlying problem here is not so much the affinity itself,
> > but the fact that the channel is not stopped after firing?
> 
> (when being detached, that is) That's the main problem here, yes. A minor
> benefit is to avoid the MMIO write in hpet_msi_set_affinity(). See also
> below.
> 
> Further, even if we mask while detaching, the issue would re-surface after
> unmasking; it's just that the window then is smaller.

Yeah, it could trigger after unmasking, but the window is smaller
there, as after enabling, the comparator will get updated to the new
deadline.

> >> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
> >>      if ( num_hpets_used >= nr_cpu_ids )
> >>          return &hpet_events[cpu];
> >>  
> >> +    /*
> >> +     * Try the least recently used channel first.  It may still have its IRQ's
> >> +     * affinity set to the desired CPU.  This way we also limit having multiple
> >> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> >> +     */
> >> +    ch = per_cpu(lru_channel, cpu);
> >> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> >> +    {
> >> +        ch->cpu = cpu;
> >> +        return ch;
> >> +    }
> >> +
> >> +    /* Then look for an unused channel. */
> >>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
> >>  
> >> -    /* try unused channel first */
> >>      for ( i = next; i < next + num_hpets_used; i++ )
> >>      {
> >>          ch = &hpet_events[i % num_hpets_used];
> >> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
> >>  {
> >>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>  
> >> +    per_cpu(lru_channel, ch->cpu) = ch;
> >> +
> >>      ASSERT(!local_irq_is_enabled());
> >>      spin_lock(&desc->lock);
> >>      hpet_msi_mask(desc);
> > 
> > Maybe I'm missing the point here, but you are resetting the MSI
> > affinity here anyway, so there isn't much point in attempting to
> > re-use the same channel when Xen still unconditionally goes through the
> > process of setting the affinity?
> 
> While still using normal IRQs, there's still a benefit: We can re-use the
> same vector (as we stay on the same CPU), and hence save an IRQ
> migration (the main source of nested IRQs, according to my
> observations).

Hm, I see.  You short-circuit all the logic in _assign_irq_vector().

> We could actually do even better, by avoiding the mask/unmask pair there,
> which would avoid triggering the "immediate" IRQ that I (for now) see as
> the only explanation of the large number of "early" IRQs that I observe
> on (at least) Intel hardware. That would require doing the msg.dest32
> check earlier, but otherwise looks feasible. (Actually, the unmask would
> still be necessary, in case we're called with the channel already masked.)

Checking with .dest32 seems a bit crude; I would possibly prefer to
slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
and avoid the call to set_channel_irq_affinity()?

Thanks, Roger.

Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Jan Beulich 2 weeks ago
On 16.10.2025 17:07, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
>> On 16.10.2025 12:24, Roger Pau Monné wrote:
>>> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>>>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>>>>      if ( num_hpets_used >= nr_cpu_ids )
>>>>          return &hpet_events[cpu];
>>>>  
>>>> +    /*
>>>> +     * Try the least recently used channel first.  It may still have its IRQ's
>>>> +     * affinity set to the desired CPU.  This way we also limit having multiple
>>>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
>>>> +     */
>>>> +    ch = per_cpu(lru_channel, cpu);
>>>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
>>>> +    {
>>>> +        ch->cpu = cpu;
>>>> +        return ch;
>>>> +    }
>>>> +
>>>> +    /* Then look for an unused channel. */
>>>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>>>>  
>>>> -    /* try unused channel first */
>>>>      for ( i = next; i < next + num_hpets_used; i++ )
>>>>      {
>>>>          ch = &hpet_events[i % num_hpets_used];
>>>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>>>>  {
>>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>>>>  
>>>> +    per_cpu(lru_channel, ch->cpu) = ch;
>>>> +
>>>>      ASSERT(!local_irq_is_enabled());
>>>>      spin_lock(&desc->lock);
>>>>      hpet_msi_mask(desc);
>>>
>>> Maybe I'm missing the point here, but you are resetting the MSI
>>> affinity here anyway, so there isn't much point in attempting to
>>> re-use the same channel when Xen still unconditionally goes through the
>>> process of setting the affinity?
>>
>> While still using normal IRQs, there's still a benefit: We can re-use the
>> same vector (as we stay on the same CPU), and hence save an IRQ
>> migration (the main source of nested IRQs, according to my
>> observations).
> 
> Hm, I see.  You short-circuit all the logic in _assign_irq_vector().
> 
>> We could actually do even better, by avoiding the mask/unmask pair there,
>> which would avoid triggering the "immediate" IRQ that I (for now) see as
>> the only explanation of the large number of "early" IRQs that I observe
>> on (at least) Intel hardware. That would require doing the msg.dest32
>> check earlier, but otherwise looks feasible. (Actually, the unmask would
>> still be necessary, in case we're called with the channel already masked.)
> 
> Checking with .dest32 seems a bit crude; I would possibly prefer to
> slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
> and avoid the call to set_channel_irq_affinity()?

That would be an always-false condition, wouldn't it? "attach" and "detach"
are used strictly in pairs, and after "detach" ch->cpu != cpu.
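
Concretely, a condensed sketch of the detach path (paraphrasing
hpet_detach_channel(); locking and the per-CPU channel bookkeeping are
elided, so treat the details as approximate):

static void hpet_detach_channel(unsigned int cpu,
                                struct hpet_event_channel *ch)
{
    cpumask_clear_cpu(cpu, ch->cpumask);

    if ( cpu != ch->cpu )
        return;                 /* not the owning CPU: nothing further */

    if ( cpumask_empty(ch->cpumask) )
    {
        /* Really detached: invalidate the owner, mark the channel unused. */
        ch->cpu = -1;
        clear_bit(HPET_EVT_USED_BIT, &ch->flags);
    }
    else
    {
        /* Merely migrated: hand ownership to a remaining user. */
        ch->cpu = cpumask_first(ch->cpumask);
        set_channel_irq_affinity(ch);
    }
}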

Jan

Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
Posted by Roger Pau Monné 2 weeks ago
On Thu, Oct 16, 2025 at 05:16:07PM +0200, Jan Beulich wrote:
> On 16.10.2025 17:07, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
> >> On 16.10.2025 12:24, Roger Pau Monné wrote:
> >>> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> >>>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
> >>>>      if ( num_hpets_used >= nr_cpu_ids )
> >>>>          return &hpet_events[cpu];
> >>>>  
> >>>> +    /*
> >>>> +     * Try the least recently used channel first.  It may still have its IRQ's
> >>>> +     * affinity set to the desired CPU.  This way we also limit having multiple
> >>>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> >>>> +     */
> >>>> +    ch = per_cpu(lru_channel, cpu);
> >>>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> >>>> +    {
> >>>> +        ch->cpu = cpu;
> >>>> +        return ch;
> >>>> +    }
> >>>> +
> >>>> +    /* Then look for an unused channel. */
> >>>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
> >>>>  
> >>>> -    /* try unused channel first */
> >>>>      for ( i = next; i < next + num_hpets_used; i++ )
> >>>>      {
> >>>>          ch = &hpet_events[i % num_hpets_used];
> >>>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
> >>>>  {
> >>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>>>  
> >>>> +    per_cpu(lru_channel, ch->cpu) = ch;
> >>>> +
> >>>>      ASSERT(!local_irq_is_enabled());
> >>>>      spin_lock(&desc->lock);
> >>>>      hpet_msi_mask(desc);
> >>>
> >>> Maybe I'm missing the point here, but you are resetting the MSI
> >>> affinity here anyway, so there isn't much point in attempting to
> >>> re-use the same channel when Xen still unconditionally goes through the
> >>> process of setting the affinity?
> >>
> >> While still using normal IRQs, there's still a benefit: We can re-use the
> >> same vector (as we stay on the same CPU), and hence save an IRQ
> >> migration (the main source of nested IRQs, according to my
> >> observations).
> > 
> > Hm, I see.  You short-circuit all the logic in _assign_irq_vector().
> > 
> >> We could actually do even better, by avoiding the mask/unmask pair there,
> >> which would avoid triggering the "immediate" IRQ that I (for now) see as
> >> the only explanation of the large number of "early" IRQs that I observe
> >> on (at least) Intel hardware. That would require doing the msg.dest32
> >> check earlier, but otherwise looks feasible. (Actually, the unmask would
> >> still be necessary, in case we're called with the channel already masked.)
> > 
> > Checking with .dest32 seems a bit crude; I would possibly prefer to
> > slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
> > and avoid the call to set_channel_irq_affinity()?
> 
> That would be an always-false condition, wouldn't it? "attach" and "detach"
> are used strictly in pairs, and after "detach" ch->cpu != cpu.

I see: we set ch->cpu = -1 if the channel is really detached as
opposed to migrated to a different CPU.  I haven't looked, but I
assume leaving the previous CPU in ch->cpu would cause issues
elsewhere.

Thanks, Roger.