Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling

Bouska, Zdenek posted 1 patch 2 years, 10 months ago
Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling
Posted by Bouska, Zdenek 2 years, 10 months ago
Hello,

I have seen a ~3 ms delay in interrupt handling on ARM64.

I have traced it down to raw_spin_lock() call in handle_irq_event() in
kernel/irq/handle.c:

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
    irqreturn_t ret;

    desc->istate &= ~IRQS_PENDING;
    irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
    raw_spin_unlock(&desc->lock);

    ret = handle_irq_event_percpu(desc);

--> raw_spin_lock(&desc->lock);
    irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
    return ret;
}

It took ~3 ms for this raw_spin_lock() to lock.

During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and
unlocks the same raw spin lock more than 1000 times:

static void irq_finalize_oneshot(struct irq_desc *desc,
                 struct irqaction *action)
{
    if (!(desc->istate & IRQS_ONESHOT) ||
        action->handler == irq_forced_secondary_handler)
        return;
again:
    chip_bus_lock(desc);
--> raw_spin_lock_irq(&desc->lock);

    /*
     * Implausible though it may be we need to protect us against
     * the following scenario:
     *
     * The thread is faster done than the hard interrupt handler
     * on the other CPU. If we unmask the irq line then the
     * interrupt can come in again and masks the line, leaves due
     * to IRQS_INPROGRESS and the irq line is masked forever.
     *
     * This also serializes the state of shared oneshot handlers
     * versus "desc->threads_oneshot |= action->thread_mask;" in
     * irq_wake_thread(). See the comment there which explains the
     * serialization.
     */
    if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
-->     raw_spin_unlock_irq(&desc->lock);
        chip_bus_sync_unlock(desc);
        cpu_relax();
        goto again;
    }
...

I have created a workaround for this problem by calling cpu_relax() 50
times after 100 failed tries. See attached patch
3ms_tx_delay_workaround.patch.
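
The backoff policy of the workaround can be modelled in userspace roughly as follows. This is only a minimal sketch: cpu_relax_stub() and backoff_after_failed_try() are hypothetical names introduced for illustration, and the stub merely counts calls instead of executing the kernel's cpu_relax(). The thresholds (100 failed tries, 50 extra relaxes) are the ones described above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's cpu_relax(); it only counts
 * calls so that the policy can be observed from userspace. */
static uint64_t relax_calls;

static void cpu_relax_stub(void)
{
	relax_calls++;
}

/* Thresholds taken from the workaround description. */
#define RETRY_THRESHOLD 100
#define EXTRA_RELAXES   50

/*
 * Called once per failed attempt, after the lock has been dropped.
 * again_count is the caller's running count of failed attempts.
 * Once more than RETRY_THRESHOLD attempts have failed, every further
 * failure adds EXTRA_RELAXES additional relax calls, which gives the
 * other CPU a window in which to take the lock.
 */
static void backoff_after_failed_try(int64_t *again_count)
{
	cpu_relax_stub();
	if (++(*again_count) > RETRY_THRESHOLD) {
		for (int i = 0; i < EXTRA_RELAXES; i++)
			cpu_relax_stub();
	}
}
```

The first 100 failures cost one relax each; from the 101st failure on, each retry costs 51.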

I have created a custom kernel module with 2 threads, one similar to
irq_finalize_oneshot(), the other similar to handle_irq_event(). I used
the latest Linux 6.3-rc3 with no added patches and confirmed that even
there qspinlocks are not fair on my ARM64 board.

I copied the qspinlock code into the module twice and added traces to only
one thread, the one which takes several ms to lock and is originally
called from handle_irq_event(). I found out that
queued_fetch_set_pending_acquire() takes those 3 ms to finish. On ARM64
queued_fetch_set_pending_acquire() is implemented as
atomic_fetch_or_acquire().

I have found out that my CPU doesn't support LSE atomic instructions and
it looks like atomic operations can be quite slow there. The assembler
code in arch/arm64/include/asm/atomic_ll_sc.h has a loop inside:

#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint) \
static __always_inline int                      \
__ll_sc_atomic_fetch_##op##name(int i, atomic_t *v)         \
{                                   \
    unsigned long tmp;                      \
    int val, result;                        \
                                    \
    asm volatile("// atomic_fetch_" #op #name "\n"          \
    "   prfm    pstl1strm, %3\n"                \
    "1: ld" #acq "xr    %w0, %3\n"              \
    "   " #asm_op " %w1, %w0, %w4\n"            \
    "   st" #rel "xr    %w2, %w1, %3\n"             \
--> "   cbnz    %w2, 1b\n"                  \
    "   " #mb                           \
    : "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)   \
    : __stringify(constraint) "r" (i)               \
    : cl);                              \
                                    \
    return result;                          \
}

Most importantly, these atomic operations seem to make one CPU dominate
the cache line so that the other is unable to take the lock. And that is
problematic in combination with the retry loop in irq_finalize_oneshot().
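
For readers less familiar with the assembler, the retry structure of such an LL/SC fetch-or can be sketched in portable C11. This is an illustrative model, not the kernel code: fetch_or_acquire_counting() and its retries parameter are hypothetical, and the weak compare-exchange only stands in for the exclusive load/store pair. How a compiler lowers it on a given ARM64 target depends on the toolchain and flags.

```c
#include <stdatomic.h>

/*
 * Model of an LL/SC-style fetch-or with acquire ordering.
 * Returns the value seen before the OR. If retries is non-NULL, it is
 * incremented once for every failed store attempt.
 */
static int fetch_or_acquire_counting(atomic_int *v, int mask,
				     unsigned long *retries)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	/*
	 * The weak compare-exchange may fail because another CPU changed
	 * the value, or spuriously. On failure 'old' is refreshed with
	 * the current value and the loop tries again. Nothing bounds the
	 * number of iterations, which is the property that matters here.
	 */
	while (!atomic_compare_exchange_weak_explicit(v, &old, old | mask,
						      memory_order_acquire,
						      memory_order_relaxed)) {
		if (retries)
			(*retries)++;
	}
	return old;
}
```

If another CPU keeps writing to the same location, the loop above has no guarantee of forward progress, whatever the lock algorithm built on top of it does.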

To confirm this I have created a small userspace program, which just calls
__ll_sc_atomic_fetch_or_acquire() from two threads. See the attached
unfair_arm64_asm_atomic_ll_sc_demonstration.tar.gz. Below you can see
that one atomic operation took up to 16 ms.

# ./contested
load thread started
evaluation thread started
new max duration: 6420 ns
new max duration: 9355 ns
new max duration: 22240 ns
new max duration: 23180 ns
new max duration: 70465 ns
new max duration: 77860 ns
new max duration: 83100 ns
new max duration: 105115 ns
new max duration: 127695 ns
new max duration: 128840 ns
new max duration: 1265595 ns
new max duration: 3713430 ns
new max duration: 3750810 ns
new max duration: 7996020 ns
new max duration: 7998890 ns
new max duration: 7999340 ns
new max duration: 7999490 ns
new max duration: 12000210 ns
new max duration: 15999700 ns
new max duration: 16000000 ns
new max duration: 16000030 ns

So I confirmed that atomic operations from
arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
contended by a second CPU.
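
The shape of such a measurement can be sketched as below. This is not the attached program: it is a minimal model that uses C11 atomic_fetch_or_explicit() rather than the kernel's inline assembler, and the names shared, stop_load, load_thread() and measure_max_ns() are hypothetical. Whether it reproduces the effect depends on the CPU and on how the compiler implements the atomic.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static atomic_int shared;     /* the contended word */
static atomic_bool stop_load; /* tells the load thread to exit */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Load thread: hammer the shared word until told to stop. */
static void *load_thread(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&stop_load, memory_order_relaxed))
		atomic_fetch_or_explicit(&shared, 1, memory_order_acquire);
	return NULL;
}

/*
 * Evaluation side: time each atomic operation while the load thread
 * runs and return the longest duration seen, in nanoseconds.
 * Returns UINT64_MAX if the load thread could not be started.
 */
static uint64_t measure_max_ns(unsigned long iterations)
{
	pthread_t load;
	uint64_t max = 0;

	assert(iterations > 0);
	atomic_store(&stop_load, false);
	if (pthread_create(&load, NULL, load_thread, NULL) != 0)
		return UINT64_MAX;

	for (unsigned long i = 0; i < iterations; i++) {
		uint64_t t0 = now_ns();

		atomic_fetch_or_explicit(&shared, 2, memory_order_acquire);
		uint64_t d = now_ns() - t0;
		if (d > max)
			max = d;
	}

	atomic_store(&stop_load, true);
	pthread_join(load, NULL);
	return max;
}
```

A caller would invoke measure_max_ns() with a large iteration count and print the result. Pinning the two threads to different CPUs is left out here and would likely be needed for a meaningful comparison.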

Do you think it is possible to create a fair qspinlock implementation on
top of the atomic instructions supported by ARMv8.0 (no LSE atomic
instructions) without compromising performance in the uncontended case?
For example, ARM64 could have a custom queued_fetch_set_pending_acquire()
implementation, as x86 has in arch/x86/include/asm/qspinlock.h. Is the
retry loop in irq_finalize_oneshot() OK together with the current ARM64
cpu_relax() implementation on processors with no LSE atomic instructions?

I reproduced the real-life scenario of the TX delay only with the ICSSG
network driver (not yet merged to mainline) [1]. It was with kernel 5.10
with patches, CONFIG_PREEMPT_RT and custom ICSSG firmware on a Texas
Instruments AM65x IDK [2] with ARM Cortex-A53. This custom setup comes
with a high interrupt load.

[1] https://lore.kernel.org/all/20220406094358.7895-1-p-mohan@ti.com/
[2] https://www.ti.com/tool/TMDX654IDKEVM

With best regards,
Zdenek Bouska

--
Siemens, s.r.o
Siemens Advanta Development

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 8ce75495e04f..1f976f36cd56 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1067,51 +1067,59 @@ static int irq_wait_for_interrupt(struct irqaction *action)
 		}
 		schedule();
 	}
 }
 
 /*
  * Oneshot interrupts keep the irq line masked until the threaded
  * handler finished. unmask if the interrupt has not been disabled and
  * is marked MASKED.
  */
 static void irq_finalize_oneshot(struct irq_desc *desc,
 				 struct irqaction *action)
 {
+	int i;
+	int64_t again_count = 0;
 	if (!(desc->istate & IRQS_ONESHOT) ||
 	    action->handler == irq_forced_secondary_handler)
 		return;
 again:
 	chip_bus_lock(desc);
 	raw_spin_lock_irq(&desc->lock);
 
 	/*
 	 * Implausible though it may be we need to protect us against
 	 * the following scenario:
 	 *
 	 * The thread is faster done than the hard interrupt handler
 	 * on the other CPU. If we unmask the irq line then the
 	 * interrupt can come in again and masks the line, leaves due
 	 * to IRQS_INPROGRESS and the irq line is masked forever.
 	 *
 	 * This also serializes the state of shared oneshot handlers
 	 * versus "desc->threads_oneshot |= action->thread_mask;" in
 	 * irq_wake_thread(). See the comment there which explains the
 	 * serialization.
 	 */
 	if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
 		raw_spin_unlock_irq(&desc->lock);
 		chip_bus_sync_unlock(desc);
 		cpu_relax();
+		++again_count;
+		if (again_count > 100) {
+			for (i=0; i < 50; ++i) {
+				cpu_relax();
+			}
+		}
 		goto again;
 	}
 
 	/*
 	 * Now check again, whether the thread should run. Otherwise
 	 * we would clear the threads_oneshot bit of this thread which
 	 * was just set.
 	 */
 	if (test_bit(IRQTF_RUNTHREAD, &action->thread_flags))
 		goto out_unlock;
 
 	desc->threads_oneshot &= ~action->thread_mask;
 
Re: Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling
Posted by Catalin Marinas 2 years, 10 months ago
On Fri, Mar 24, 2023 at 08:43:38AM +0000, Bouska, Zdenek wrote:
> I have seen ~3 ms delay in interrupt handling on ARM64.
> 
> I have traced it down to raw_spin_lock() call in handle_irq_event() in
> kernel/irq/handle.c:
> 
> irqreturn_t handle_irq_event(struct irq_desc *desc)
> {
>     irqreturn_t ret;
> 
>     desc->istate &= ~IRQS_PENDING;
>     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>     raw_spin_unlock(&desc->lock);
> 
>     ret = handle_irq_event_percpu(desc);
> 
> --> raw_spin_lock(&desc->lock);
>     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>     return ret;
> }
> 
> It took ~3 ms for this raw_spin_lock() to lock.

That's quite large indeed.

> During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and
> unlocks the same raw spin lock more than 1000 times:
> 
> static void irq_finalize_oneshot(struct irq_desc *desc,
>                  struct irqaction *action)
> {
>     if (!(desc->istate & IRQS_ONESHOT) ||
>         action->handler == irq_forced_secondary_handler)
>         return;
> again:
>     chip_bus_lock(desc);
> --> raw_spin_lock_irq(&desc->lock);
> 
>     /*
>      * Implausible though it may be we need to protect us against
>      * the following scenario:
>      *
>      * The thread is faster done than the hard interrupt handler
>      * on the other CPU. If we unmask the irq line then the
>      * interrupt can come in again and masks the line, leaves due
>      * to IRQS_INPROGRESS and the irq line is masked forever.
>      *
>      * This also serializes the state of shared oneshot handlers
>      * versus "desc->threads_oneshot |= action->thread_mask;" in
>      * irq_wake_thread(). See the comment there which explains the
>      * serialization.
>      */
>     if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
> -->     raw_spin_unlock_irq(&desc->lock);
>         chip_bus_sync_unlock(desc);
>         cpu_relax();
>         goto again;
>     }

So this path is hammering the desc->lock location and another CPU cannot
change it. As you found, the problem is not the spinlock algorithm but
the atomic primitives. The LDXR/STXR constructs on arm64 are known to
have this issue with STXR failing indefinitely. raw_spin_unlock() simply
does an STLR and this clears the exclusive monitor that the other CPU
may have set with LDXR but before the STXR. The queued spinlock only
provides fairness if the CPU manages to get in the queue.

> So I confirmed that atomic operations from
> arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
> contested from second CPU.
> 
> Do you think that it is possible to create fair qspinlock implementation
> on top of atomic instructions supported by ARM64 version 8 (no LSE atomic
> instructions) without compromising performance in the uncontested case?
> For example ARM64 could have custom queued_fetch_set_pending_acquire
> implementation same as x86 has in arch/x86/include/asm/qspinlock.h. Is the
> retry loop in irq_finalize_oneshot() ok together with the current ARM64
> cpu_relax() implementation for processor with no LSE atomic instructions?

So is the queued_fetch_set_pending_acquire() where it gets stuck or the
earlier atomic_try_cmpxchg_acquire() before entering on the slow path? I
guess both can fail in a similar way.

A longer cpu_relax() here would improve things (on arm64 this function
is a no-op) but maybe Thomas or Will have a better idea.

-- 
Catalin

Re: Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling
Posted by Bouska, Zdenek 2 years, 10 months ago
>> So I confirmed that atomic operations from
>> arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
>> contested from second CPU.
>> 
>> Do you think that it is possible to create fair qspinlock implementation
>> on top of atomic instructions supported by ARM64 version 8 (no LSE atomic
>> instructions) without compromising performance in the uncontested case?
>> For example ARM64 could have custom queued_fetch_set_pending_acquire
>> implementation same as x86 has in arch/x86/include/asm/qspinlock.h. Is the
>> retry loop in irq_finalize_oneshot() ok together with the current ARM64
>> cpu_relax() implementation for processor with no LSE atomic instructions?
>
>So is the queued_fetch_set_pending_acquire() where it gets stuck or the
>earlier atomic_try_cmpxchg_acquire() before entering on the slow path? I
>guess both can fail in a similar way.

For me it was stuck on queued_fetch_set_pending_acquire().

Zdenek Bouska

--
Siemens, s.r.o
Siemens Advanta Development

Re: Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling
Posted by Will Deacon 2 years, 10 months ago
On Fri, Mar 24, 2023 at 05:01:28PM +0000, Catalin Marinas wrote:
> On Fri, Mar 24, 2023 at 08:43:38AM +0000, Bouska, Zdenek wrote:
> > I have seen ~3 ms delay in interrupt handling on ARM64.
> > 
> > I have traced it down to raw_spin_lock() call in handle_irq_event() in
> > kernel/irq/handle.c:
> > 
> > irqreturn_t handle_irq_event(struct irq_desc *desc)
> > {
> >     irqreturn_t ret;
> > 
> >     desc->istate &= ~IRQS_PENDING;
> >     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >     raw_spin_unlock(&desc->lock);
> > 
> >     ret = handle_irq_event_percpu(desc);
> > 
> > --> raw_spin_lock(&desc->lock);
> >     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >     return ret;
> > }
> > 
> > It took ~3 ms for this raw_spin_lock() to lock.
> 
> That's quite a large indeed.
> 
> > During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and
> > unlocks the same raw spin lock more than 1000 times:
> > 
> > static void irq_finalize_oneshot(struct irq_desc *desc,
> >                  struct irqaction *action)
> > {
> >     if (!(desc->istate & IRQS_ONESHOT) ||
> >         action->handler == irq_forced_secondary_handler)
> >         return;
> > again:
> >     chip_bus_lock(desc);
> > --> raw_spin_lock_irq(&desc->lock);
> > 
> >     /*
> >      * Implausible though it may be we need to protect us against
> >      * the following scenario:
> >      *
> >      * The thread is faster done than the hard interrupt handler
> >      * on the other CPU. If we unmask the irq line then the
> >      * interrupt can come in again and masks the line, leaves due
> >      * to IRQS_INPROGRESS and the irq line is masked forever.
> >      *
> >      * This also serializes the state of shared oneshot handlers
> >      * versus "desc->threads_oneshot |= action->thread_mask;" in
> >      * irq_wake_thread(). See the comment there which explains the
> >      * serialization.
> >      */
> >     if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
> > -->     raw_spin_unlock_irq(&desc->lock);
> >         chip_bus_sync_unlock(desc);
> >         cpu_relax();
> >         goto again;
> >     }
> 
> So this path is hammering the desc->lock location and another CPU cannot
> change it. As you found, the problem is not the spinlock algorithm but
> the atomic primitives. The LDXR/STXR constructs on arm64 are known to
> have this issue with STXR failing indefinitely. raw_spin_unlock() simply
> does an STLR and this clears the exclusive monitor that the other CPU
> may have set with LDXR but before the STXR. The queued spinlock only
> provides fairness if the CPU manages to get in the queue.
> 
> > So I confirmed that atomic operations from
> > arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
> > contested from second CPU.
> > 
> > Do you think that it is possible to create fair qspinlock implementation
> > on top of atomic instructions supported by ARM64 version 8 (no LSE atomic
> > instructions) without compromising performance in the uncontested case?
> > For example ARM64 could have custom queued_fetch_set_pending_acquire
> > implementation same as x86 has in arch/x86/include/asm/qspinlock.h. Is the
> > retry loop in irq_finalize_oneshot() ok together with the current ARM64
> > cpu_relax() implementation for processor with no LSE atomic instructions?
> 
> So is the queued_fetch_set_pending_acquire() where it gets stuck or the
> earlier atomic_try_cmpxchg_acquire() before entering on the slow path? I
> guess both can fail in a similar way.
> 
> A longer cpu_relax() here would improve things (on arm64 this function
> is a no-op) but maybe Thomas or Will have a better idea.

I had a pretty gross cpu_relax() implementation using wfe somewhere on
LKML, so you could try that if you can dig it up.

Generally though, LDXR/STXR and realtime don't mix super well.

Will

Re: Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling
Posted by Bouska, Zdenek 2 years, 10 months ago
> > A longer cpu_relax() here would improve things (on arm64 this function
> > is a no-op) but maybe Thomas or Will have a better idea.
>
> I had a pretty gross cpu_relax() implementation using wfe somewhere on
> LKML, so you could try that if you can dig it up.

Do you mean the cpu_relax() implementation from this email [1] from
Fri, 28 Jul 2017?

I tried to rebase it on recent Linux, but it did not even boot for me.
Only this was printed:
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[    0

But cpu_relax() implementation from [1] fixes my problem if I use it
only in irq_finalize_oneshot().

[1] https://lore.kernel.org/lkml/20170728092831.GA24839@arm.com/

Zdenek Bouska

--
Siemens, s.r.o
Siemens Advanta Development