usb: gadget: u_ether: prevent deadlock under RT

[RFC PATCH] usb: gadget: u_ether: prevent deadlock under RT

Posted by Ralph Siemsen 11 months, 3 weeks ago

A deadlock can be readily triggered when using NCM gadget with the
Cadence cdns3 USB driver, under heavy traffic from "iperf3 --bidir".
It has been observed under 6.1, 6.6 and 6.12 kernel versions, but
only on PREEMPT_RT. Once deadlocked, even magicsysrq has no effect.
However, there is periodic output from the rcu stall detector:

[   71.105941] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   71.105966] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-1): P125/4:b..l
[   71.105992] rcu:     (detected by 1, t=5252 jiffies, g=-515, q=7 ncpus=2)
[   71.106003] task:irq/507-s-f4000 state:D stack:0     pid:125   tgid:125   ppid:2      flags:0x00000008
[   71.106018] Call trace:
[   71.106022]  __switch_to+0xf4/0x158
[   71.106046]  __schedule+0x2b4/0x920
[   71.106055]  schedule_rtlock+0x24/0x50
[   71.106064]  rtlock_slowlock_locked+0x348/0xcb8
[   71.106077]  rt_spin_lock+0x88/0xb8
[   71.106086]  eth_start_xmit+0x30/0x1490 [u_ether]        /*****/
[   71.106112]  ncm_tx_timeout+0x2c/0x50 [usb_f_ncm]
[   71.106131]  __hrtimer_run_queues+0x180/0x378
[   71.106143]  hrtimer_run_softirq+0x90/0x100
[   71.106151]  handle_softirqs.isra.0+0x14c/0x360
[   71.106165]  __local_bh_enable_ip+0x104/0x118
[   71.106177]  __netdev_alloc_skb+0x1e0/0x210
[   71.106192]  ncm_unwrap_ntb+0x1ec/0x528 [usb_f_ncm]
[   71.106206]  rx_complete+0x120/0x288 [u_ether]           /*****/
[   71.106221]  usb_gadget_giveback_request+0x34/0xf8
[   71.106236]  cdns3_gadget_giveback+0xe4/0x2d0 [cdns3]
[   71.106286]  cdns3_transfer_completed+0x3b0/0x630 [cdns3]
[   71.106320]  cdns3_device_thread_irq_handler+0x8b8/0xd18 [cdns3]
[   71.106353]  irq_thread_fn+0x34/0xb8
[   71.106364]  irq_thread+0x180/0x2f0
[   71.106374]  kthread+0x104/0x118
[   71.106384]  ret_from_fork+0x10/0x20

The deadlock occurs because eth_start_xmit() and rx_complete() both
acquire the same spinlock in the same instance of struct eth_dev.
The nested call occurs because rx_complete() calls __netdev_alloc_skb()
which performs a brief local_bh_disable/enable() sequence.

Prevent the deadlock by disabling softirq in rx_complete(), with the
same scope as the spinlock. With this fix, no deadlocks are observed
over many hours of continuous testing at USB 2.0 speed (480 Mbit/s).

Signed-off-by: Ralph Siemsen <ralph.siemsen@linaro.org>
---
Discussion items:

0) This is somewhat similar to the recent discussion
https://lore.kernel.org/linux-rt-devel/20250212123302.0f620f23@gandalf.local.home/
and my solution is to "sprinkle local_bh_disable() around",
which is clearly not optimal. Hence the RFC on this patch.

1) There are several more places using the same spinlock, for example
the gether_suspend() and gether_resume() functions. I'm not using power
management, but I wonder if there might be more deadlocks lurking?

2) By keeping softirq disabled for a longer duration, this patch could
potentially increase the RT latency. I tried to quantify this using
cyclictest, but I cannot get a baseline for comparison, since it
deadlocks almost immediately when the USB is active. Other suggestions?

3) The fix is in generic u_ether.c code, to match the scope of the
spinlock. Alternatively, it could be done in cdns3 specific code,
such as in cdns3_device_thread_irq_handler(). I'm not sure if the
same problem exists in other USB drivers?
---
 drivers/usb/gadget/function/u_ether.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/gadget/function/u_ether.c b/drivers/usb/gadget/function/u_ether.c
index 09e2838917e2..dc511c01b741 100644
--- a/drivers/usb/gadget/function/u_ether.c
+++ b/drivers/usb/gadget/function/u_ether.c
@@ -233,6 +233,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req)
 		if (dev->unwrap) {
 			unsigned long	flags;
 
+			local_bh_disable();
 			spin_lock_irqsave(&dev->lock, flags);
 			if (dev->port_usb) {
 				status = dev->unwrap(dev->port_usb,
@@ -243,6 +244,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req)
 				status = -ENOTCONN;
 			}
 			spin_unlock_irqrestore(&dev->lock, flags);
+			local_bh_enable();
 		} else {
 			skb_queue_tail(&dev->rx_frames, skb);
 		}
-- 
2.45.2.121.gc2b3f2b3cd

Re: [RFC PATCH] usb: gadget: u_ether: prevent deadlock under RT

Posted by Sebastian Andrzej Siewior 11 months, 2 weeks ago

On 2025-02-19 13:15:52 [-0500], Ralph Siemsen wrote:
> [   71.106003] task:irq/507-s-f4000 state:D stack:0     pid:125   tgid:125   ppid:2      flags:0x00000008
> [   71.106018] Call trace:
> [   71.106022]  __switch_to+0xf4/0x158
> [   71.106046]  __schedule+0x2b4/0x920
> [   71.106055]  schedule_rtlock+0x24/0x50
> [   71.106064]  rtlock_slowlock_locked+0x348/0xcb8
> [   71.106077]  rt_spin_lock+0x88/0xb8
> [   71.106086]  eth_start_xmit+0x30/0x1490 [u_ether]        /*****/
> [   71.106112]  ncm_tx_timeout+0x2c/0x50 [usb_f_ncm]
Explicit softirq timer as per HRTIMER_MODE_REL_SOFT

> [   71.106131]  __hrtimer_run_queues+0x180/0x378
> [   71.106143]  hrtimer_run_softirq+0x90/0x100
> [   71.106151]  handle_softirqs.isra.0+0x14c/0x360
> [   71.106165]  __local_bh_enable_ip+0x104/0x118
> [   71.106177]  __netdev_alloc_skb+0x1e0/0x210
> [   71.106192]  ncm_unwrap_ntb+0x1ec/0x528 [usb_f_ncm]
> [   71.106206]  rx_complete+0x120/0x288 [u_ether]           /*****/
Network stack

> [   71.106221]  usb_gadget_giveback_request+0x34/0xf8
> [   71.106236]  cdns3_gadget_giveback+0xe4/0x2d0 [cdns3]
Returns URB to usb-core.

> [   71.106286]  cdns3_transfer_completed+0x3b0/0x630 [cdns3]
> [   71.106320]  cdns3_device_thread_irq_handler+0x8b8/0xd18 [cdns3]

Threaded interrupt handler. Not forced-threaded but voluntary threaded
due to devm_request_threaded_irq() usage.

> [   71.106353]  irq_thread_fn+0x34/0xb8
> [   71.106364]  irq_thread+0x180/0x2f0
> [   71.106374]  kthread+0x104/0x118
> [   71.106384]  ret_from_fork+0x10/0x20
> 
> The deadlock occurs because eth_start_xmit() and rx_complete() both
> acquire the same spinlock in the same instance of struct eth_dev.
> The nested call occurs because rx_complete() calls __netdev_alloc_skb()
> which performs a brief local_bh_disable/enable() sequence.
…

Based on the backtrace the problem is within the cdns3. The driver
acquires at the beginning of its threaded routine
	spin_lock_irqsave(&priv_dev->lock, flags);

and then before returning the URB it does
	         spin_unlock(&priv_dev->lock);
                 usb_gadget_giveback_request()

so the lock is dropped but the interrupts are still disabled. This makes
me wonder why using threaded interrupts at all since interrupts are
disabled for the whole routine but then others do the same.

If you look at dwc3_thread_interrupt() they have local_bh_disable()/
enable() before acquiring the lock and this is what I would suggest
doing here, too. The NCM is probably not the only one affected but
everything doing network that may since it may recourse into softirq.

Sebastian

Re: [RFC PATCH] usb: gadget: u_ether: prevent deadlock under RT

Posted by Ralph Siemsen 11 months, 1 week ago

Hi Sebastian,

Thanks for your reply!

On Wed, Feb 26, 2025 at 3:29 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> Based on the backtrace the problem is within the cdns3. The driver
> acquires at the beginning of its threaded routine
>         spin_lock_irqsave(&priv_dev->lock, flags);
>
> and then before returning the URB it does
>                  spin_unlock(&priv_dev->lock);
>                  usb_gadget_giveback_request()
>
> so the lock is dropped but the interrupts are still disabled. This makes
> me wonder why using threaded interrupts at all since interrupts are
> disabled for the whole routine but then others do the same.

I also wondered why threaded interrupts are being used, but I don't
know the reason.

> If you look at dwc3_thread_interrupt() they have local_bh_disable()/
> enable() before acquiring the lock and this is what I would suggest
> doing here, too. The NCM is probably not the only one affected but
> everything doing network that may since it may recourse into softirq.

Thanks for the suggestion. I had also thought of doing this. I also
noticed that the "cdnsp" driver has a similar fix in commit
58f2fcb3a845 ("usb: cdnsp: Fix deadlock issue during using NCM
gadget"), which was also backported to stable branches. So I will
prepare a v2 patch to do the same for "cdns3" driver.

Regards
Ralph