drivers/usb/gadget/function/u_ether.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
netif_rx() now disables bottom halves, which causes the USB gadget to be
unable to receive frames if the interface is not brought up quickly enough
after being created by the driver (a bug confirmed on AM3352 SoC).
Replacing netif_rx() with __netif_rx() restores the old behavior and fixes
the bug. This can be done since rx_callback() is called from the interrupt
context.
Fixes: baebdf48c360 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
Cc: stable@vger.kernel.org
Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
---
v2 -> v3: Added cc stable tag
v1 -> v2: Added Fixes tag and corrected Signed-off-by tag
drivers/usb/gadget/function/u_ether.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/gadget/function/u_ether.c b/drivers/usb/gadget/function/u_ether.c
index 4bb0553da658..fd7e483b4a48 100644
--- a/drivers/usb/gadget/function/u_ether.c
+++ b/drivers/usb/gadget/function/u_ether.c
@@ -266,7 +266,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req)
/* no buffer copies needed, unless hardware can't
* use skb buffers.
*/
- status = netif_rx(skb2);
+ status = __netif_rx(skb2);
next_frame:
skb2 = skb_dequeue(&dev->rx_frames);
}
--
2.30.2
On 2024-09-17 15:56:33 [+0200], Hubert Wiśniewski wrote: > netif_rx() now disables bottom halves, which causes the USB gadget to be > unable to receive frames if the interface is not brought up quickly enough > after being created by the driver (a bug confirmed on AM3352 SoC). > > Replacing netif_rx() with __netif_rx() restores the old behavior and fixes > the bug. This can be done since rx_callback() is called from the interrupt > context. > > Fixes: baebdf48c360 ("net: dev: Makes sure netif_rx() can be invoked in any context.") > Cc: stable@vger.kernel.org > Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com> Now that I see v3, my v2 question stands. This supposed only to disable BH if invoked from from non-interrupt. hardirq and softirq should be good. A backtrace would be nice and further explanation how this becomes a problem. Also lockdep should complain at some point. Sebastian
On Tue, 2024-09-17 at 16:36 +0200, Sebastian Andrzej Siewior wrote: > On 2024-09-17 15:56:33 [+0200], Hubert Wiśniewski wrote: > > netif_rx() now disables bottom halves, which causes the USB gadget to be > > unable to receive frames if the interface is not brought up quickly enough > > after being created by the driver (a bug confirmed on AM3352 SoC). > > > > Replacing netif_rx() with __netif_rx() restores the old behavior and fixes > > the bug. This can be done since rx_callback() is called from the interrupt > > context. > > > > Fixes: baebdf48c360 ("net: dev: Makes sure netif_rx() can be invoked in any context.") > > Cc: stable@vger.kernel.org > > Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com> > > Now that I see v3, my v2 question stands. > This supposed only to disable BH if invoked from from non-interrupt. > hardirq and softirq should be good. A backtrace would be nice and further > explanation how this becomes a problem. Also lockdep should complain at > some point. > > Sebastian This lockup seems to occur only on AM335x SoCs. If the interface is brought up immediately after it is created: # modprobe g_ether; ip link set usb0 up it works fine. But if there is some delay: # modprobe g_ether; sleep 5; ip link set usb0 up the interface is unable to receive any frames from the USB host (but the outgoing frames are sent correctly). This becomes a problem when the g_ether module is loaded early in the boot process or it is built in, and the interface is configured later (e.g. by ifupdown). The same thing happens when using configfs to create and configure an Ethernet gadget. I have built the kernel with CONFIG_PROVE_LOCKING=y and got the following messages upon bringing the interface up: Backtrace (timestamps stripped): WARNING: CPU: 0 PID: 86 at kernel/softirq.c:362 __local_bh_enable_ip+0x118/0x198 CPU: 0 UID: 0 PID: 86 Comm: ip Not tainted 6.11.0 #3 Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x68/0x88 dump_stack_lvl from __warn+0x70/0x1ac __warn from warn_slowpath_fmt+0x12c/0x1c4 warn_slowpath_fmt from __local_bh_enable_ip+0x118/0x198 __local_bh_enable_ip from netif_rx+0x118/0x208 netif_rx from rx_complete+0x140/0x27c rx_complete from musb_g_giveback+0xf0/0x1d8 musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10 musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498 musb_gadget_queue from usb_ep_queue+0x38/0x134 usb_ep_queue from rx_submit+0xdc/0x1cc rx_submit from rx_fill+0x7c/0xa0 rx_fill from eth_start+0x30/0x54 eth_start from eth_open+0x40/0x80 eth_open from __dev_open+0xe0/0x174 __dev_open from __dev_change_flags+0x160/0x1d0 __dev_change_flags from dev_change_flags+0x1c/0x58 dev_change_flags from devinet_ioctl+0x6d0/0x884 devinet_ioctl from inet_ioctl+0x1ac/0x2bc inet_ioctl from sock_ioctl+0x2c0/0x3b8 sock_ioctl from sys_ioctl+0x194/0xf88 sys_ioctl from ret_fast_syscall+0x0/0x1c Exception stack(0xe1485fa8 to 0xe1485ff0) 5fa0: 000b3630 00000001 00000003 00008914 be802cd8 be802c90 5fc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be802ed4 5fe0: 000c74cc be802c80 00028000 b6e7d14c irq event stamp: 2581 hardirqs last enabled at (2579): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48 hardirqs last disabled at (2580): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68 softirqs last enabled at (2558): [<c0e727b0>] __dev_change_flags+0x7c/0x1d0 softirqs last disabled at (2581): [<c0e68dd8>] netif_rx+0xa4/0x208 Lockdep: ================================ WARNING: inconsistent lock state 6.11.0 #3 Tainted: G W -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. ip/86 [HC0[0]:SC0[0]:HE1:SE1] takes: c3e54050 (&musb->lock){?.-.}-{2:2}, at: musb_g_giveback+0xf8/0x1d8 {IN-HARDIRQ-W} state was registered at: lock_acquire+0x108/0x358 _raw_spin_lock_irqsave+0x4c/0x68 dsps_interrupt+0x28/0x278 __handle_irq_event_percpu+0xa0/0x2f4 handle_irq_event_percpu+0xc/0x40 handle_irq_event+0x38/0xcc handle_level_irq+0xb4/0x13c handle_irq_desc+0x1c/0x2c generic_handle_arch_irq+0x2c/0x64 call_with_stack+0x18/0x20 __irq_svc+0x9c/0xbc console_flush_all+0x25c/0x5c0 console_unlock+0x80/0x114 vprintk_emit+0x248/0x354 dev_vprintk_emit+0x110/0x144 dev_printk_emit+0x28/0x50 __dev_printk+0x74/0x90 _dev_warn+0x3c/0x68 _regulator_get+0x1f8/0x31c sdhci_omap_regulator_get_caps+0x8/0x84 sdhci_omap_probe+0x208/0x788 platform_probe+0x58/0xb8 really_probe+0xc4/0x28c __driver_probe_device+0x84/0x194 driver_probe_device+0x30/0xc8 __device_attach_driver+0xa4/0xe0 bus_for_each_drv+0x80/0xd0 __device_attach_async_helper+0xa8/0xdc async_run_entry_fn+0x20/0xb4 process_scheduled_works+0x254/0x6e8 worker_thread+0x13c/0x340 kthread+0xf4/0x114 ret_from_fork+0x14/0x24 irq event stamp: 2649 hardirqs last enabled at (2649): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48 hardirqs last disabled at (2648): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68 softirqs last enabled at (2644): [<c014b410>] handle_softirqs+0x260/0x45c softirqs last disabled at (2613): [<c1385f28>] call_with_stack+0x18/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&musb->lock); <Interrupt> lock(&musb->lock); *** DEADLOCK *** 1 lock held by ip/86: #0: c1d60ca4 (rtnl_mutex){+.+.}-{3:3}, at: devinet_ioctl+0xc4/0x884 stack backtrace: CPU: 0 UID: 0 PID: 86 Comm: ip Tainted: G W 6.11.0 #3 Tainted: [W]=WARN Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x68/0x88 dump_stack_lvl from mark_lock.part.17+0x3cc/0x440 mark_lock.part.17 from __lock_acquire+0x44c/0x22b0 __lock_acquire from lock_acquire+0x108/0x358 lock_acquire from _raw_spin_lock+0x38/0x48 _raw_spin_lock from musb_g_giveback+0xf8/0x1d8 musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10 musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498 musb_gadget_queue from usb_ep_queue+0x38/0x134 usb_ep_queue from rx_submit+0xdc/0x1cc rx_submit from rx_fill+0x7c/0xa0 rx_fill from eth_start+0x30/0x54 eth_start from eth_open+0x40/0x80 eth_open from __dev_open+0xe0/0x174 __dev_open from __dev_change_flags+0x160/0x1d0 __dev_change_flags from dev_change_flags+0x1c/0x58 dev_change_flags from devinet_ioctl+0x6d0/0x884 devinet_ioctl from inet_ioctl+0x1ac/0x2bc inet_ioctl from sock_ioctl+0x2c0/0x3b8 sock_ioctl from sys_ioctl+0x194/0xf88 sys_ioctl from ret_fast_syscall+0x0/0x1c Exception stack(0xe1485fa8 to 0xe1485ff0) 5fa0: 000b3630 00000001 00000003 00008914 be802cd8 be802c90 5fc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be802ed4 5fe0: 000c74cc be802c80 00028000 b6e7d14c Surprisingly, with the patch I suggested, there is still a backtrace generated, but without lockdep warning: WARNING: CPU: 0 PID: 87 at net/core/dev.c:5207 __netif_rx+0xd4/0x210 CPU: 0 UID: 0 PID: 87 Comm: ip Not tainted 6.11.0-dirty #4 Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x68/0x88 dump_stack_lvl from __warn+0x70/0x1ac __warn from warn_slowpath_fmt+0x12c/0x1c4 warn_slowpath_fmt from __netif_rx+0xd4/0x210 __netif_rx from rx_complete+0x140/0x27c rx_complete from musb_g_giveback+0xf0/0x1d8 musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10 musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498 musb_gadget_queue from usb_ep_queue+0x38/0x134 usb_ep_queue from rx_submit+0xdc/0x1cc rx_submit from rx_fill+0x7c/0xa0 rx_fill from eth_start+0x30/0x54 eth_start from eth_open+0x40/0x80 eth_open from __dev_open+0xe0/0x174 __dev_open from __dev_change_flags+0x160/0x1d0 __dev_change_flags from dev_change_flags+0x1c/0x58 dev_change_flags from devinet_ioctl+0x6d0/0x884 devinet_ioctl from inet_ioctl+0x1ac/0x2bc inet_ioctl from sock_ioctl+0x2c0/0x3b8 sock_ioctl from sys_ioctl+0x194/0xf88 sys_ioctl from ret_fast_syscall+0x0/0x1c Exception stack(0xe148dfa8 to 0xe148dff0) dfa0: 000b3630 00000001 00000003 00008914 be8c8cd8 be8c8c90 dfc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be8c8ed4 dfe0: 000c74cc be8c8c80 00028000 b6ed614c irq event stamp: 2542 hardirqs last enabled at (2541): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48 hardirqs last disabled at (2542): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68 softirqs last enabled at (2520): [<c0e727b0>] __dev_change_flags+0x7c/0x1d0 softirqs last disabled at (2518): [<c0e7236c>] dev_set_rx_mode+0x0/0x40 I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx() (which is not being called in the interrupt context after all), yet replacing it with __netif_rx() fixes the lockup (though a warning is still generated, which suggests that the patch does not completely fix the issue). -- Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
On Thu, 2024-09-26 at 21:39 +0200, Hubert Wiśniewski wrote: > I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx() > (which is not being called in the interrupt context after all), yet > replacing it with __netif_rx() fixes the lockup (though a warning is still > generated, which suggests that the patch does not completely fix the > issue). Well, never mind. After some investigation, I think the problem is as follows: 1. musb_g_giveback() releases the musb lock using spin_unlock(). The lock is now released, but hardirqs are still disabled. 2. Then, usb_gadget_giveback_request() is called, which in turn calls rx_complete(). This does not happen in the interrupt context, so netif_rx() disables bottom havles, then enables them using local_bh_enable(). 3. This leads to calling __local_bh_enable_ip(), which gives off a warning (the first backtrace) that hardirqs are disabled. Then, hardirqs are disabled (again?), and then enabled (as they should have been in the first place). 4. After usb_gadget_giveback_request() returns, musb_g_giveback() acquires the musb lock using spin_lock(). This does not disable hardirqs, so they are still enabled. 5. While the musb lock is acquired, an interrupt occurs. It is handled by dsps_interrupt(), which acquires the musb lock. A deadlock occurs. Replacing netif_rx() with __netif_rx() apparently fixes this part, as it does not lead to any change of hardirq state. There is still one problem though: rx_complete() is usually called from the interrupt context, except when the network interface is brought up. I think one solution would be to make musb_g_giveback() use spin_unlock_irqrestore() and spin_lock_irqsave(), but I would need to pass the flags to it somehow. Also, I am not sure how that would influence other drivers using musb. -- Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
On 2024-09-27 15:33:35 [+0200], Hubert Wiśniewski wrote: > On Thu, 2024-09-26 at 21:39 +0200, Hubert Wiśniewski wrote: > > I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx() > > (which is not being called in the interrupt context after all), yet > > replacing it with __netif_rx() fixes the lockup (though a warning is still > > generated, which suggests that the patch does not completely fix the > > issue). > > Well, never mind. After some investigation, I think the problem is as > follows: > > 1. musb_g_giveback() releases the musb lock using spin_unlock(). The lock > is now released, but hardirqs are still disabled. > > 2. Then, usb_gadget_giveback_request() is called, which in turn calls > rx_complete(). This does not happen in the interrupt context, so netif_rx() > disables bottom havles, then enables them using local_bh_enable(). > > 3. This leads to calling __local_bh_enable_ip(), which gives off a warning > (the first backtrace) that hardirqs are disabled. Then, hardirqs are > disabled (again?), and then enabled (as they should have been in the first > place). > > 4. After usb_gadget_giveback_request() returns, musb_g_giveback() acquires > the musb lock using spin_lock(). This does not disable hardirqs, so they > are still enabled. > > 5. While the musb lock is acquired, an interrupt occurs. It is handled by > dsps_interrupt(), which acquires the musb lock. A deadlock occurs. This all makes sense so far. > Replacing netif_rx() with __netif_rx() apparently fixes this part, as it > does not lead to any change of hardirq state. There is still one problem > though: rx_complete() is usually called from the interrupt context, except > when the network interface is brought up. __netif_rx() has an assert which should complain if you use __netif_rx(). Further in this case you pass the skb to backlog but never kick it for processing. Which means it is delayed until a random interrupt notices and processes it. > I think one solution would be to make musb_g_giveback() use > spin_unlock_irqrestore() and spin_lock_irqsave(), but I would need to pass > the flags to it somehow. Also, I am not sure how that would influence other > drivers using musb. I would also suggest to do this since the other solution is not safe/ correct. There is the ->busy assignment which should cover for the most cases. If you drop the lock without enabling interrupts then the interrupt can't do anything to the EP and other enqueue/ dequeue invocation is not possible if run on UP. On the other hand am335x was used on PREEMPT_RT and it runs a UP machine into SMP so that should be covered :) While looking at it, dequeue/ enqueue during complete callback looks safe due to the busy flag. Sebastian
On Fri, 2024-09-27 at 16:12 +0200, Sebastian Andrzej Siewior wrote: > On 2024-09-27 15:33:35 [+0200], Hubert Wiśniewski wrote: > > On Thu, 2024-09-26 at 21:39 +0200, Hubert Wiśniewski wrote: > > > I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx() > > > (which is not being called in the interrupt context after all), yet > > > replacing it with __netif_rx() fixes the lockup (though a warning is still > > > generated, which suggests that the patch does not completely fix the > > > issue). > > > > Well, never mind. After some investigation, I think the problem is as > > follows: > > > > 1. musb_g_giveback() releases the musb lock using spin_unlock(). The lock > > is now released, but hardirqs are still disabled. > > > > 2. Then, usb_gadget_giveback_request() is called, which in turn calls > > rx_complete(). This does not happen in the interrupt context, so netif_rx() > > disables bottom havles, then enables them using local_bh_enable(). > > > > 3. This leads to calling __local_bh_enable_ip(), which gives off a warning > > (the first backtrace) that hardirqs are disabled. Then, hardirqs are > > disabled (again?), and then enabled (as they should have been in the first > > place). > > > > 4. After usb_gadget_giveback_request() returns, musb_g_giveback() acquires > > the musb lock using spin_lock(). This does not disable hardirqs, so they > > are still enabled. > > > > 5. While the musb lock is acquired, an interrupt occurs. It is handled by > > dsps_interrupt(), which acquires the musb lock. A deadlock occurs. > > This all makes sense so far. I have done more testing on this. It seems that this deadlock possibility reported by lockdep is not the cause, but just a symptom. For now, my conclusion is that the problem lies in the MUSB gadget driver itself. Interrupts (in peripheral mode) on Rx endpoints are handled by musb_g_rx(), which pulls requests from EP request queue. If there is no request queued, it just returns without clearing the RXPKTRDY flag in the RXCSR register (but the interrupt flag in the glue layer register has been already cleared by the glue layer IRQ handler). This makes the received packet wait for the next interrupt. If the Rx FIFO is full, no more packets are received and no more interrupts are generated. The EP stays locked up forever (or until the RXPKTRDY flag is cleared manually :)). From what I have learned, the request queue being empty just happens sometimes and it is not en error. This bug became exposed by the new behaviour of netif_rx(). When BHs are enabled, hardirqs are enabled too (for a moment) which causes the Rx interrupt to be handled before a request is enqueued. If there are enough such unhandled packets, the EP gets locked up. > > Replacing netif_rx() with __netif_rx() apparently fixes this part, as it > > does not lead to any change of hardirq state. There is still one problem > > though: rx_complete() is usually called from the interrupt context, except > > when the network interface is brought up. > > __netif_rx() has an assert which should complain if you use > __netif_rx(). Further in this case you pass the skb to backlog but never > kick it for processing. Which means it is delayed until a random > interrupt notices and processes it. Now I see that it was a bad idea. I just found this using git bisect. > > I think one solution would be to make musb_g_giveback() use > > spin_unlock_irqrestore() and spin_lock_irqsave(), but I would need to pass > > the flags to it somehow. Also, I am not sure how that would influence other > > drivers using musb. > > I would also suggest to do this since the other solution is not safe/ > correct. There is the ->busy assignment which should cover for the most > cases. If you drop the lock without enabling interrupts then the > interrupt can't do anything to the EP and other enqueue/ dequeue > invocation is not possible if run on UP. On the other hand am335x was > used on PREEMPT_RT and it runs a UP machine into SMP so that should be > covered :) > > While looking at it, dequeue/ enqueue during complete callback looks > safe due to the busy flag. I think it is not needed now. After I have modified the interrupt handling code to clear the RXPKTRDY flag if there is no request queued and the FIFO is full, neither __local_bh_enable_ip() nor lockdep complain (tested on SMP and UP, with and without PREEMPT, on AM3358 and A64). It would probably by nicer to ensure that no MUSB interrupts are handled when a MUSB request callback is invoked from musb_g_giveback() (e.g. by disabling MUSB interrupts before releasing the lock and enabling them after acquiring it), but that could cause some side effects if the callback relied on MUSB interrupts being enabled. And since there are no warnings and everything works... I guess it is time to submit another patch then and to forget about this one. Thank you for your time! -- Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
© 2016 - 2024 Red Hat, Inc.