[v1] usb: musb: Fix hardware lockup on first Rx endpoint request

[PATCH] usb: musb: Fix hardware lockup on first Rx endpoint request

Posted by Hubert Wiśniewski 1 year, 4 months ago

There is a possibility that a request's callback could be invoked from
usb_ep_queue() (call trace below, supplemented with missing calls):

req->complete from usb_gadget_giveback_request
	(drivers/usb/gadget/udc/core.c:999)
usb_gadget_giveback_request from musb_g_giveback
	(drivers/usb/musb/musb_gadget.c:147)
musb_g_giveback from rxstate
	(drivers/usb/musb/musb_gadget.c:784)
rxstate from musb_ep_restart
	(drivers/usb/musb/musb_gadget.c:1169)
musb_ep_restart from musb_ep_restart_resume_work
	(drivers/usb/musb/musb_gadget.c:1176)
musb_ep_restart_resume_work from musb_queue_resume_work
	(drivers/usb/musb/musb_core.c:2279)
musb_queue_resume_work from musb_gadget_queue
	(drivers/usb/musb/musb_gadget.c:1241)
musb_gadget_queue from usb_ep_queue
	(drivers/usb/gadget/udc/core.c:300)

According to the docstring of usb_ep_queue(), this should not happen:

"Note that @req's ->complete() callback must never be called from within
usb_ep_queue() as that can create deadlock situations."

In fact, a hardware lockup might occur in the following sequence:

1. The gadget is initialized using musb_gadget_enable().
2. Meanwhile, a packet arrives, and the RXPKTRDY flag is set, raising an
   interrupt.
3. If IRQs are enabled, the interrupt is handled, but musb_g_rx() finds an
   empty queue (next_request() returns NULL). The interrupt flag has
   already been cleared by the glue layer handler, but the RXPKTRDY flag
   remains set.
4. The first request is enqueued using usb_ep_queue(), leading to the call
   of req->complete(), as shown in the call trace above.
5. If the callback enables IRQs and another packet is waiting, step (3)
   repeats. The request queue is empty because usb_g_giveback() removes the
   request before invoking the callback.
6. The endpoint remains locked up, as the interrupt triggered by hardware
   setting the RXPKTRDY flag has been handled, but the flag itself remains
   set.

For this scenario to occur, it is only necessary for IRQs to be enabled at
some point during the complete callback. This happens with the USB Ethernet
gadget, whose rx_complete() callback calls netif_rx(). If called in the
task context, netif_rx() disables the bottom halves (BHs). When the BHs are
re-enabled, IRQs are also enabled to allow soft IRQs to be processed. The
gadget itself is initialized at module load (or at boot if built-in), but
the first request is enqueued when the network interface is brought up,
triggering rx_complete() in the task context via ioctl(). If a packet
arrives while the interface is down, it can prevent the interface from
receiving any further packets from the USB host.

The situation is quite complicated with many parties involved. This
particular issue can be resolved in several possible ways:

1. Ensure that callbacks never enable IRQs. This would be difficult to
   enforce, as discovering how netif_rx() interacts with interrupts was
   already quite challenging and u_ether is not the only function driver.
   Similar "bugs" could be hidden in other drivers as well.
2. Disable MUSB interrupts in musb_g_giveback() before calling the callback
   and re-enable them afterwars (by calling musb_{dis,en}able_interrupts(),
   for example). This would ensure that MUSB interrupts are not handled
   during the callback, even if IRQs are enabled. In fact, it would allow
   IRQs to be enabled when releasing the lock. However, this feels like an
   inelegant hack.
3. Modify the interrupt handler to clear the RXPKTRDY flag if the request
   queue is empty. While this approach also feels like a hack, it wastes
   CPU time by attempting to handle incoming packets when the software is
   not ready to process them.
4. Flush the Rx FIFO instead of calling rxstate() in musb_ep_restart().
   This ensures that the hardware can receive packets when there is at
   least one request in the queue. Once IRQs are enabled, the interrupt
   handler will be able to correctly process the next incoming packet
   (eventually calling rxstate()). This approach may cause one or two
   packets to be dropped (two if double buffering is enabled), but this
   seems to be a minor issue, as packet loss can occur when the software is
   not yet ready to process them. Additionally, this solution makes the
   gadget driver compliant with the rule mentioned in the docstring of
   usb_ep_queue().

There may be additional solutions, but from these four, the last one has
been chosen as it seems to be the most appropriate, as it addresses the
"bad" behavior of the driver.

Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
---
 drivers/usb/musb/musb_gadget.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/musb/musb_gadget.c b/drivers/usb/musb/musb_gadget.c
index bdf13911a1e5..c6076df0d50c 100644
--- a/drivers/usb/musb/musb_gadget.c
+++ b/drivers/usb/musb/musb_gadget.c
@@ -1161,12 +1161,19 @@ void musb_free_request(struct usb_ep *ep, struct usb_request *req)
  */
 void musb_ep_restart(struct musb *musb, struct musb_request *req)
 {
+	u16 csr;
+	void __iomem *epio = req->ep->hw_ep->regs;
+
 	trace_musb_req_start(req);
 	musb_ep_select(musb->mregs, req->epnum);
-	if (req->tx)
+	if (req->tx) {
 		txstate(musb, req);
-	else
-		rxstate(musb, req);
+	} else {
+		csr = musb_readw(epio, MUSB_RXCSR);
+		csr |= MUSB_RXCSR_FLUSHFIFO | MUSB_RXCSR_P_WZC_BITS;
+		musb_writew(epio, MUSB_RXCSR, csr);
+		musb_writew(epio, MUSB_RXCSR, csr);
+	}
 }
 
 static int musb_ep_restart_resume_work(struct musb *musb, void *data)

base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
-- 
2.30.2

Re: [PATCH] usb: musb: Fix hardware lockup on first Rx endpoint request

Posted by Greg Kroah-Hartman 1 year, 3 months ago

On Sat, Oct 05, 2024 at 03:19:10PM +0200, Hubert Wiśniewski wrote:
> There is a possibility that a request's callback could be invoked from
> usb_ep_queue() (call trace below, supplemented with missing calls):
> 
> req->complete from usb_gadget_giveback_request
> 	(drivers/usb/gadget/udc/core.c:999)
> usb_gadget_giveback_request from musb_g_giveback
> 	(drivers/usb/musb/musb_gadget.c:147)
> musb_g_giveback from rxstate
> 	(drivers/usb/musb/musb_gadget.c:784)
> rxstate from musb_ep_restart
> 	(drivers/usb/musb/musb_gadget.c:1169)
> musb_ep_restart from musb_ep_restart_resume_work
> 	(drivers/usb/musb/musb_gadget.c:1176)
> musb_ep_restart_resume_work from musb_queue_resume_work
> 	(drivers/usb/musb/musb_core.c:2279)
> musb_queue_resume_work from musb_gadget_queue
> 	(drivers/usb/musb/musb_gadget.c:1241)
> musb_gadget_queue from usb_ep_queue
> 	(drivers/usb/gadget/udc/core.c:300)
> 
> According to the docstring of usb_ep_queue(), this should not happen:
> 
> "Note that @req's ->complete() callback must never be called from within
> usb_ep_queue() as that can create deadlock situations."
> 
> In fact, a hardware lockup might occur in the following sequence:
> 
> 1. The gadget is initialized using musb_gadget_enable().
> 2. Meanwhile, a packet arrives, and the RXPKTRDY flag is set, raising an
>    interrupt.
> 3. If IRQs are enabled, the interrupt is handled, but musb_g_rx() finds an
>    empty queue (next_request() returns NULL). The interrupt flag has
>    already been cleared by the glue layer handler, but the RXPKTRDY flag
>    remains set.
> 4. The first request is enqueued using usb_ep_queue(), leading to the call
>    of req->complete(), as shown in the call trace above.
> 5. If the callback enables IRQs and another packet is waiting, step (3)
>    repeats. The request queue is empty because usb_g_giveback() removes the
>    request before invoking the callback.
> 6. The endpoint remains locked up, as the interrupt triggered by hardware
>    setting the RXPKTRDY flag has been handled, but the flag itself remains
>    set.
> 
> For this scenario to occur, it is only necessary for IRQs to be enabled at
> some point during the complete callback. This happens with the USB Ethernet
> gadget, whose rx_complete() callback calls netif_rx(). If called in the
> task context, netif_rx() disables the bottom halves (BHs). When the BHs are
> re-enabled, IRQs are also enabled to allow soft IRQs to be processed. The
> gadget itself is initialized at module load (or at boot if built-in), but
> the first request is enqueued when the network interface is brought up,
> triggering rx_complete() in the task context via ioctl(). If a packet
> arrives while the interface is down, it can prevent the interface from
> receiving any further packets from the USB host.
> 
> The situation is quite complicated with many parties involved. This
> particular issue can be resolved in several possible ways:
> 
> 1. Ensure that callbacks never enable IRQs. This would be difficult to
>    enforce, as discovering how netif_rx() interacts with interrupts was
>    already quite challenging and u_ether is not the only function driver.
>    Similar "bugs" could be hidden in other drivers as well.
> 2. Disable MUSB interrupts in musb_g_giveback() before calling the callback
>    and re-enable them afterwars (by calling musb_{dis,en}able_interrupts(),
>    for example). This would ensure that MUSB interrupts are not handled
>    during the callback, even if IRQs are enabled. In fact, it would allow
>    IRQs to be enabled when releasing the lock. However, this feels like an
>    inelegant hack.
> 3. Modify the interrupt handler to clear the RXPKTRDY flag if the request
>    queue is empty. While this approach also feels like a hack, it wastes
>    CPU time by attempting to handle incoming packets when the software is
>    not ready to process them.
> 4. Flush the Rx FIFO instead of calling rxstate() in musb_ep_restart().
>    This ensures that the hardware can receive packets when there is at
>    least one request in the queue. Once IRQs are enabled, the interrupt
>    handler will be able to correctly process the next incoming packet
>    (eventually calling rxstate()). This approach may cause one or two
>    packets to be dropped (two if double buffering is enabled), but this
>    seems to be a minor issue, as packet loss can occur when the software is
>    not yet ready to process them. Additionally, this solution makes the
>    gadget driver compliant with the rule mentioned in the docstring of
>    usb_ep_queue().
> 
> There may be additional solutions, but from these four, the last one has
> been chosen as it seems to be the most appropriate, as it addresses the
> "bad" behavior of the driver.
> 
> Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
> ---
>  drivers/usb/musb/musb_gadget.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/musb/musb_gadget.c b/drivers/usb/musb/musb_gadget.c
> index bdf13911a1e5..c6076df0d50c 100644
> --- a/drivers/usb/musb/musb_gadget.c
> +++ b/drivers/usb/musb/musb_gadget.c
> @@ -1161,12 +1161,19 @@ void musb_free_request(struct usb_ep *ep, struct usb_request *req)
>   */
>  void musb_ep_restart(struct musb *musb, struct musb_request *req)
>  {
> +	u16 csr;
> +	void __iomem *epio = req->ep->hw_ep->regs;
> +
>  	trace_musb_req_start(req);
>  	musb_ep_select(musb->mregs, req->epnum);
> -	if (req->tx)
> +	if (req->tx) {
>  		txstate(musb, req);
> -	else
> -		rxstate(musb, req);
> +	} else {
> +		csr = musb_readw(epio, MUSB_RXCSR);
> +		csr |= MUSB_RXCSR_FLUSHFIFO | MUSB_RXCSR_P_WZC_BITS;
> +		musb_writew(epio, MUSB_RXCSR, csr);
> +		musb_writew(epio, MUSB_RXCSR, csr);
> +	}
>  }
>  
>  static int musb_ep_restart_resume_work(struct musb *musb, void *data)
> 
> base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
> -- 
> 2.30.2
> 
> 

What commit id does this fix?

thanks,

greg k-h

Re: [PATCH] usb: musb: Fix hardware lockup on first Rx endpoint request

Posted by Hubert Wiśniewski 1 year, 3 months ago

On Wed, 2024-10-16 at 10:27 +0200, Greg Kroah-Hartman wrote:
> What commit id does this fix?

This does not fix any commit in particular. Bisecting reveals baebdf48c360
("net: dev: Makes sure netif_rx() can be invoked in any context."), but it
did not create the problem, it just exposed it by changing the behaviour of
netif_rx(). Callback could be called from usb_ep_queue() since the very
beginning of the MUSB driver introduced by 550a7375fe72 ("USB: Add MUSB and
TUSB support").

Re: [PATCH] usb: musb: Fix hardware lockup on first Rx endpoint request

Posted by Greg Kroah-Hartman 1 year, 3 months ago

On Fri, Oct 18, 2024 at 07:51:54AM +0200, Hubert Wiśniewski wrote:
> On Wed, 2024-10-16 at 10:27 +0200, Greg Kroah-Hartman wrote:
> > What commit id does this fix?
> 
> This does not fix any commit in particular. Bisecting reveals baebdf48c360
> ("net: dev: Makes sure netif_rx() can be invoked in any context."), but it
> did not create the problem, it just exposed it by changing the behaviour of
> netif_rx(). Callback could be called from usb_ep_queue() since the very
> beginning of the MUSB driver introduced by 550a7375fe72 ("USB: Add MUSB and
> TUSB support").
> 

Great, can you resubmit this with a proper "Fixes:" tag on it so we can
backport it to the needed stable kernel(s)?

thanks,

greg k-h