xHCI: endpoint state maintainability and small fixes

[PATCH 0/6] xHCI: endpoint state maintainability and small fixes

Posted by Michal Pecio 11 months ago

These patches are mostly independent, except
- 2/6 depends on 1/6
- 6/6 depends on 4/5 and 5/6

It is assumed that issues with EP_STALLED are resolved like below.

They document assumptions currently made by xhci_urb_dequeue() and
xhci_handle_cmd_stop_ep() and clean up this code a little to make it
more maintainable.

Some potential issues with no known significant impact are fixed.
I haven't tagged them for stable. Maybe 5/6 could go, just in case?

Michal

--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1770,7 +1770,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 	}
 
 	/* In this case no commands are pending but the endpoint is stopped */
-	if (ep->ep_state & EP_CLEARING_TT) {
+	if (ep->ep_state & (EP_CLEARING_TT | EP_STALLED)) {
 		/* and cancelled TDs can be given back right away */
 		xhci_dbg(xhci, "Invalidating TDs instantly on slot %d ep %d in state 0x%x\n",
 				urb->dev->slot_id, ep_index, ep->ep_state);
@@ -3207,10 +3207,14 @@ static void xhci_endpoint_reset(struct usb_hcd *hcd,
 		return;
 
 	ep = &vdev->eps[ep_index];
+
+	spin_lock_irqsave(&xhci->lock, flags);
+
+	/* Unblock the endpoint as device side is unstalled now */
 	ep->ep_state &= ~EP_STALLED;
+	xhci_ring_doorbell_for_active_rings(xhci, udev->slot_id, ep_index);
 
 	/* Bail out if toggle is already being cleared by a endpoint reset */
-	spin_lock_irqsave(&xhci->lock, flags);
 	if (ep->ep_state & EP_HARD_CLEAR_TOGGLE) {
 		ep->ep_state &= ~EP_HARD_CLEAR_TOGGLE;
 		spin_unlock_irqrestore(&xhci->lock, flags);

Michal Pecio (6):
  usb: xhci: Document endpoint state management
  usb: xhci: Deduplicate some endpoint state flag lists
  usb: xhci: Only set EP_HARD_CLEAR_TOGGLE after queuing Reset Endpoint
  usb: xhci: Don't change the status of stalled TDs on failed Stop EP
  usb: xhci: Avoid Stop Endpoint retry loop if the endpoint seems
    Running
  usb: xhci: Update comments about Stop Endpoint races

 drivers/usb/host/xhci-ring.c | 102 ++++++++++++++++++++---------------
 drivers/usb/host/xhci.c      |  16 ++++--
 drivers/usb/host/xhci.h      |  58 +++++++++++++++-----
 3 files changed, 115 insertions(+), 61 deletions(-)

-- 
2.48.1

Re: [PATCH 0/6] xHCI: endpoint state maintainability and small fixes

Posted by Mathias Nyman 11 months ago

On 10.3.2025 10.36, Michal Pecio wrote:
> These patches are mostly independent, except
> - 2/6 depends on 1/6
> - 6/6 depends on 4/5 and 5/6
> 
> It is assumed that issues with EP_STALLED are resolved like below.
> 
> They document assumptions currently made by xhci_urb_dequeue() and
> xhci_handle_cmd_stop_ep() and clean up this code a little to make it
> more maintainable.
> 
> Some potential issues with no known significant impact are fixed.
> I haven't tagged them for stable. Maybe 5/6 could go, just in case?
> 
> Michal

I'll send a small fixup series to Greg for usb-next (6.15).

We are getting late in the cycle so I'll try to keep is as short as
possible. I'll cherry-pick 4/6 and 5/6 from this series to it.
  
Thanks
Mathias

[PATCH 1/6] usb: xhci: Document endpoint state management

Posted by Michal Pecio 11 months ago

Add systematic comments describing xhci_virt_ep.ep_state flags and
their relation to xHC endpoint state.

Add a few paragraphs about how they are used to track and manage the
state of endpoints.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci.h | 44 +++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 56ed1b817f91..46bbdc97cc4b 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -662,20 +662,18 @@ struct xhci_virt_ep {
 	 */
 	struct xhci_ring		*new_ring;
 	unsigned int			err_count;
+	/* Endpoint state, state transitions, pending operations. See also notes below. */
 	unsigned int			ep_state;
-#define SET_DEQ_PENDING		(1 << 0)
-#define EP_HALTED		(1 << 1)	/* Halted host ep handling */
-#define EP_STOP_CMD_PENDING	(1 << 2)	/* For URB cancellation */
-/* Transitioning the endpoint to using streams, don't enqueue URBs */
-#define EP_GETTING_STREAMS	(1 << 3)
-#define EP_HAS_STREAMS		(1 << 4)
-/* Transitioning the endpoint to not using streams, don't enqueue URBs */
-#define EP_GETTING_NO_STREAMS	(1 << 5)
-#define EP_HARD_CLEAR_TOGGLE	(1 << 6)
-#define EP_SOFT_CLEAR_TOGGLE	(1 << 7)
-/* usb_hub_clear_tt_buffer is in progress */
-#define EP_CLEARING_TT		(1 << 8)
-#define EP_STALLED		(1 << 9)	/* For stall handling */
+#define SET_DEQ_PENDING		BIT(0)	/* EP Stopped, Set TR Dequeue pending */
+#define EP_HALTED		BIT(1)	/* EP Halted -> Stopped, Reset Endpoint pending */
+#define EP_STOP_CMD_PENDING	BIT(2)	/* EP Running -> Stopped, Stop Endpoint pending */
+#define EP_GETTING_STREAMS	BIT(3)	/* Transitioning to streams, no URBs allowed */
+#define EP_HAS_STREAMS		BIT(4)	/* Streams are enabled */
+#define EP_GETTING_NO_STREAMS	BIT(5)	/* Transitioning to no streams, no URBs allowed */
+#define EP_HARD_CLEAR_TOGGLE	BIT(6)	/* Toggle is being or has been cleared by reset */
+#define EP_SOFT_CLEAR_TOGGLE	BIT(7)	/* Software toggle clearing, no URBs allowed */
+#define EP_CLEARING_TT		BIT(8)	/* EP not Running, Transaction Translator reset */
+#define EP_STALLED		BIT(9)	/* EP not Running, waiting for usb_clear_halt() */
 	/* ----  Related to URB cancellation ---- */
 	struct list_head	cancelled_td_list;
 	struct xhci_hcd		*xhci;
@@ -703,6 +701,26 @@ struct xhci_virt_ep {
 	bool			use_extended_tbc;
 };
 
+/*
+ * Endpoint state notes.
+ * xHCI 4.8.3 defines three basic states of an enabled endpoint: Running, Halted, Stopped.
+ * The fourth Error state is avoided by not queuing invalid TRBs. This seems to work so far.
+ *
+ * 4.8.3 warns that Endpoint Context EP State field may be stale, and it doesn't indicate ongoing
+ * transitions anyway. Some xhci_virt_ep.ep_state flags are used to keep track of known transitions
+ * and various operations which imply or require the endpoint to be in a particular state.
+ *
+ * Notably missing is the Stopped -> Running transition started by a doorbell ring. 4.8.3 suggests
+ * that it is instantenous, but on some common HCs it may not be. There is no universal way to know
+ * when it completes. Transfer events imply completion, but don't arrive if the device NAKs/NRDYs.
+ * Stop Endpoint fails if the transition isn't complete. Polling the Endpoint Context is an option.
+ *
+ * xhci_ring_ep_doorbell() inspects the flags to decide if the endpoint can be restarted. Another
+ * user is xhci_urb_dequeue(), which must not attempt to stop a Stopped endpoint, due to HW bugs.
+ * An endpoint with pending URBs and no flags preventing restart must be Running for this to work.
+ * Call xhci_ring_doorbell_for_active_rings() or similar after clearing any such flag.
+ */
+
 enum xhci_overhead_type {
 	LS_OVERHEAD_TYPE = 0,
 	FS_OVERHEAD_TYPE,
-- 
2.48.1

Re: [PATCH 1/6] usb: xhci: Document endpoint state management

Posted by Mathias Nyman 11 months ago

On 10.3.2025 10.36, Michal Pecio wrote:
> Add systematic comments describing xhci_virt_ep.ep_state flags and
> their relation to xHC endpoint state.
> 
> Add a few paragraphs about how they are used to track and manage the
> state of endpoints.
> 
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>   drivers/usb/host/xhci.h | 44 +++++++++++++++++++++++++++++------------
>   1 file changed, 31 insertions(+), 13 deletions(-)

Nice cleanup and clarification

Thanks
Mathias

[PATCH 2/6] usb: xhci: Deduplicate some endpoint state flag lists

Posted by Michal Pecio 11 months ago

xhci_ring_endpoint_doorbell() needs a list of flags which prohibit
running the endpoint.

xhci_urb_dequeue() needs the same list, split in two parts, to know
whether the endpoint is running and how to cancel TDs.

Define the two partial lists in xhci.h and use them in both functions.

Add a comment about the AMD Stop Endpoint bug, see commit 28a2369f7d72
("usb: xhci: Issue stop EP command only when the EP state is running")

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 10 ++--------
 drivers/usb/host/xhci.c      | 16 +++++++++++-----
 drivers/usb/host/xhci.h      | 16 +++++++++++++++-
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 0f8acbb9cd21..8aab077d6183 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -555,14 +555,8 @@ void xhci_ring_ep_doorbell(struct xhci_hcd *xhci,
 	struct xhci_virt_ep *ep = &xhci->devs[slot_id]->eps[ep_index];
 	unsigned int ep_state = ep->ep_state;
 
-	/* Don't ring the doorbell for this endpoint if there are pending
-	 * cancellations because we don't want to interrupt processing.
-	 * We don't want to restart any stream rings if there's a set dequeue
-	 * pointer command pending because the device can choose to start any
-	 * stream once the endpoint is on the HW schedule.
-	 */
-	if (ep_state & (EP_STOP_CMD_PENDING | SET_DEQ_PENDING | EP_HALTED |
-			EP_CLEARING_TT | EP_STALLED))
+	/* Don't start yet if certain endpoint operations are ongoing */
+	if (ep_state & (EP_CANCEL_PENDING | EP_MISC_OPS_PENDING))
 		return;
 
 	trace_xhci_ring_ep_doorbell(slot_id, DB_VALUE(ep_index, stream_id));
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 7492090fad5f..c33134a3003a 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1762,16 +1762,22 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		}
 	}
 
-	/* These completion handlers will sort out cancelled TDs for us */
-	if (ep->ep_state & (EP_STOP_CMD_PENDING | EP_HALTED | SET_DEQ_PENDING)) {
+	/*
+	 * We have a few strategies to give back the cancelled TDs. If the endpoint is running,
+	 * no other choice - it must be stopped. But if it's not, we avoid queuing Stop Endpoint
+	 * because this triggers a bug in "AMD SNPS 3.1 xHC" and because our completion handler
+	 * is complex enough already without having to worry about such things.
+	 */
+
+	/* If cancellation is already running, giveback of all cancelled TDs is guaranteed */
+	if (ep->ep_state & EP_CANCEL_PENDING) {
 		xhci_dbg(xhci, "Not queuing Stop Endpoint on slot %d ep %d in state 0x%x\n",
 				urb->dev->slot_id, ep_index, ep->ep_state);
 		goto done;
 	}
 
-	/* In this case no commands are pending but the endpoint is stopped */
-	if (ep->ep_state & (EP_CLEARING_TT | EP_STALLED)) {
-		/* and cancelled TDs can be given back right away */
+	/* Cancel immediately if no commands are pending but the endpoint is held stopped */
+	if (ep->ep_state & EP_MISC_OPS_PENDING) {
 		xhci_dbg(xhci, "Invalidating TDs instantly on slot %d ep %d in state 0x%x\n",
 				urb->dev->slot_id, ep_index, ep->ep_state);
 		xhci_process_cancelled_tds(ep);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 46bbdc97cc4b..87d87ed08b8b 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -718,9 +718,23 @@ struct xhci_virt_ep {
  * xhci_ring_ep_doorbell() inspects the flags to decide if the endpoint can be restarted. Another
  * user is xhci_urb_dequeue(), which must not attempt to stop a Stopped endpoint, due to HW bugs.
  * An endpoint with pending URBs and no flags preventing restart must be Running for this to work.
- * Call xhci_ring_doorbell_for_active_rings() or similar after clearing any such flag.
+ * Call xhci_ring_doorbell_for_active_rings() or similar after clearing flags on the lists below.
  */
 
+/*
+ * TD cancellation is in progress. New TDs can be marked as cancelled without further action and
+ * indeed no such action is possible until these commands complete. Their handlers must check for
+ * more cancelled TDs and continue until all are given back. The endpoint must not be restarted.
+ */
+#define EP_CANCEL_PENDING (SET_DEQ_PENDING | EP_HALTED | EP_STOP_CMD_PENDING)
+
+/*
+ * Some other operations are pending which preclude restarting the endpoint. If the endpoint isn't
+ * transitioning to the Stopped state, it has already reached this state and stays in it.
+ */
+#define EP_MISC_OPS_PENDING (EP_CLEARING_TT | EP_STALLED)
+
+
 enum xhci_overhead_type {
 	LS_OVERHEAD_TYPE = 0,
 	FS_OVERHEAD_TYPE,
-- 
2.48.1

Re: [PATCH 2/6] usb: xhci: Deduplicate some endpoint state flag lists

Posted by Mathias Nyman 11 months ago

On 10.3.2025 10.37, Michal Pecio wrote:
> xhci_ring_endpoint_doorbell() needs a list of flags which prohibit
> running the endpoint.
> 
> xhci_urb_dequeue() needs the same list, split in two parts, to know
> whether the endpoint is running and how to cancel TDs.
> 
> Define the two partial lists in xhci.h and use them in both functions.
> 
> Add a comment about the AMD Stop Endpoint bug, see commit 28a2369f7d72
> ("usb: xhci: Issue stop EP command only when the EP state is running")
> 
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> --->   drivers/usb/host/xhci-ring.c | 10 ++--------
>   drivers/usb/host/xhci.c      | 16 +++++++++++-----
>   drivers/usb/host/xhci.h      | 16 +++++++++++++++-
>   3 files changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 0f8acbb9cd21..8aab077d6183 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -555,14 +555,8 @@ void xhci_ring_ep_doorbell(struct xhci_hcd *xhci,
>   	struct xhci_virt_ep *ep = &xhci->devs[slot_id]->eps[ep_index];
>   	unsigned int ep_state = ep->ep_state;
>   
> -	/* Don't ring the doorbell for this endpoint if there are pending
> -	 * cancellations because we don't want to interrupt processing.
> -	 * We don't want to restart any stream rings if there's a set dequeue
> -	 * pointer command pending because the device can choose to start any
> -	 * stream once the endpoint is on the HW schedule.
> -	 */
> -	if (ep_state & (EP_STOP_CMD_PENDING | SET_DEQ_PENDING | EP_HALTED |
> -			EP_CLEARING_TT | EP_STALLED))
> +	/* Don't start yet if certain endpoint operations are ongoing */
> +	if (ep_state & (EP_CANCEL_PENDING | EP_MISC_OPS_PENDING))
>   		return;
>   >   	trace_xhci_ring_ep_doorbell(slot_id, DB_VALUE(ep_index, stream_id));
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 7492090fad5f..c33134a3003a 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -1762,16 +1762,22 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
>   		}
>   	}
>   
> -	/* These completion handlers will sort out cancelled TDs for us */
> -	if (ep->ep_state & (EP_STOP_CMD_PENDING | EP_HALTED | SET_DEQ_PENDING)) {
> +	/*
> +	 * We have a few strategies to give back the cancelled TDs. If the endpoint is running,
> +	 * no other choice - it must be stopped. But if it's not, we avoid queuing Stop Endpoint
> +	 * because this triggers a bug in "AMD SNPS 3.1 xHC" and because our completion handler
> +	 * is complex enough already without having to worry about such things.
> +	 */
> +
> +	/* If cancellation is already running, giveback of all cancelled TDs is guaranteed */
> +	if (ep->ep_state & EP_CANCEL_PENDING) {
>   		xhci_dbg(xhci, "Not queuing Stop Endpoint on slot %d ep %d in state 0x%x\n",
>   				urb->dev->slot_id, ep_index, ep->ep_state);
>   		goto done;
>   	}
>   
> -	/* In this case no commands are pending but the endpoint is stopped */
> -	if (ep->ep_state & (EP_CLEARING_TT | EP_STALLED)) {
> -		/* and cancelled TDs can be given back right away */
> +	/* Cancel immediately if no commands are pending but the endpoint is held stopped */
> +	if (ep->ep_state & EP_MISC_OPS_PENDING) {
>   		xhci_dbg(xhci, "Invalidating TDs instantly on slot %d ep %d in state 0x%x\n",
>   				urb->dev->slot_id, ep_index, ep->ep_state);
>   		xhci_process_cancelled_tds(ep);
> diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
> index 46bbdc97cc4b..87d87ed08b8b 100644
> --- a/drivers/usb/host/xhci.h
> +++ b/drivers/usb/host/xhci.h
> @@ -718,9 +718,23 @@ struct xhci_virt_ep {
>    * xhci_ring_ep_doorbell() inspects the flags to decide if the endpoint can be restarted. Another
>    * user is xhci_urb_dequeue(), which must not attempt to stop a Stopped endpoint, due to HW bugs.
>    * An endpoint with pending URBs and no flags preventing restart must be Running for this to work.
> - * Call xhci_ring_doorbell_for_active_rings() or similar after clearing any such flag.
> + * Call xhci_ring_doorbell_for_active_rings() or similar after clearing flags on the lists below.
>    */
>   
> +/*
> + * TD cancellation is in progress. New TDs can be marked as cancelled without further action and
> + * indeed no such action is possible until these commands complete. Their handlers must check for
> + * more cancelled TDs and continue until all are given back. The endpoint must not be restarted.
> + */
> +#define EP_CANCEL_PENDING (SET_DEQ_PENDING | EP_HALTED | EP_STOP_CMD_PENDING)
> +
> +/*
> + * Some other operations are pending which preclude restarting the endpoint. If the endpoint isn't
> + * transitioning to the Stopped state, it has already reached this state and stays in it.
> + */
> +#define EP_MISC_OPS_PENDING (EP_CLEARING_TT | EP_STALLED)
> +

Not sure this helps readability

It defines even more macros to abstract away something that is not complex enough.

It also gives false impression that EP_HALTED would somehow be more part of cancelling a
TD than EP_STALLED, when both of those are about returning a TD with an error due to
transfer issues detected by host, not class driver cancelling URBs

Thanks
Mathias

Re: [PATCH 2/6] usb: xhci: Deduplicate some endpoint state flag lists

Posted by Michał Pecio 11 months ago

On Mon, 10 Mar 2025 11:51:30 +0200, Mathias Nyman wrote:
> Not sure this helps readability
> 
> It defines even more macros to abstract away something that is not
> complex enough.

It was less about readability, but keeping these lists in one place
so that they don't get out of sync and trigger the double-stop bug.

With this change, a new flag like EP_STALLED only needs to be added
in one place and it's picked up by both functions which need it.

OTOH, maybe such flags aren't being added very often...

> It also gives false impression that EP_HALTED would somehow be more
> part of cancelling a TD than EP_STALLED, when both of those are about
> returning a TD with an error due to transfer issues detected by host,
> not class driver cancelling URBs.

I think EP_HALTED is about cancellation (among other things), because
it indicates that Reset EP handler will run and finish cancellation of
the halted TD and also any other TDs unlinked by class driver.

EP_STALLED and EP_CLEARING_TT are less about the halted TD and more
about ensuring full reset of the pipe before new TDs are executed.

Michal

[PATCH 3/6] usb: xhci: Only set EP_HARD_CLEAR_TOGGLE after queuing Reset Endpoint

Posted by Michal Pecio 11 months ago

The flag tells xhci_endpoint_reset() that toggle/sequence state
is being cleared or has been cleared by Reset Endpoint.

This only works if we actually queue the Reset Endpoint command.

Impact should be minimal, because the endpoint can't start running
with wrong toggle state if it's still halted, and class driver is
unlikely to usb_clear_halt() if the halted TD isn't given back (it
should normally unlink all URBs first before calling that).

But it looks wrong and could cause problems if the code changes.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 8aab077d6183..9e4940220252 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -988,7 +988,6 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 
 	/* add td to cancelled list and let reset ep handler take care of it */
 	if (reset_type == EP_HARD_RESET) {
-		ep->ep_state |= EP_HARD_CLEAR_TOGGLE;
 		if (td && list_empty(&td->cancelled_td_list)) {
 			list_add_tail(&td->cancelled_td_list, &ep->cancelled_td_list);
 			td->cancel_status = TD_HALTED;
@@ -1006,6 +1005,8 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 		return err;
 
 	ep->ep_state |= EP_HALTED;
+	if (reset_type == EP_HARD_RESET)
+		ep->ep_state |= EP_HARD_CLEAR_TOGGLE;
 
 	xhci_ring_cmd_db(xhci);
 
-- 
2.48.1

[PATCH 4/6] usb: xhci: Don't change the status of stalled TDs on failed Stop EP

Posted by Michal Pecio 11 months ago

When the device stalls an endpoint, current TD is assigned -EPIPE
status and Reset Endpoint is queued. If a Stop Endpoint is pending
at the time, it will run before Reset Endpoint and fail due to the
stall. Its handler will change TD's status to -EPROTO before Reset
Endpoint handler runs and initiates giveback.

Check if the stall has already been handled and don't try to do it
again. Since xhci_handle_halted_endpoint() performs this check too,
not overwriting td->status is the only difference.

I haven't seen this case yet, but I have seen a related one where
the xHC has already executed Reset Endpoint, EP Context state is
now Stopped and EP_HALTED is set. If the xHC took a bit longer to
execute Reset Endpoint, said case would become this one.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 9e4940220252..28ebc0fc5bc2 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1215,7 +1215,14 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	 */
 		switch (GET_EP_CTX_STATE(ep_ctx)) {
 		case EP_STATE_HALTED:
-			xhci_dbg(xhci, "Stop ep completion raced with stall, reset ep\n");
+			xhci_dbg(xhci, "Stop ep completion raced with stall\n");
+			/*
+			 * If the halt happened before Stop Endpoint failed, its transfer event
+			 * should have already been handled and Reset Endpoint should be pending.
+			 */
+			if (ep->ep_state & EP_HALTED)
+				goto reset_done;
+
 			if (ep->ep_state & EP_HAS_STREAMS) {
 				reset_type = EP_SOFT_RESET;
 			} else {
@@ -1226,8 +1233,11 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			}
 			/* reset ep, reset handler cleans up cancelled tds */
 			err = xhci_handle_halted_endpoint(xhci, ep, td, reset_type);
+			xhci_dbg(xhci, "Stop ep completion resetting ep, status %d\n", err);
 			if (err)
 				break;
+reset_done:
+			/* Reset EP handler will clean up cancelled TDs */
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
 		case EP_STATE_STOPPED:
-- 
2.48.1

Re: [PATCH 4/6] usb: xhci: Don't change the status of stalled TDs on failed Stop EP

Posted by Mathias Nyman 11 months ago

On 10.3.2025 10.40, Michal Pecio wrote:
> When the device stalls an endpoint, current TD is assigned -EPIPE
> status and Reset Endpoint is queued. If a Stop Endpoint is pending
> at the time, it will run before Reset Endpoint and fail due to the
> stall. Its handler will change TD's status to -EPROTO before Reset
> Endpoint handler runs and initiates giveback.
> 
> Check if the stall has already been handled and don't try to do it
> again. Since xhci_handle_halted_endpoint() performs this check too,
> not overwriting td->status is the only difference.
> 
> I haven't seen this case yet, but I have seen a related one where
> the xHC has already executed Reset Endpoint, EP Context state is
> now Stopped and EP_HALTED is set. If the xHC took a bit longer to
> execute Reset Endpoint, said case would become this one.
> 
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>

Makes sense, nice improvement

Thanks
Mathias

[PATCH 5/6] usb: xhci: Avoid Stop Endpoint retry loop if the endpoint seems Running

Posted by Michal Pecio 11 months ago

Nothing prevents a broken HC from claiming that an endpoint is Running
and repeatedly rejecting Stop Endpoint with Context State Error.

Avoid infinite retries and give back cancelled TDs.

No such cases known so far, but HCs have bugs.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 28ebc0fc5bc2..241cd82672a6 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1259,16 +1259,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			 * Stopped state, but it will soon change to Running.
 			 *
 			 * Assume this bug on unexpected Stop Endpoint failures.
-			 * Keep retrying until the EP starts and stops again, on
-			 * chips where this is known to help. Wait for 100ms.
+			 * Keep retrying until the EP starts and stops again.
 			 */
-			if (time_is_before_jiffies(ep->stop_time + msecs_to_jiffies(100)))
-				break;
 			fallthrough;
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
 			xhci_dbg(xhci, "Stop ep completion ctx error, ctx_state %d\n",
 					GET_EP_CTX_STATE(ep_ctx));
+			/*
+			 * Don't retry forever if we guessed wrong or a defective HC never starts
+			 * the EP or says 'Running' but fails the command. We must give back TDs.
+			 */
+			if (time_is_before_jiffies(ep->stop_time + msecs_to_jiffies(100)))
+				break;
 
 			command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
 			if (!command) {
-- 
2.48.1

[PATCH 6/6] usb: xhci: Update comments about Stop Endpoint races

Posted by Michal Pecio 11 months ago

The switch statement has grown beyond its original EP_STATE_HALTED case,
so move related comments inside this case and shorten them somewhat.

Shorten EP_STATE_STOPPED comments, add some common context at the top.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 70 ++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 241cd82672a6..759e8f612b4d 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1199,20 +1199,21 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	trace_xhci_handle_cmd_stop_ep(ep_ctx);
 
 	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
-	/*
-	 * If stop endpoint command raced with a halting endpoint we need to
-	 * reset the host side endpoint first.
-	 * If the TD we halted on isn't cancelled the TD should be given back
-	 * with a proper error code, and the ring dequeue moved past the TD.
-	 * If streams case we can't find hw_deq, or the TD we halted on so do a
-	 * soft reset.
-	 *
-	 * Proper error code is unknown here, it would be -EPIPE if device side
-	 * of enadpoit halted (aka STALL), and -EPROTO if not (transaction error)
-	 * We use -EPROTO, if device is stalled it should return a stall error on
-	 * next transfer, which then will return -EPIPE, and device side stall is
-	 * noted and cleared by class driver.
-	 */
+
+		/*
+		 * This happens if the endpoint was not in the Running state. Its state now may
+		 * be other than at the time the command failed. We need to look at the Endpoint
+		 * Context and driver state to figure out what went wrong and what to do next.
+		 *
+		 * Many HCs are kind enough to hide their internal state transitions from us and
+		 * never fail Stop Endpoint after a doorbell ring. Others, including vendors like
+		 * NEC, ASMedia or Intel, may fail the command and begin running microseconds or
+		 * even milliseconds later (up to an ESIT on NEC periodic endpoints).
+		 *
+		 * We may see Running or Halted state after the command really failed on Stopped.
+		 * We may see Stopped after it failed on Halted, if somebody resets the endpoint.
+		 */
+
 		switch (GET_EP_CTX_STATE(ep_ctx)) {
 		case EP_STATE_HALTED:
 			xhci_dbg(xhci, "Stop ep completion raced with stall\n");
@@ -1222,8 +1223,23 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			 */
 			if (ep->ep_state & EP_HALTED)
 				goto reset_done;
-
+			/*
+			 * Maybe reset failed with -ENOMEM. We are paranoid that a buggy HC may
+			 * mishandle the race and produce no transfer event before this command
+			 * completion. Or the endpoint actually was restarting, rejected Stop
+			 * Endpoint, then started and halted. The transfer event may only come
+			 * after this completion and some ASMedia HCs don't report such events.
+			 *
+			 * Try to reset the host endpoint now. Locate the halted TD, update its
+			 * status and cancel it. Reset EP completion takes care of the rest.
+			 *
+			 * Proper status is unknown here, it would be -EPIPE if the device side
+			 * endpoint stalled and -EPROTO otherwise (Transaction Error, etc).
+			 * We use -EPROTO, if device is stalled it should keep returning STALL
+			 * on future transfers which will hopefully receive normal handling.
+			 */
 			if (ep->ep_state & EP_HAS_STREAMS) {
+				/* We don't know which stream failed, attempt Soft Retry */
 				reset_type = EP_SOFT_RESET;
 			} else {
 				reset_type = EP_HARD_RESET;
@@ -1231,39 +1247,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 				if (td)
 					td->status = -EPROTO;
 			}
-			/* reset ep, reset handler cleans up cancelled tds */
 			err = xhci_handle_halted_endpoint(xhci, ep, td, reset_type);
 			xhci_dbg(xhci, "Stop ep completion resetting ep, status %d\n", err);
 			if (err)
+				/* persistent problem or disconnection, we must give back TDs */
 				break;
 reset_done:
 			/* Reset EP handler will clean up cancelled TDs */
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
+
 		case EP_STATE_STOPPED:
 			/*
-			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
-			 * EP is a Context State Error, and EP stays Stopped.
-			 *
-			 * But maybe it failed on Halted, and somebody ran Reset
-			 * Endpoint later. EP state is now Stopped and EP_HALTED
-			 * still set because Reset EP handler will run after us.
+			 * Maybe the command failed in Halted state and the transfer event queued
+			 * Reset Endpoint. The endpoint is now Stopped and EP_HALTED is still set,
+			 * because Reset Endpoint completion handler will run after this one.
 			 */
 			if (ep->ep_state & EP_HALTED)
 				break;
 			/*
-			 * On some HCs EP state remains Stopped for some tens of
-			 * us to a few ms or more after a doorbell ring, and any
-			 * new Stop Endpoint fails without aborting the restart.
-			 * This handler may run quickly enough to still see this
-			 * Stopped state, but it will soon change to Running.
-			 *
-			 * Assume this bug on unexpected Stop Endpoint failures.
-			 * Keep retrying until the EP starts and stops again.
+			 * We don't queue Stop Endpoint on Stopped endpoints. Assume the pending
+			 * restart case and keep retrying until it starts and stops again.
 			 */
 			fallthrough;
+
 		case EP_STATE_RUNNING:
-			/* Race, HW handled stop ep cmd before ep was running */
 			xhci_dbg(xhci, "Stop ep completion ctx error, ctx_state %d\n",
 					GET_EP_CTX_STATE(ep_ctx));
 			/*
-- 
2.48.1

usb: xhci: Don't change the status of stalled TDs on failed Stop EP

Posted by Michal Pecio 11 months ago

When the device stalls an endpoint, current TD is assigned -EPIPE
status and Reset Endpoint is queued. If a Stop Endpoint is pending
at the time, it will run before Reset Endpoint and fail due to the
stall. Its handler will change TD's status to -EPROTO before Reset
Endpoint handler runs and initiates giveback.

Check if the stall has already been handled and don't try to do it
again. Since xhci_handle_halted_endpoint() performs this check too,
not overwriting td->status is the only difference.

I haven't seen this case yet, but I have seen a related one where
the xHC has already executed Reset Endpoint, EP Context state is
now Stopped and EP_HALTED is set. If the xHC took a bit longer to
execute Reset Endpoint, said case would become this one.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 9e4940220252..28ebc0fc5bc2 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1215,7 +1215,14 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	 */
 		switch (GET_EP_CTX_STATE(ep_ctx)) {
 		case EP_STATE_HALTED:
-			xhci_dbg(xhci, "Stop ep completion raced with stall, reset ep\n");
+			xhci_dbg(xhci, "Stop ep completion raced with stall\n");
+			/*
+			 * If the halt happened before Stop Endpoint failed, its transfer event
+			 * should have already been handled and Reset Endpoint should be pending.
+			 */
+			if (ep->ep_state & EP_HALTED)
+				goto reset_done;
+
 			if (ep->ep_state & EP_HAS_STREAMS) {
 				reset_type = EP_SOFT_RESET;
 			} else {
@@ -1226,8 +1233,11 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			}
 			/* reset ep, reset handler cleans up cancelled tds */
 			err = xhci_handle_halted_endpoint(xhci, ep, td, reset_type);
+			xhci_dbg(xhci, "Stop ep completion resetting ep, status %d\n", err);
 			if (err)
 				break;
+reset_done:
+			/* Reset EP handler will clean up cancelled TDs */
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
 		case EP_STATE_STOPPED:
-- 
2.48.1

Re: usb: xhci: Don't change the status of stalled TDs on failed Stop EP

Posted by Michal Pecio 11 months ago

Sorry for this one, missing [PATCH] tag in the subject.
I resent it corectly.