[v2] usbhid: tolerate intermittent errors

[PATCH v2] usbhid: tolerate intermittent errors

Posted by Liam Mitchell 1 month ago

Modifies usbhid error handling to tolerate intermittent protocol
errors, avoiding URB resubmission delay and device reset.

---
Protocol errors like EPROTO can occur randomly, sometimes frequently and are often not fixed by a device reset.

The current error handling will only resubmit the URB after at least 13ms delay and may reset the USB device if another error occurs 1-1.5s later, regardless of error type or count.

These delays and device resets increase the chance that input events will be missed and that users see symptoms like missed or sticky keyboard keys.

This patch allows one protocol error per 500ms to be tolerated and have the URB re-submitted immediately.

500ms was chosen to be well above the error rate of a malfunctioning
device but low enough to be useful for users with devices noisier than
mine.

Signed-off-by: Liam Mitchell <mitchell.liam@gmail.com>
Link: https://lore.kernel.org/linux-input/CAOQ1CL6Q+4GNy=kgisLzs0UBXFT3b02PG8t-0rPuW-Wf6NhQ6g@mail.gmail.com/
---
Changes in v2:
- revert changes to hid_io_error
- add more specific fix in hid_irq_in
- Link to v1: https://lore.kernel.org/r/20260208-usbhid-eproto-v1-1-5872c10d90bb@gmail.com
---
 drivers/hid/usbhid/hid-core.c | 5 +++++
 drivers/hid/usbhid/usbhid.h   | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c
index 758eb21430cd..939e095eddfe 100644
--- a/drivers/hid/usbhid/hid-core.c
+++ b/drivers/hid/usbhid/hid-core.c
@@ -313,6 +313,11 @@ static void hid_irq_in(struct urb *urb)
 	case -ETIME:		/* protocol error or unplug */
 	case -ETIMEDOUT:	/* Should never happen, but... */
 		usbhid_mark_busy(usbhid);
+		/* Tolerate intermittent protocol errors */
+		if (time_after(jiffies, usbhid->last_proto_error + msecs_to_jiffies(500))) {
+			usbhid->last_proto_error = jiffies;
+			break;
+		}
 		clear_bit(HID_IN_RUNNING, &usbhid->iofl);
 		hid_io_error(hid);
 		return;
diff --git a/drivers/hid/usbhid/usbhid.h b/drivers/hid/usbhid/usbhid.h
index 75fe85d3d27a..6aab9101fe34 100644
--- a/drivers/hid/usbhid/usbhid.h
+++ b/drivers/hid/usbhid/usbhid.h
@@ -86,6 +86,7 @@ struct usbhid_device {
 	struct timer_list io_retry;                                     /* Retry timer */
 	unsigned long stop_retry;                                       /* Time to give up, in jiffies */
 	unsigned int retry_delay;                                       /* Delay length in ms */
+	unsigned long last_proto_error;					/* Last protocol error time, in jiffies */
 	struct work_struct reset_work;                                  /* Task context for resets */
 	wait_queue_head_t wait;						/* For sleeping */
 };

---
base-commit: b91e36222ccfb1b0985d1fcc4fb13b68fb99c972
change-id: 20260208-usbhid-eproto-152c7abcb185

Best regards,
-- 
Liam Mitchell <mitchell.liam@gmail.com>

Re: [PATCH v2] usbhid: tolerate intermittent errors

Posted by Alan Stern 1 month ago

On Sat, Mar 07, 2026 at 07:57:09PM +0100, Liam Mitchell wrote:
> Modifies usbhid error handling to tolerate intermittent protocol
> errors, avoiding URB resubmission delay and device reset.
> 
> ---
> Protocol errors like EPROTO can occur randomly, sometimes frequently and are often not fixed by a device reset.
> 
> The current error handling will only resubmit the URB after at least 13ms delay and may reset the USB device if another error occurs 1-1.5s later, regardless of error type or count.
> 
> These delays and device resets increase the chance that input events will be missed and that users see symptoms like missed or sticky keyboard keys.
> 
> This patch allows one protocol error per 500ms to be tolerated and have the URB re-submitted immediately.
> 
> 500ms was chosen to be well above the error rate of a malfunctioning
> device but low enough to be useful for users with devices noisier than
> mine.
> 
> Signed-off-by: Liam Mitchell <mitchell.liam@gmail.com>
> Link: https://lore.kernel.org/linux-input/CAOQ1CL6Q+4GNy=kgisLzs0UBXFT3b02PG8t-0rPuW-Wf6NhQ6g@mail.gmail.com/
> ---

Liam, take a look at

	https://bugzilla.kernel.org/show_bug.cgi?id=221184

On Roman's system, these protocol errors occur fairly regularly, 
apparently caused by high system load.

Do you think a better approach might be to reduce the 13-ms delay to 
just 1 or 2 ms, and perform a reset only there has been no successful 
communication for about one second?  This might perhaps be _too_ lenient 
sometimes, but I don't think such situations will arise in practice.

The reason for having at least a small delay is to avoid getting into a 
tight resubmit/error loop in cases where the device has been unplugged.

Alan Stern

Re: [PATCH v2] usbhid: tolerate intermittent errors

Posted by Liam Mitchell 1 month ago

On Sat, 7 Mar 2026 at 23:53, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Sat, Mar 07, 2026 at 07:57:09PM +0100, Liam Mitchell wrote:
> > Modifies usbhid error handling to tolerate intermittent protocol
> > errors, avoiding URB resubmission delay and device reset.
> >
> > ---
> > Protocol errors like EPROTO can occur randomly, sometimes frequently and are often not fixed by a device reset.
> >
> > The current error handling will only resubmit the URB after at least 13ms delay and may reset the USB device if another error occurs 1-1.5s later, regardless of error type or count.
> >
> > These delays and device resets increase the chance that input events will be missed and that users see symptoms like missed or sticky keyboard keys.
> >
> > This patch allows one protocol error per 500ms to be tolerated and have the URB re-submitted immediately.
> >
> > 500ms was chosen to be well above the error rate of a malfunctioning
> > device but low enough to be useful for users with devices noisier than
> > mine.
> >
> > Signed-off-by: Liam Mitchell <mitchell.liam@gmail.com>
> > Link: https://lore.kernel.org/linux-input/CAOQ1CL6Q+4GNy=kgisLzs0UBXFT3b02PG8t-0rPuW-Wf6NhQ6g@mail.gmail.com/
> > ---
>
> Liam, take a look at
>
>         https://bugzilla.kernel.org/show_bug.cgi?id=221184
>
> On Roman's system, these protocol errors occur fairly regularly,
> apparently caused by high system load.

Thanks Alan, I commented there.

> Do you think a better approach might be to reduce the 13-ms delay to
> just 1 or 2 ms, and perform a reset only there has been no successful
> communication for about one second?  This might perhaps be _too_ lenient
> sometimes, but I don't think such situations will arise in practice.

I would prefer to have at least the first error resubmit immediately.
I want to reduce device downtime and missed events. From that
perspective, I want to assume the error is intermittent unless we see
evidence otherwise.

The current reset logic "reset only if there has been no successful
communication for one second" is problematic because there is no sign
of successful communication if the user isn't pressing keys or moving
the mouse. Two EPROTO errors 1.4 seconds apart will trigger device
reset and 100-200ms of downtime when ideally URBs would be immediately
resubmitted with only a few ms of downtime.

Can we infer from not receiving errors that we have successful
communication? That might change the equation. If we don't receive
errors for say 10x the polling interval, can we assume it is working?

Ideally the reset is only triggered when we are very sure the device
is not working and needs it.

> The reason for having at least a small delay is to avoid getting into a
> tight resubmit/error loop in cases where the device has been unplugged.
>
> Alan Stern

This patch would only allow one immediate resubmission per window
(500ms). How costly is a URB submission? I was assuming they are
relatively cheap and even one per 100ms wouldn't cause problems.

Liam

Re: [PATCH v2] usbhid: tolerate intermittent errors

Posted by Alan Stern 1 month ago

On Sun, Mar 08, 2026 at 02:48:42PM +0100, Liam Mitchell wrote:
> On Sat, 7 Mar 2026 at 23:53, Alan Stern <stern@rowland.harvard.edu> wrote:
> > Do you think a better approach might be to reduce the 13-ms delay to
> > just 1 or 2 ms, and perform a reset only there has been no successful
> > communication for about one second?  This might perhaps be _too_ lenient
> > sometimes, but I don't think such situations will arise in practice.
> 
> I would prefer to have at least the first error resubmit immediately.
> I want to reduce device downtime and missed events. From that
> perspective, I want to assume the error is intermittent unless we see
> evidence otherwise.

Okay; a single immediate resubmission won't cause any problems.

> The current reset logic "reset only if there has been no successful
> communication for one second" is problematic because there is no sign
> of successful communication if the user isn't pressing keys or moving
> the mouse. Two EPROTO errors 1.4 seconds apart will trigger device
> reset and 100-200ms of downtime when ideally URBs would be immediately
> resubmitted with only a few ms of downtime.
> 
> Can we infer from not receiving errors that we have successful
> communication? That might change the equation. If we don't receive
> errors for say 10x the polling interval, can we assume it is working?

Pretty much, yes.  If the communication is not working at all (for 
example, if the device was unplugged) then an interrupt URB will fail 
within three polling intervals.  10 intervals seems like a reasonable 
limit.

> Ideally the reset is only triggered when we are very sure the device
> is not working and needs it.

Agreed.  I don't know how frequently the bad states that HID devices get 
into can be fixed by a reset, but I suspect it's not very frequent at 
all.

> > The reason for having at least a small delay is to avoid getting into a
> > tight resubmit/error loop in cases where the device has been unplugged.
> >
> > Alan Stern
> 
> This patch would only allow one immediate resubmission per window
> (500ms). How costly is a URB submission? I was assuming they are
> relatively cheap and even one per 100ms wouldn't cause problems.

This problem mainly shows up in syzbot testing.  Submission isn't all 
that expensive, but in the virtual environment used by syzbot, failure 
occurs during or shortly after submission.  If resubmission is then 
immediate after failure, the whole thing becomes an unending tight loop 
executing mostly in atomic context, which ties up a CPU long enough to 
trigger a warning about a possible kernel hang.

Alan Stern