From: John Ernberg
To: Oliver Neukum, Andrew Lunn, "David S. Miller", Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
CC: Greg Kroah-Hartman, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-kernel@vger.kernel.org, John Ernberg, stable@vger.kernel.org
Subject: [PATCH net v2] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
Date: Wed, 23 Jul 2025 10:25:35 +0000
Message-ID: <20250723102526.1305339-1-john.ernberg@actia.se>

The Gemalto Cinterion PLS83-W modem (cdc_ether) emits confusing link up
and down events when the WWAN interface is activated on the modem side.
In consecutive polls the interrupt URBs report:

 * Link Connected
 * Link Disconnected
 * Link Connected

where the last Connected is the stable link state.

When the system is under load this may cause the unlink_urbs() work in
__handle_link_change() to not complete before the next
usbnet_link_change() call turns the carrier on again, allowing
rx_submit() to queue new SKBs. In that event the URB queue is filled
faster than it can drain, ending up in an RCU stall:
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 33108 jiffies s: 201 root: 0x1/.
rcu: blocking rcu_node structures (internal RCU debug):
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
Call trace:
 arch_local_irq_enable+0x4/0x8
 local_bh_enable+0x18/0x20
 __netdev_alloc_skb+0x18c/0x1cc
 rx_submit+0x68/0x1f8 [usbnet]
 rx_alloc_submit+0x4c/0x74 [usbnet]
 usbnet_bh+0x1d8/0x218 [usbnet]
 usbnet_bh_tasklet+0x10/0x18 [usbnet]
 tasklet_action_common+0xa8/0x110
 tasklet_action+0x2c/0x34
 handle_softirqs+0x2cc/0x3a0
 __do_softirq+0x10/0x18
 ____do_softirq+0xc/0x14
 call_on_irq_stack+0x24/0x34
 do_softirq_own_stack+0x18/0x20
 __irq_exit_rcu+0xa8/0xb8
 irq_exit_rcu+0xc/0x30
 el1_interrupt+0x34/0x48
 el1h_64_irq_handler+0x14/0x1c
 el1h_64_irq+0x68/0x6c
 _raw_spin_unlock_irqrestore+0x38/0x48
 xhci_urb_dequeue+0x1ac/0x45c [xhci_hcd]
 unlink1+0xd4/0xdc [usbcore]
 usb_hcd_unlink_urb+0x70/0xb0 [usbcore]
 usb_unlink_urb+0x24/0x44 [usbcore]
 unlink_urbs.constprop.0.isra.0+0x64/0xa8 [usbnet]
 __handle_link_change+0x34/0x70 [usbnet]
 usbnet_deferred_kevent+0x1c0/0x320 [usbnet]
 process_scheduled_works+0x2d0/0x48c
 worker_thread+0x150/0x1dc
 kthread+0xd8/0xe8
 ret_from_fork+0x10/0x20

Work around the problem by deferring the carrier on to the scheduled
work. This needs a new flag to keep track of the necessary action.

The carrier ok check cannot be removed as it remains required for the
LINK_RESET event flow.

Fixes: 4b49f58fff00 ("usbnet: handle link change")
Cc: stable@vger.kernel.org
Signed-off-by: John Ernberg
---

I've been testing this quite aggressively overnight, and it seems
equally stable to my first approach.

I'm a little concerned that the flag handling can still race (although
in a much smaller window) in the opposite direction: a carrier off can
occur between test_and_clear_bit() and the carrier on action in the
handler, leaving the carrier on when it shouldn't be.

v2:
 - Target tree in patch description.
 - Drop Ming Lei from the address list as their address bounces.
 - Rework solution based on feedback by Jakub (let me know if you want
   a Suggested-by tag, if we're keeping this direction)

v1: https://lore.kernel.org/netdev/20250710085028.1070922-1-john.ernberg@actia.se/

Tested on 6.12.20 and forward ported. Stack trace from 6.12.20.
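To make that residual window concrete, below is a minimal single-threaded
userspace model (plain C11, not kernel code). carrier, link_event() and
worker_test_and_clear() are stand-ins for netif_carrier_on()/off(),
usbnet_link_change() and the test_and_clear_bit() half of
__handle_link_change(); the interleaving is forced by hand purely for
illustration and does not claim to reproduce real scheduling:

/* Model of the remaining race window: the worker has already consumed
 * EVENT_LINK_CARRIER_ON when a link-down event clears the (already
 * clear) flag and turns the carrier off, after which the worker still
 * turns the carrier back on.
 *
 * Build: gcc -std=c11 -o carrier-race carrier-race.c && ./carrier-race
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define EVENT_LINK_CARRIER_ON	14UL

static atomic_ulong flags;
static bool carrier;			/* stand-in for netif_carrier_ok() */

/* models usbnet_link_change(dev, link, false) after this patch */
static void link_event(bool link)
{
	if (link) {
		atomic_fetch_or(&flags, 1UL << EVENT_LINK_CARRIER_ON);
	} else {
		atomic_fetch_and(&flags, ~(1UL << EVENT_LINK_CARRIER_ON));
		carrier = false;	/* netif_carrier_off() */
	}
}

/* models the test_and_clear_bit() in __handle_link_change() */
static bool worker_test_and_clear(void)
{
	unsigned long old = atomic_fetch_and(&flags,
					     ~(1UL << EVENT_LINK_CARRIER_ON));

	return old & (1UL << EVENT_LINK_CARRIER_ON);
}

int main(void)
{
	link_event(true);			/* link up, flag gets set */

	bool turn_on = worker_test_and_clear();	/* worker consumes the flag */
	link_event(false);			/* link drops right here */
	if (turn_on)
		carrier = true;			/* netif_carrier_on() runs anyway */

	printf("carrier is %s although the last event was link down\n",
	       carrier ? "ON" : "off");
	return 0;
}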
---
 drivers/net/usb/usbnet.c   | 11 ++++++++---
 include/linux/usb/usbnet.h |  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index c04e715a4c2a..bc1d8631ffe0 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1122,6 +1122,9 @@ static void __handle_link_change(struct usbnet *dev)
 		 * tx queue is stopped by netcore after link becomes off
 		 */
 	} else {
+		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
+			netif_carrier_on(dev->net);
+
 		/* submitting URBs for reading packets */
 		tasklet_schedule(&dev->bh);
 	}
@@ -2009,10 +2012,12 @@ EXPORT_SYMBOL(usbnet_manage_power);
 void usbnet_link_change(struct usbnet *dev, bool link, bool need_reset)
 {
 	/* update link after link is reseted */
-	if (link && !need_reset)
-		netif_carrier_on(dev->net);
-	else
+	if (link && !need_reset) {
+		set_bit(EVENT_LINK_CARRIER_ON, &dev->flags);
+	} else {
+		clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags);
 		netif_carrier_off(dev->net);
+	}
 
 	if (need_reset && link)
 		usbnet_defer_kevent(dev, EVENT_LINK_RESET);
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 0b9f1e598e3a..4bc6bb01a0eb 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -76,6 +76,7 @@ struct usbnet {
 #		define EVENT_LINK_CHANGE	11
 #		define EVENT_SET_RX_MODE	12
 #		define EVENT_NO_IP_ALIGN	13
+#		define EVENT_LINK_CARRIER_ON	14
 /* This one is special, as it indicates that the device is going away
  * there are cyclic dependencies between tasklet, timer and bh
  * that must be broken
-- 
2.49.0