[PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()

Fedor Pchelkin posted 4 patches 2 weeks, 1 day ago
There is a newer version of this series
[PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()
Posted by Fedor Pchelkin 2 weeks, 1 day ago
Lockdep gives a splat [1] when ser_hdl_work item is executed.  It is
scheduled at mac80211 workqueue via ieee80211_queue_work() and takes a
wiphy lock inside.  However, this workqueue can be flushed when e.g.
closing the interface and wiphy lock is already taken in that case.

Choosing wiphy_work_queue() for SER is likely not suitable.  Back on to
the global workqueue.

[1]:

 WARNING: possible circular locking dependency detected
 6.17.0-rc2 #17 Not tainted
 ------------------------------------------------------
 kworker/u32:1/61 is trying to acquire lock:
 ffff88811bc00768 (&rdev->wiphy.mtx){+.+.}-{4:4}, at: ser_state_run+0x5e/0x180 [rtw89_core]

 but task is already holding lock:
 ffffc9000048fd30 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}, at: process_one_work+0x7b5/0x1450

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}:
        process_one_work+0x7c6/0x1450
        worker_thread+0x49e/0xd00
        kthread+0x313/0x640
        ret_from_fork+0x221/0x300
        ret_from_fork_asm+0x1a/0x30

 -> #1 ((wq_completion)phy0){+.+.}-{0:0}:
        touch_wq_lockdep_map+0x8e/0x180
        __flush_workqueue+0x129/0x10d0
        ieee80211_stop_device+0xa8/0x110
        ieee80211_do_stop+0x14ce/0x2880
        ieee80211_stop+0x13a/0x2c0
        __dev_close_many+0x18f/0x510
        __dev_change_flags+0x25f/0x670
        netif_change_flags+0x7b/0x160
        do_setlink.isra.0+0x1640/0x35d0
        rtnl_newlink+0xd8c/0x1d30
        rtnetlink_rcv_msg+0x700/0xb80
        netlink_rcv_skb+0x11d/0x350
        netlink_unicast+0x49a/0x7a0
        netlink_sendmsg+0x759/0xc20
        ____sys_sendmsg+0x812/0xa00
        ___sys_sendmsg+0xf7/0x180
        __sys_sendmsg+0x11f/0x1b0
        do_syscall_64+0xbb/0x360
        entry_SYSCALL_64_after_hwframe+0x77/0x7f

 -> #0 (&rdev->wiphy.mtx){+.+.}-{4:4}:
        __lock_acquire+0x124c/0x1d20
        lock_acquire+0x154/0x2e0
        __mutex_lock+0x17b/0x12f0
        ser_state_run+0x5e/0x180 [rtw89_core]
        rtw89_ser_hdl_work+0x119/0x220 [rtw89_core]
        process_one_work+0x82d/0x1450
        worker_thread+0x49e/0xd00
        kthread+0x313/0x640
        ret_from_fork+0x221/0x300
        ret_from_fork_asm+0x1a/0x30

 other info that might help us debug this:

 Chain exists of:
   &rdev->wiphy.mtx --> (wq_completion)phy0 --> (work_completion)(&ser->ser_hdl_work)

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock((work_completion)(&ser->ser_hdl_work));
                                lock((wq_completion)phy0);
                                lock((work_completion)(&ser->ser_hdl_work));
   lock(&rdev->wiphy.mtx);

  *** DEADLOCK ***

 2 locks held by kworker/u32:1/61:
  #0: ffff888103835148 ((wq_completion)phy0){+.+.}-{0:0}, at: process_one_work+0xefa/0x1450
  #1: ffffc9000048fd30 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}, at: process_one_work+0x7b5/0x1450

 stack backtrace:
 CPU: 0 UID: 0 PID: 61 Comm: kworker/u32:1 Not tainted 6.17.0-rc2 #17 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS edk2-20250523-14.fc42 05/23/2025
 Workqueue: phy0 rtw89_ser_hdl_work [rtw89_core]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5d/0x80
  print_circular_bug.cold+0x178/0x1be
  check_noncircular+0x14c/0x170
  __lock_acquire+0x124c/0x1d20
  lock_acquire+0x154/0x2e0
  __mutex_lock+0x17b/0x12f0
  ser_state_run+0x5e/0x180 [rtw89_core]
  rtw89_ser_hdl_work+0x119/0x220 [rtw89_core]
  process_one_work+0x82d/0x1450
  worker_thread+0x49e/0xd00
  kthread+0x313/0x640
  ret_from_fork+0x221/0x300
  ret_from_fork_asm+0x1a/0x30
  </TASK>

Found by Linux Verification Center (linuxtesting.org).

Fixes: ebfc9199df05 ("wifi: rtw89: add wiphy_lock() to work that isn't held wiphy_lock() yet")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
---
 drivers/net/wireless/realtek/rtw89/ser.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtw89/ser.c b/drivers/net/wireless/realtek/rtw89/ser.c
index fe7beff8c424..f99e179f7ff9 100644
--- a/drivers/net/wireless/realtek/rtw89/ser.c
+++ b/drivers/net/wireless/realtek/rtw89/ser.c
@@ -205,7 +205,6 @@ static void rtw89_ser_hdl_work(struct work_struct *work)
 
 static int ser_send_msg(struct rtw89_ser *ser, u8 event)
 {
-	struct rtw89_dev *rtwdev = container_of(ser, struct rtw89_dev, ser);
 	struct ser_msg *msg = NULL;
 
 	if (test_bit(RTW89_SER_DRV_STOP_RUN, ser->flags))
@@ -221,7 +220,7 @@ static int ser_send_msg(struct rtw89_ser *ser, u8 event)
 	list_add(&msg->list, &ser->msg_q);
 	spin_unlock_irq(&ser->msg_q_lock);
 
-	ieee80211_queue_work(rtwdev->hw, &ser->ser_hdl_work);
+	schedule_work(&ser->ser_hdl_work);
 	return 0;
 }
 
-- 
2.51.0
RE: [PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()
Posted by Ping-Ke Shih 2 weeks ago
Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> Lockdep gives a splat [1] when ser_hdl_work item is executed.  It is
> scheduled at mac80211 workqueue via ieee80211_queue_work() and takes a
> wiphy lock inside.  However, this workqueue can be flushed when e.g.
> closing the interface and wiphy lock is already taken in that case.
> 
> Choosing wiphy_work_queue() for SER is likely not suitable.  Back on to
> the global workqueue.
> 
> [1]:
> 
>  WARNING: possible circular locking dependency detected
>  6.17.0-rc2 #17 Not tainted
>  ------------------------------------------------------
>  kworker/u32:1/61 is trying to acquire lock:
>  ffff88811bc00768 (&rdev->wiphy.mtx){+.+.}-{4:4}, at: ser_state_run+0x5e/0x180 [rtw89_core]
> 
>  but task is already holding lock:
>  ffffc9000048fd30 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}, at:
> process_one_work+0x7b5/0x1450
> 
>  which lock already depends on the new lock.
> 
>  the existing dependency chain (in reverse order) is:
> 
>  -> #2 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}:
>         process_one_work+0x7c6/0x1450
>         worker_thread+0x49e/0xd00
>         kthread+0x313/0x640
>         ret_from_fork+0x221/0x300
>         ret_from_fork_asm+0x1a/0x30
> 
>  -> #1 ((wq_completion)phy0){+.+.}-{0:0}:
>         touch_wq_lockdep_map+0x8e/0x180
>         __flush_workqueue+0x129/0x10d0
>         ieee80211_stop_device+0xa8/0x110
>         ieee80211_do_stop+0x14ce/0x2880
>         ieee80211_stop+0x13a/0x2c0
>         __dev_close_many+0x18f/0x510
>         __dev_change_flags+0x25f/0x670
>         netif_change_flags+0x7b/0x160
>         do_setlink.isra.0+0x1640/0x35d0
>         rtnl_newlink+0xd8c/0x1d30
>         rtnetlink_rcv_msg+0x700/0xb80
>         netlink_rcv_skb+0x11d/0x350
>         netlink_unicast+0x49a/0x7a0
>         netlink_sendmsg+0x759/0xc20
>         ____sys_sendmsg+0x812/0xa00
>         ___sys_sendmsg+0xf7/0x180
>         __sys_sendmsg+0x11f/0x1b0
>         do_syscall_64+0xbb/0x360
>         entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
>  -> #0 (&rdev->wiphy.mtx){+.+.}-{4:4}:
>         __lock_acquire+0x124c/0x1d20
>         lock_acquire+0x154/0x2e0
>         __mutex_lock+0x17b/0x12f0
>         ser_state_run+0x5e/0x180 [rtw89_core]
>         rtw89_ser_hdl_work+0x119/0x220 [rtw89_core]
>         process_one_work+0x82d/0x1450
>         worker_thread+0x49e/0xd00
>         kthread+0x313/0x640
>         ret_from_fork+0x221/0x300
>         ret_from_fork_asm+0x1a/0x30
> 
>  other info that might help us debug this:
> 
>  Chain exists of:
>    &rdev->wiphy.mtx --> (wq_completion)phy0 --> (work_completion)(&ser->ser_hdl_work)
> 
>   Possible unsafe locking scenario:
> 
>         CPU0                    CPU1
>         ----                    ----
>    lock((work_completion)(&ser->ser_hdl_work));
>                                 lock((wq_completion)phy0);
>                                 lock((work_completion)(&ser->ser_hdl_work));
>    lock(&rdev->wiphy.mtx);
> 
>   *** DEADLOCK ***
> 
>  2 locks held by kworker/u32:1/61:
>   #0: ffff888103835148 ((wq_completion)phy0){+.+.}-{0:0}, at: process_one_work+0xefa/0x1450
>   #1: ffffc9000048fd30 ((work_completion)(&ser->ser_hdl_work)){+.+.}-{0:0}, at:
> process_one_work+0x7b5/0x1450
> 
>  stack backtrace:
>  CPU: 0 UID: 0 PID: 61 Comm: kworker/u32:1 Not tainted 6.17.0-rc2 #17 PREEMPT(voluntary)
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS edk2-20250523-14.fc42 05/23/2025
>  Workqueue: phy0 rtw89_ser_hdl_work [rtw89_core]
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x5d/0x80
>   print_circular_bug.cold+0x178/0x1be
>   check_noncircular+0x14c/0x170
>   __lock_acquire+0x124c/0x1d20
>   lock_acquire+0x154/0x2e0
>   __mutex_lock+0x17b/0x12f0
>   ser_state_run+0x5e/0x180 [rtw89_core]
>   rtw89_ser_hdl_work+0x119/0x220 [rtw89_core]
>   process_one_work+0x82d/0x1450
>   worker_thread+0x49e/0xd00
>   kthread+0x313/0x640
>   ret_from_fork+0x221/0x300
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
> 
> Found by Linux Verification Center (linuxtesting.org).
> 
> Fixes: ebfc9199df05 ("wifi: rtw89: add wiphy_lock() to work that isn't held wiphy_lock() yet")
> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>

Acked-by: Ping-Ke Shih <pkshih@realtek.com>

By the way, you mark this patchset as 'rtw'. Does it mean this patchset is
urgent to you? If not, it will be more smooth (avoid possible merge conflict)
if it goes via 'rtw-next'. Let me know your preference. 
Re: [PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()
Posted by Fedor Pchelkin 1 week, 6 days ago
On Thu, 18. Sep 05:52, Ping-Ke Shih wrote:
> Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> By the way, you mark this patchset as 'rtw'. Does it mean this patchset is
> urgent to you? If not, it will be more smooth (avoid possible merge conflict)
> if it goes via 'rtw-next'. Let me know your preference. 

The first patch of the series is rather urgent compared to the others
because it addresses the issue occasionally banging on a working system.
The other ones are less urgent.

TBH I'm not aware of your development process in details.  It's v6.17-rc6
at the moment.  If I target all patches at rtw-next, are they to land in
upcoming merge window for v6.18 release (a couple of weeks from now)?
If yes then no hurries from me, rtw-next is ok.
RE: [PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()
Posted by Ping-Ke Shih 1 week, 6 days ago
Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> On Thu, 18. Sep 05:52, Ping-Ke Shih wrote:
> > Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> > By the way, you mark this patchset as 'rtw'. Does it mean this patchset is
> > urgent to you? If not, it will be more smooth (avoid possible merge conflict)
> > if it goes via 'rtw-next'. Let me know your preference.
> 
> The first patch of the series is rather urgent compared to the others
> because it addresses the issue occasionally banging on a working system.
> The other ones are less urgent.
> 
> TBH I'm not aware of your development process in details.  It's v6.17-rc6
> at the moment.  If I target all patches at rtw-next, are they to land in
> upcoming merge window for v6.18 release (a couple of weeks from now)?
> If yes then no hurries from me, rtw-next is ok.

It's v6.17-rc6 (-rc eve), so I don't plan to send a pull-request.

Originally I plan to send the last pull-request to v6.18 today, so I did
review this patchset yesterday to see if I can merge it before sending. 
Since only two or three minor changes are needed, I can wait a while and
send the pull-request next Monday if you can re-spin the patchset this
weekend. 

If not, I can still merge this patchset in v6.18-rc cycle to rtw tree.
However, this might cause merge conflict with -next, so I don't prefer
this. Upper maintainers need to spend extra time to resolve conflicts.

Re: [PATCH rtw v4 4/4] wifi: rtw89: avoid circular locking dependency in ser_state_run()
Posted by Fedor Pchelkin 1 week, 6 days ago
On Fri, 19. Sep 00:46, Ping-Ke Shih wrote:
> Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> > On Thu, 18. Sep 05:52, Ping-Ke Shih wrote:
> > > Fedor Pchelkin <pchelkin@ispras.ru> wrote:
> > > By the way, you mark this patchset as 'rtw'. Does it mean this patchset is
> > > urgent to you? If not, it will be more smooth (avoid possible merge conflict)
> > > if it goes via 'rtw-next'. Let me know your preference.
> > 
> > The first patch of the series is rather urgent compared to the others
> > because it addresses the issue occasionally banging on a working system.
> > The other ones are less urgent.
> > 
> > TBH I'm not aware of your development process in details.  It's v6.17-rc6
> > at the moment.  If I target all patches at rtw-next, are they to land in
> > upcoming merge window for v6.18 release (a couple of weeks from now)?
> > If yes then no hurries from me, rtw-next is ok.
> 
> It's v6.17-rc6 (-rc eve), so I don't plan to send a pull-request.
> 
> Originally I plan to send the last pull-request to v6.18 today, so I did
> review this patchset yesterday to see if I can merge it before sending. 
> Since only two or three minor changes are needed, I can wait a while and
> send the pull-request next Monday if you can re-spin the patchset this
> weekend. 
> 
> If not, I can still merge this patchset in v6.18-rc cycle to rtw tree.
> However, this might cause merge conflict with -next, so I don't prefer
> this. Upper maintainers need to spend extra time to resolve conflicts.

One thing that I forgot to mention is about rtw89 USB part.

"BUG: KFENCE: use-after-free write in rtw89_core_tx_kick_off_and_wait"
which is fixed by the first patch of the current series is reproduced
reliably with USB HCI because there is no TX completion there yet, i.e.
rtw89_core_tx_kick_off_and_wait() always exits with a timeout and touches
skb parts which are most probably already freed by the call to
ieee80211_tx_status_irqsafe() from URB completion callback.  I've got
a dongle now and confirm the bug.  Turns out it was reported here [1] by
Bitterblue Smith as well.

[1]: https://lore.kernel.org/linux-wireless/1e5e97d4-8267-4f77-a4bf-1fe23ea40f77@gmail.com/

The first patch does avoid use-after-free bug for USB, too.  But, as
there is no TX ACK completion implemented for USB yet, tx_wait_list will
be piled up with wasted items which can't be freed due to the lack of
completion.  It's better than crash but still a problem.

Bitterblue suggested [2] implementing the missing TX completion parts for
USB to fix this entirely.  I've got a bunch of patches for it which will
send as a separate USB-series today or tomorrow.  I expect it'll require
time for review and it probably should have to be improved/reworked in
several places.  Anyway I'll send it soon so you've got a more complete
picture and some time until Monday to decide how to handle it.

[2]: https://lore.kernel.org/linux-wireless/0cb4d19b-94c7-450e-ac56-8b0d4a1d889f@gmail.com/