drivers/base/dd.c | 1 + 1 file changed, 1 insertion(+)
The deferred_probe_timeout_work may be canceled forever unexpected when
deferred_probe_extend_timeout() executes concurrently. Start with
deferred_probe_timeout_work pending, and the problem would
occur after the following sequence.
CPU0 CPU1
deferred_probe_extend_timeout
-> cancel_delayed_work => true
deferred_probe_extend_timeout
-> cancel_delayed_wrok
-> __cancel_work
-> try_grab_pending
-> schedule_delayed_work
-> queue_delayed_work_on
since pending bit is grabbed,
just return without doing anything
-> set_work_pool_and_clear_pending
this __cancel_work return false and
the work would never be queued again
The root cause is that the PENDING_BIT of the work_struct would be set
temporaily in __cancel_work and this bit could prevent the work_struct
to be queued in another CPU.
Use deferred_probe_mutex to protect the cancel and queue operations for
the deferred_probe_timeout_work to fix this problem.
Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
Cc: <stable@vger.kernel.org>
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
---
drivers/base/dd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 13ab98e033ea..00419d2ee910 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
void deferred_probe_extend_timeout(void)
{
+ guard(mutex)(&deferred_probe_mutex);
/*
* If the work hasn't been queued yet or if the work expired, don't
* start a new one.
--
2.22.0
On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote: > The deferred_probe_timeout_work may be canceled forever unexpected when > deferred_probe_extend_timeout() executes concurrently. Start with > deferred_probe_timeout_work pending, and the problem would > occur after the following sequence. > > CPU0 CPU1 > deferred_probe_extend_timeout > -> cancel_delayed_work => true > deferred_probe_extend_timeout > -> cancel_delayed_wrok > -> __cancel_work > -> try_grab_pending > -> schedule_delayed_work > -> queue_delayed_work_on > since pending bit is grabbed, > just return without doing anything > -> set_work_pool_and_clear_pending > this __cancel_work return false and > the work would never be queued again > > The root cause is that the PENDING_BIT of the work_struct would be set > temporaily in __cancel_work and this bit could prevent the work_struct > to be queued in another CPU. > > Use deferred_probe_mutex to protect the cancel and queue operations for > the deferred_probe_timeout_work to fix this problem. > > Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration") > Cc: <stable@vger.kernel.org> > Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> > --- > drivers/base/dd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 13ab98e033ea..00419d2ee910 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_ > > void deferred_probe_extend_timeout(void) > { > + guard(mutex)(&deferred_probe_mutex); But if you grab the lock here, in the probe timeout function, the lock will be grabbed again, causing a deadlock, right? If not, why not? Have you run this patch with lockdep enabled? This feels broken to me, what am I missing? thanks, greg k-h
在 2025/8/14 22:19, Greg KH 写道: > On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote: >> The deferred_probe_timeout_work may be canceled forever unexpected when >> deferred_probe_extend_timeout() executes concurrently. Start with >> deferred_probe_timeout_work pending, and the problem would >> occur after the following sequence. >> >> CPU0 CPU1 >> deferred_probe_extend_timeout >> -> cancel_delayed_work => true >> deferred_probe_extend_timeout >> -> cancel_delayed_wrok >> -> __cancel_work >> -> try_grab_pending >> -> schedule_delayed_work >> -> queue_delayed_work_on >> since pending bit is grabbed, >> just return without doing anything >> -> set_work_pool_and_clear_pending >> this __cancel_work return false and >> the work would never be queued again >> >> The root cause is that the PENDING_BIT of the work_struct would be set >> temporaily in __cancel_work and this bit could prevent the work_struct >> to be queued in another CPU. >> >> Use deferred_probe_mutex to protect the cancel and queue operations for >> the deferred_probe_timeout_work to fix this problem. >> >> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration") >> Cc: <stable@vger.kernel.org> >> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> >> --- >> drivers/base/dd.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/base/dd.c b/drivers/base/dd.c >> index 13ab98e033ea..00419d2ee910 100644 >> --- a/drivers/base/dd.c >> +++ b/drivers/base/dd.c >> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_ >> >> void deferred_probe_extend_timeout(void) >> { >> + guard(mutex)(&deferred_probe_mutex); > > But if you grab the lock here, in the probe timeout function, the lock > will be grabbed again, causing a deadlock, right? If not, why not? It's not a sync version of cancel_work, so the execuation of the work function doesn't block us here, nor does the schedule_delayed_work does. Indead, deferred_probe_mutex is used to protect the deferred_probe_*_list, it looks better to use a new lock here. Right? > > Have you run this patch with lockdep enabled? > > This feels broken to me, what am I missing? > > thanks, > > greg k-h >
On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote: > > The deferred_probe_timeout_work may be canceled forever unexpected when > > deferred_probe_extend_timeout() executes concurrently. Start with > > deferred_probe_timeout_work pending, and the problem would > > occur after the following sequence. > > > > CPU0 CPU1 > > deferred_probe_extend_timeout > > -> cancel_delayed_work => true > > deferred_probe_extend_timeout > > -> cancel_delayed_wrok > > -> __cancel_work > > -> try_grab_pending > > -> schedule_delayed_work > > -> queue_delayed_work_on > > since pending bit is grabbed, > > just return without doing anything > > -> set_work_pool_and_clear_pending > > this __cancel_work return false and > > the work would never be queued again > > > > The root cause is that the PENDING_BIT of the work_struct would be set > > temporaily in __cancel_work and this bit could prevent the work_struct > > to be queued in another CPU. This feels more like a workqueue API issue (this isn't too obvious from the documentation) or me misusing the workqueue API. Is this issue still there if you use cancel_delayed_work_sync() instead of cancel_delayed_work()? If so, just switch to that and add proper comment on why it needs to by "sync". -Saravana > > > > Use deferred_probe_mutex to protect the cancel and queue operations for > > the deferred_probe_timeout_work to fix this problem. > > > > Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration") > > Cc: <stable@vger.kernel.org> > > Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> > > --- > > drivers/base/dd.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > > index 13ab98e033ea..00419d2ee910 100644 > > --- a/drivers/base/dd.c > > +++ b/drivers/base/dd.c > > @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_ > > > > void deferred_probe_extend_timeout(void) > > { > > + guard(mutex)(&deferred_probe_mutex); > > But if you grab the lock here, in the probe timeout function, the lock > will be grabbed again, causing a deadlock, right? If not, why not? > > Have you run this patch with lockdep enabled? > > This feels broken to me, what am I missing? > > thanks, > > greg k-h
在 2025/8/15 2:16, Saravana Kannan 写道: > On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@linuxfoundation.org> wrote: >> >> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote: >>> The deferred_probe_timeout_work may be canceled forever unexpected when >>> deferred_probe_extend_timeout() executes concurrently. Start with >>> deferred_probe_timeout_work pending, and the problem would >>> occur after the following sequence. >>> >>> CPU0 CPU1 >>> deferred_probe_extend_timeout >>> -> cancel_delayed_work => true >>> deferred_probe_extend_timeout >>> -> cancel_delayed_wrok >>> -> __cancel_work >>> -> try_grab_pending >>> -> schedule_delayed_work >>> -> queue_delayed_work_on >>> since pending bit is grabbed, >>> just return without doing anything >>> -> set_work_pool_and_clear_pending >>> this __cancel_work return false and >>> the work would never be queued again >>> >>> The root cause is that the PENDING_BIT of the work_struct would be set >>> temporaily in __cancel_work and this bit could prevent the work_struct >>> to be queued in another CPU. > > This feels more like a workqueue API issue (this isn't too obvious > from the documentation) or me misusing the workqueue API. > > Is this issue still there if you use cancel_delayed_work_sync() > instead of cancel_delayed_work()? If so, just switch to that and add > proper comment on why it needs to by "sync". > > -Saravana > cancel_delayed_work_sync() cannot solve the issue. Becasue this issue is to do with the interaction between cancel and queue operations for a work. The synchronization of the single cancel operation doesn't matter. >>> >>> Use deferred_probe_mutex to protect the cancel and queue operations for >>> the deferred_probe_timeout_work to fix this problem. >>> >>> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration") >>> Cc: <stable@vger.kernel.org> >>> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> >>> --- >>> drivers/base/dd.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/base/dd.c b/drivers/base/dd.c >>> index 13ab98e033ea..00419d2ee910 100644 >>> --- a/drivers/base/dd.c >>> +++ b/drivers/base/dd.c >>> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_ >>> >>> void deferred_probe_extend_timeout(void) >>> { >>> + guard(mutex)(&deferred_probe_mutex); >> >> But if you grab the lock here, in the probe timeout function, the lock >> will be grabbed again, causing a deadlock, right? If not, why not? >> >> Have you run this patch with lockdep enabled? >> >> This feels broken to me, what am I missing? >> >> thanks, >> >> greg k-h >
© 2016 - 2025 Red Hat, Inc.