* [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()
@ 2025-08-14 12:29 Wang Wensheng
2025-08-14 14:19 ` Greg KH
0 siblings, 1 reply; 5+ messages in thread
From: Wang Wensheng @ 2025-08-14 12:29 UTC (permalink / raw)
To: gregkh, rafael, dakr, robh, broonie, saravanak, linux-kernel
Cc: chenjun102, wangwensheng4
The deferred_probe_timeout_work may be canceled forever unexpected when
deferred_probe_extend_timeout() executes concurrently. Start with
deferred_probe_timeout_work pending, and the problem would
occur after the following sequence.
CPU0 CPU1
deferred_probe_extend_timeout
-> cancel_delayed_work => true
deferred_probe_extend_timeout
-> cancel_delayed_wrok
-> __cancel_work
-> try_grab_pending
-> schedule_delayed_work
-> queue_delayed_work_on
since pending bit is grabbed,
just return without doing anything
-> set_work_pool_and_clear_pending
this __cancel_work return false and
the work would never be queued again
The root cause is that the PENDING_BIT of the work_struct would be set
temporaily in __cancel_work and this bit could prevent the work_struct
to be queued in another CPU.
Use deferred_probe_mutex to protect the cancel and queue operations for
the deferred_probe_timeout_work to fix this problem.
Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
Cc: <stable@vger.kernel.org>
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
---
drivers/base/dd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 13ab98e033ea..00419d2ee910 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
void deferred_probe_extend_timeout(void)
{
+ guard(mutex)(&deferred_probe_mutex);
/*
* If the work hasn't been queued yet or if the work expired, don't
* start a new one.
--
2.22.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()
2025-08-14 12:29 [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout() Wang Wensheng
@ 2025-08-14 14:19 ` Greg KH
2025-08-14 18:16 ` Saravana Kannan
2025-08-15 1:43 ` wangwensheng (C)
0 siblings, 2 replies; 5+ messages in thread
From: Greg KH @ 2025-08-14 14:19 UTC (permalink / raw)
To: Wang Wensheng
Cc: rafael, dakr, robh, broonie, saravanak, linux-kernel, chenjun102
On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
> The deferred_probe_timeout_work may be canceled forever unexpected when
> deferred_probe_extend_timeout() executes concurrently. Start with
> deferred_probe_timeout_work pending, and the problem would
> occur after the following sequence.
>
> CPU0 CPU1
> deferred_probe_extend_timeout
> -> cancel_delayed_work => true
> deferred_probe_extend_timeout
> -> cancel_delayed_wrok
> -> __cancel_work
> -> try_grab_pending
> -> schedule_delayed_work
> -> queue_delayed_work_on
> since pending bit is grabbed,
> just return without doing anything
> -> set_work_pool_and_clear_pending
> this __cancel_work return false and
> the work would never be queued again
>
> The root cause is that the PENDING_BIT of the work_struct would be set
> temporaily in __cancel_work and this bit could prevent the work_struct
> to be queued in another CPU.
>
> Use deferred_probe_mutex to protect the cancel and queue operations for
> the deferred_probe_timeout_work to fix this problem.
>
> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
> ---
> drivers/base/dd.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 13ab98e033ea..00419d2ee910 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
>
> void deferred_probe_extend_timeout(void)
> {
> + guard(mutex)(&deferred_probe_mutex);
But if you grab the lock here, in the probe timeout function, the lock
will be grabbed again, causing a deadlock, right? If not, why not?
Have you run this patch with lockdep enabled?
This feels broken to me, what am I missing?
thanks,
greg k-h
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()
2025-08-14 14:19 ` Greg KH
@ 2025-08-14 18:16 ` Saravana Kannan
2025-08-15 1:56 ` wangwensheng (C)
2025-08-15 1:43 ` wangwensheng (C)
1 sibling, 1 reply; 5+ messages in thread
From: Saravana Kannan @ 2025-08-14 18:16 UTC (permalink / raw)
To: Greg KH
Cc: Wang Wensheng, rafael, dakr, robh, broonie, linux-kernel,
chenjun102
On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
> > The deferred_probe_timeout_work may be canceled forever unexpected when
> > deferred_probe_extend_timeout() executes concurrently. Start with
> > deferred_probe_timeout_work pending, and the problem would
> > occur after the following sequence.
> >
> > CPU0 CPU1
> > deferred_probe_extend_timeout
> > -> cancel_delayed_work => true
> > deferred_probe_extend_timeout
> > -> cancel_delayed_wrok
> > -> __cancel_work
> > -> try_grab_pending
> > -> schedule_delayed_work
> > -> queue_delayed_work_on
> > since pending bit is grabbed,
> > just return without doing anything
> > -> set_work_pool_and_clear_pending
> > this __cancel_work return false and
> > the work would never be queued again
> >
> > The root cause is that the PENDING_BIT of the work_struct would be set
> > temporaily in __cancel_work and this bit could prevent the work_struct
> > to be queued in another CPU.
This feels more like a workqueue API issue (this isn't too obvious
from the documentation) or me misusing the workqueue API.
Is this issue still there if you use cancel_delayed_work_sync()
instead of cancel_delayed_work()? If so, just switch to that and add
proper comment on why it needs to by "sync".
-Saravana
> >
> > Use deferred_probe_mutex to protect the cancel and queue operations for
> > the deferred_probe_timeout_work to fix this problem.
> >
> > Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
> > ---
> > drivers/base/dd.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 13ab98e033ea..00419d2ee910 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
> >
> > void deferred_probe_extend_timeout(void)
> > {
> > + guard(mutex)(&deferred_probe_mutex);
>
> But if you grab the lock here, in the probe timeout function, the lock
> will be grabbed again, causing a deadlock, right? If not, why not?
>
> Have you run this patch with lockdep enabled?
>
> This feels broken to me, what am I missing?
>
> thanks,
>
> greg k-h
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()
2025-08-14 14:19 ` Greg KH
2025-08-14 18:16 ` Saravana Kannan
@ 2025-08-15 1:43 ` wangwensheng (C)
1 sibling, 0 replies; 5+ messages in thread
From: wangwensheng (C) @ 2025-08-15 1:43 UTC (permalink / raw)
To: Greg KH; +Cc: rafael, dakr, robh, broonie, saravanak, linux-kernel, chenjun102
在 2025/8/14 22:19, Greg KH 写道:
> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
>> The deferred_probe_timeout_work may be canceled forever unexpected when
>> deferred_probe_extend_timeout() executes concurrently. Start with
>> deferred_probe_timeout_work pending, and the problem would
>> occur after the following sequence.
>>
>> CPU0 CPU1
>> deferred_probe_extend_timeout
>> -> cancel_delayed_work => true
>> deferred_probe_extend_timeout
>> -> cancel_delayed_wrok
>> -> __cancel_work
>> -> try_grab_pending
>> -> schedule_delayed_work
>> -> queue_delayed_work_on
>> since pending bit is grabbed,
>> just return without doing anything
>> -> set_work_pool_and_clear_pending
>> this __cancel_work return false and
>> the work would never be queued again
>>
>> The root cause is that the PENDING_BIT of the work_struct would be set
>> temporaily in __cancel_work and this bit could prevent the work_struct
>> to be queued in another CPU.
>>
>> Use deferred_probe_mutex to protect the cancel and queue operations for
>> the deferred_probe_timeout_work to fix this problem.
>>
>> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
>> ---
>> drivers/base/dd.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
>> index 13ab98e033ea..00419d2ee910 100644
>> --- a/drivers/base/dd.c
>> +++ b/drivers/base/dd.c
>> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
>>
>> void deferred_probe_extend_timeout(void)
>> {
>> + guard(mutex)(&deferred_probe_mutex);
>
> But if you grab the lock here, in the probe timeout function, the lock
> will be grabbed again, causing a deadlock, right? If not, why not?
It's not a sync version of cancel_work, so the execuation of the work
function doesn't block us here, nor does the schedule_delayed_work does.
Indead, deferred_probe_mutex is used to protect the
deferred_probe_*_list, it looks better to use a new lock here. Right?
>
> Have you run this patch with lockdep enabled?
>
> This feels broken to me, what am I missing?
>
> thanks,
>
> greg k-h
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()
2025-08-14 18:16 ` Saravana Kannan
@ 2025-08-15 1:56 ` wangwensheng (C)
0 siblings, 0 replies; 5+ messages in thread
From: wangwensheng (C) @ 2025-08-15 1:56 UTC (permalink / raw)
To: Saravana Kannan, Greg KH
Cc: rafael, dakr, robh, broonie, linux-kernel, chenjun102
在 2025/8/15 2:16, Saravana Kannan 写道:
> On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>>
>> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
>>> The deferred_probe_timeout_work may be canceled forever unexpected when
>>> deferred_probe_extend_timeout() executes concurrently. Start with
>>> deferred_probe_timeout_work pending, and the problem would
>>> occur after the following sequence.
>>>
>>> CPU0 CPU1
>>> deferred_probe_extend_timeout
>>> -> cancel_delayed_work => true
>>> deferred_probe_extend_timeout
>>> -> cancel_delayed_wrok
>>> -> __cancel_work
>>> -> try_grab_pending
>>> -> schedule_delayed_work
>>> -> queue_delayed_work_on
>>> since pending bit is grabbed,
>>> just return without doing anything
>>> -> set_work_pool_and_clear_pending
>>> this __cancel_work return false and
>>> the work would never be queued again
>>>
>>> The root cause is that the PENDING_BIT of the work_struct would be set
>>> temporaily in __cancel_work and this bit could prevent the work_struct
>>> to be queued in another CPU.
>
> This feels more like a workqueue API issue (this isn't too obvious
> from the documentation) or me misusing the workqueue API.
>
> Is this issue still there if you use cancel_delayed_work_sync()
> instead of cancel_delayed_work()? If so, just switch to that and add
> proper comment on why it needs to by "sync".
>
> -Saravana
>
cancel_delayed_work_sync() cannot solve the issue. Becasue this issue is
to do with the interaction between cancel and queue operations for a
work. The synchronization of the single cancel operation doesn't matter.
>>>
>>> Use deferred_probe_mutex to protect the cancel and queue operations for
>>> the deferred_probe_timeout_work to fix this problem.
>>>
>>> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
>>> Cc: <stable@vger.kernel.org>
>>> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
>>> ---
>>> drivers/base/dd.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
>>> index 13ab98e033ea..00419d2ee910 100644
>>> --- a/drivers/base/dd.c
>>> +++ b/drivers/base/dd.c
>>> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
>>>
>>> void deferred_probe_extend_timeout(void)
>>> {
>>> + guard(mutex)(&deferred_probe_mutex);
>>
>> But if you grab the lock here, in the probe timeout function, the lock
>> will be grabbed again, causing a deadlock, right? If not, why not?
>>
>> Have you run this patch with lockdep enabled?
>>
>> This feels broken to me, what am I missing?
>>
>> thanks,
>>
>> greg k-h
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-08-15 1:56 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-14 12:29 [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout() Wang Wensheng
2025-08-14 14:19 ` Greg KH
2025-08-14 18:16 ` Saravana Kannan
2025-08-15 1:56 ` wangwensheng (C)
2025-08-15 1:43 ` wangwensheng (C)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).