[PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()

Jingyi Wang posted 7 patches 4 weeks, 1 day ago
[PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Jingyi Wang 4 weeks, 1 day ago
rproc_add() called by rproc probe function failure will tear down all
the resources including do device_del() and remove subdev etc. If
rproc_report_crash() is called in this path, the rproc_crash_handler_work
could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
be called with recovery enabled, which may cause NULL pointer dereference
if the resource has already been cleaned up.

[    5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
[    5.260499] Mem abort info:
[    5.263384]   ESR = 0x0000000096000006
[    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
[    5.272711]   SET = 0, FnV = 0
[    5.275865]   EA = 0, S1PTW = 0
[    5.279106]   FSC = 0x06: level 2 translation fault
[    5.284125] Data abort info:
[    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
[    5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
[    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
[    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
[    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
[    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
[    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
[    5.425308] sp : ffff800080ffbc90
[    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
[    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
[    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
[    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
[    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
[    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
[    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
[    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
[    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
[    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
[    5.502028] Call trace:
[    5.504549]  qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
[    5.511344]  glink_subdev_stop+0x1c/0x30 [qcom_common]
[    5.516622]  rproc_stop+0x58/0x17c
[    5.520127]  rproc_trigger_recovery+0xb0/0x150
[    5.524693]  rproc_crash_handler_work+0xa4/0xc4
[    5.529346]  process_scheduled_works+0x18c/0x2d8
[    5.534092]  worker_thread+0x144/0x280
[    5.537952]  kthread+0x124/0x138
[    5.541280]  ret_from_fork+0x10/0x20
[    5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
[    5.551224] ---[ end trace 0000000000000000 ]---

So set recovery_disabled during rproc_add().

Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
 drivers/remoteproc/remoteproc_core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index b087ed21858a..f66dde712cec 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
 {
 	struct device *dev = &rproc->dev;
 	int ret;
+	bool rproc_recovery_save;
 
+	rproc_recovery_save  = rproc->recovery_disabled;
+	rproc->recovery_disabled = true;
 	ret = rproc_validate(rproc);
 	if (ret < 0)
 		return ret;
@@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
 	list_add_rcu(&rproc->node, &rproc_list);
 	mutex_unlock(&rproc_list_mutex);
 
+	rproc->recovery_disabled = rproc_recovery_save;
 	return 0;
 
 rproc_remove_dev:

-- 
2.25.1
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Krzysztof Kozlowski 4 weeks, 1 day ago
On Tue, Mar 10, 2026 at 03:03:21AM -0700, Jingyi Wang wrote:
> rproc_add() called by rproc probe function failure will tear down all
> the resources including do device_del() and remove subdev etc. If
> rproc_report_crash() is called in this path, the rproc_crash_handler_work
> could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> be called with recovery enabled, which may cause NULL pointer dereference
> if the resource has already been cleaned up.
> 
> [    5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> [    5.260499] Mem abort info:
> [    5.263384]   ESR = 0x0000000096000006
> [    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    5.272711]   SET = 0, FnV = 0
> [    5.275865]   EA = 0, S1PTW = 0
> [    5.279106]   FSC = 0x06: level 2 translation fault
> [    5.284125] Data abort info:
> [    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> [    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> [    5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> [    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
> [    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> [    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> [    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> [    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> [    5.425308] sp : ffff800080ffbc90
> [    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
> [    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
> [    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
> [    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
> [    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
> [    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
> [    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
> [    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
> [    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
> [    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000

Please trim your commit msg from irrelevant data, so this will be easier
to read.

Best regards,
Krzysztof
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Jingyi Wang 3 weeks ago

On 3/11/2026 2:20 PM, Krzysztof Kozlowski wrote:
> On Tue, Mar 10, 2026 at 03:03:21AM -0700, Jingyi Wang wrote:
>> rproc_add() called by rproc probe function failure will tear down all
>> the resources including do device_del() and remove subdev etc. If
>> rproc_report_crash() is called in this path, the rproc_crash_handler_work
>> could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
>> be called with recovery enabled, which may cause NULL pointer dereference
>> if the resource has already been cleaned up.
>>
>> [    5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
>> [    5.260499] Mem abort info:
>> [    5.263384]   ESR = 0x0000000096000006
>> [    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [    5.272711]   SET = 0, FnV = 0
>> [    5.275865]   EA = 0, S1PTW = 0
>> [    5.279106]   FSC = 0x06: level 2 translation fault
>> [    5.284125] Data abort info:
>> [    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
>> [    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> [    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> [    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
>> [    5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
>> [    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
>> [    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
>> [    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
>> [    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>> [    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
>> [    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
>> [    5.425308] sp : ffff800080ffbc90
>> [    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
>> [    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
>> [    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
>> [    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
>> [    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
>> [    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
>> [    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
>> [    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
>> [    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
>> [    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
> 
> Please trim your commit msg from irrelevant data, so this will be easier
> to read.
> 
> Best regards,
> Krzysztof
> 

Well noted.

Thanks,
Jingyi
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Bartosz Golaszewski 4 weeks, 1 day ago
On Tue, 10 Mar 2026 11:03:21 +0100, Jingyi Wang
<jingyi.wang@oss.qualcomm.com> said:
> rproc_add() called by rproc probe function failure will tear down all
> the resources including do device_del() and remove subdev etc. If
> rproc_report_crash() is called in this path, the rproc_crash_handler_work
> could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> be called with recovery enabled, which may cause NULL pointer dereference
> if the resource has already been cleaned up.
>
> [    5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> [    5.260499] Mem abort info:
> [    5.263384]   ESR = 0x0000000096000006
> [    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    5.272711]   SET = 0, FnV = 0
> [    5.275865]   EA = 0, S1PTW = 0
> [    5.279106]   FSC = 0x06: level 2 translation fault
> [    5.284125] Data abort info:
> [    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> [    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> [    5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> [    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
> [    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> [    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> [    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> [    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> [    5.425308] sp : ffff800080ffbc90
> [    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
> [    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
> [    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
> [    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
> [    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
> [    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
> [    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
> [    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
> [    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
> [    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
> [    5.502028] Call trace:
> [    5.504549]  qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> [    5.511344]  glink_subdev_stop+0x1c/0x30 [qcom_common]
> [    5.516622]  rproc_stop+0x58/0x17c
> [    5.520127]  rproc_trigger_recovery+0xb0/0x150
> [    5.524693]  rproc_crash_handler_work+0xa4/0xc4
> [    5.529346]  process_scheduled_works+0x18c/0x2d8
> [    5.534092]  worker_thread+0x144/0x280
> [    5.537952]  kthread+0x124/0x138
> [    5.541280]  ret_from_fork+0x10/0x20
> [    5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> [    5.551224] ---[ end trace 0000000000000000 ]---
>
> So set recovery_disabled during rproc_add().
>
> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> ---
>  drivers/remoteproc/remoteproc_core.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index b087ed21858a..f66dde712cec 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
>  {
>  	struct device *dev = &rproc->dev;
>  	int ret;
> +	bool rproc_recovery_save;
>
> +	rproc_recovery_save  = rproc->recovery_disabled;
> +	rproc->recovery_disabled = true;
>  	ret = rproc_validate(rproc);
>  	if (ret < 0)
>  		return ret;
> @@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
>  	list_add_rcu(&rproc->node, &rproc_list);
>  	mutex_unlock(&rproc_list_mutex);
>
> +	rproc->recovery_disabled = rproc_recovery_save;
>  	return 0;
>
>  rproc_remove_dev:
>
> --
> 2.25.1
>
>

Ideally things like this would be passed to the rproc core in some kind of a
config structure and only set when registration succeeds. This looks to me
like papering over the real issue and I think it's still racy as there's no
true synchronization.

Wouldn't it be better to take rproc->lock for the entire duration of
rproc_add()? It's already initialized in rproc_alloc().

Bart
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Dmitry Baryshkov 4 weeks, 1 day ago
On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> On Tue, 10 Mar 2026 11:03:21 +0100, Jingyi Wang
> <jingyi.wang@oss.qualcomm.com> said:
> > rproc_add() called by rproc probe function failure will tear down all
> > the resources including do device_del() and remove subdev etc. If
> > rproc_report_crash() is called in this path, the rproc_crash_handler_work
> > could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> > be called with recovery enabled, which may cause NULL pointer dereference
> > if the resource has already been cleaned up.
> >
> > [    5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> > [    5.260499] Mem abort info:
> > [    5.263384]   ESR = 0x0000000096000006
> > [    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [    5.272711]   SET = 0, FnV = 0
> > [    5.275865]   EA = 0, S1PTW = 0
> > [    5.279106]   FSC = 0x06: level 2 translation fault
> > [    5.284125] Data abort info:
> > [    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> > [    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> > [    5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> > [    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
> > [    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> > [    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> > [    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> > [    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [    5.425308] sp : ffff800080ffbc90
> > [    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
> > [    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
> > [    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
> > [    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
> > [    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
> > [    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
> > [    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
> > [    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
> > [    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
> > [    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
> > [    5.502028] Call trace:
> > [    5.504549]  qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> > [    5.511344]  glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [    5.516622]  rproc_stop+0x58/0x17c
> > [    5.520127]  rproc_trigger_recovery+0xb0/0x150
> > [    5.524693]  rproc_crash_handler_work+0xa4/0xc4
> > [    5.529346]  process_scheduled_works+0x18c/0x2d8
> > [    5.534092]  worker_thread+0x144/0x280
> > [    5.537952]  kthread+0x124/0x138
> > [    5.541280]  ret_from_fork+0x10/0x20
> > [    5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> > [    5.551224] ---[ end trace 0000000000000000 ]---
> >
> > So set recovery_disabled during rproc_add().
> >
> > Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> > ---
> >  drivers/remoteproc/remoteproc_core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index b087ed21858a..f66dde712cec 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
> >  {
> >  	struct device *dev = &rproc->dev;
> >  	int ret;
> > +	bool rproc_recovery_save;
> >
> > +	rproc_recovery_save  = rproc->recovery_disabled;
> > +	rproc->recovery_disabled = true;
> >  	ret = rproc_validate(rproc);
> >  	if (ret < 0)
> >  		return ret;
> > @@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
> >  	list_add_rcu(&rproc->node, &rproc_list);
> >  	mutex_unlock(&rproc_list_mutex);
> >
> > +	rproc->recovery_disabled = rproc_recovery_save;
> >  	return 0;
> >
> >  rproc_remove_dev:
> >
> > --
> > 2.25.1
> >
> >
> 
> Ideally things like this would be passed to the rproc core in some kind of a
> config structure and only set when registration succeeds. This looks to me
> like papering over the real issue and I think it's still racy as there's no
> true synchronization.
> 
> Wouldn't it be better to take rproc->lock for the entire duration of
> rproc_add()? It's already initialized in rproc_alloc().

It would still be racy as rproc_trigger_recovery() is called outside of
the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
must explicitly call cancel_work_sync() on the crash_handler work (and
any other work items that can be scheduled).

> 
> Bart

-- 
With best wishes
Dmitry
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Bartosz Golaszewski 4 weeks ago
On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
<dmitry.baryshkov@oss.qualcomm.com> said:
> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
>>
>> Ideally things like this would be passed to the rproc core in some kind of a
>> config structure and only set when registration succeeds. This looks to me
>> like papering over the real issue and I think it's still racy as there's no
>> true synchronization.
>>
>> Wouldn't it be better to take rproc->lock for the entire duration of
>> rproc_add()? It's already initialized in rproc_alloc().
>
> It would still be racy as rproc_trigger_recovery() is called outside of
> the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
> must explicitly call cancel_work_sync() on the crash_handler work (and
> any other work items that can be scheduled).
>

This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
but releases it right before calling inspecting rproc->recovery_disabled and
calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
lock and rptoc_trigger_recovery() should enforce it being taken before the
call.

Bart
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Dmitry Baryshkov 3 weeks, 6 days ago
On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
> <dmitry.baryshkov@oss.qualcomm.com> said:
> > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> >>
> >> Ideally things like this would be passed to the rproc core in some kind of a
> >> config structure and only set when registration succeeds. This looks to me
> >> like papering over the real issue and I think it's still racy as there's no
> >> true synchronization.
> >>
> >> Wouldn't it be better to take rproc->lock for the entire duration of
> >> rproc_add()? It's already initialized in rproc_alloc().
> >
> > It would still be racy as rproc_trigger_recovery() is called outside of
> > the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
> > must explicitly call cancel_work_sync() on the crash_handler work (and
> > any other work items that can be scheduled).
> >
> 
> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
> but releases it right before calling inspecting rproc->recovery_disabled and
> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
> lock and rptoc_trigger_recovery() should enforce it being taken before the
> call.

Yes. Nevertheless the driver should cancel the work too.

-- 
With best wishes
Dmitry
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Jingyi Wang 3 weeks ago

On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote:
> On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
>> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
>> <dmitry.baryshkov@oss.qualcomm.com> said:
>>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
>>>>
>>>> Ideally things like this would be passed to the rproc core in some kind of a
>>>> config structure and only set when registration succeeds. This looks to me
>>>> like papering over the real issue and I think it's still racy as there's no
>>>> true synchronization.
>>>>
>>>> Wouldn't it be better to take rproc->lock for the entire duration of
>>>> rproc_add()? It's already initialized in rproc_alloc().
>>>
>>> It would still be racy as rproc_trigger_recovery() is called outside of
>>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
>>> must explicitly call cancel_work_sync() on the crash_handler work (and
>>> any other work items that can be scheduled).
>>>
>>
>> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
>> but releases it right before calling inspecting rproc->recovery_disabled and
>> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
>> lock and rptoc_trigger_recovery() should enforce it being taken before the
>> call.
> 
> Yes. Nevertheless the driver should cancel the work too.
> 

Hi Dmitry & Bartosz,

rproc_crash_handler_work() may call rproc_trigger_recovery() and
rproc_add() may call rproc_boot(), both the function have already
hold the lock. And the lock cannot protect resources like glink_subdev
in the patch.

And there is a possible case for cancel_work, rproc_add tear down call
cancel work and wait for the work finished, the reboot run successfully,
and the tear down continued and the resources all released, including sysfs
and glink_subdev.

Indeed recovery_disabled is kind of hacky.
The root cause for this issue is that for remoteproc with RPROC_OFFLINE
state, the rproc_start will be called asynchronously, but for the remoteproc
with RPROC_DETACHED, the attach function is called directly, the failure
in this path will cause the rproc_add() fail and the resource release.
I think the current patch can be dropped, we are thinking about make rproc_attach
called asynchronously to avoid this race.

Thanks,
Jingyi
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Dmitry Baryshkov 3 weeks ago
On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote:
> 
> 
> On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote:
> > On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
> > > On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
> > > <dmitry.baryshkov@oss.qualcomm.com> said:
> > > > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> > > > > 
> > > > > Ideally things like this would be passed to the rproc core in some kind of a
> > > > > config structure and only set when registration succeeds. This looks to me
> > > > > like papering over the real issue and I think it's still racy as there's no
> > > > > true synchronization.
> > > > > 
> > > > > Wouldn't it be better to take rproc->lock for the entire duration of
> > > > > rproc_add()? It's already initialized in rproc_alloc().
> > > > 
> > > > It would still be racy as rproc_trigger_recovery() is called outside of
> > > > the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
> > > > must explicitly call cancel_work_sync() on the crash_handler work (and
> > > > any other work items that can be scheduled).
> > > > 
> > > 
> > > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
> > > but releases it right before calling inspecting rproc->recovery_disabled and
> > > calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
> > > lock and rptoc_trigger_recovery() should enforce it being taken before the
> > > call.
> > 
> > Yes. Nevertheless the driver should cancel the work too.
> > 
> 
> Hi Dmitry & Bartosz,
> 
> rproc_crash_handler_work() may call rproc_trigger_recovery() and
> rproc_add() may call rproc_boot(), both the function have already
> hold the lock. And the lock cannot protect resources like glink_subdev
> in the patch.
> 
> And there is a possible case for cancel_work, rproc_add tear down call
> cancel work and wait for the work finished, the reboot run successfully,
> and the tear down continued and the resources all released, including sysfs
> and glink_subdev.
> 
> Indeed recovery_disabled is kind of hacky.
> The root cause for this issue is that for remoteproc with RPROC_OFFLINE
> state, the rproc_start will be called asynchronously, but for the remoteproc
> with RPROC_DETACHED, the attach function is called directly, the failure
> in this path will cause the rproc_add() fail and the resource release.
> I think the current patch can be dropped, we are thinking about make rproc_attach
> called asynchronously to avoid this race.

Isn't this patch necessary for SoCCP bringup? If not, why did you
include it into the series?

-- 
With best wishes
Dmitry
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Jingyi Wang 3 weeks ago

On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote:
> On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote:
>>
>>
>> On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote:
>>> On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
>>>> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
>>>> <dmitry.baryshkov@oss.qualcomm.com> said:
>>>>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
>>>>>>
>>>>>> Ideally things like this would be passed to the rproc core in some kind of a
>>>>>> config structure and only set when registration succeeds. This looks to me
>>>>>> like papering over the real issue and I think it's still racy as there's no
>>>>>> true synchronization.
>>>>>>
>>>>>> Wouldn't it be better to take rproc->lock for the entire duration of
>>>>>> rproc_add()? It's already initialized in rproc_alloc().
>>>>>
>>>>> It would still be racy as rproc_trigger_recovery() is called outside of
>>>>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
>>>>> must explicitly call cancel_work_sync() on the crash_handler work (and
>>>>> any other work items that can be scheduled).
>>>>>
>>>>
>>>> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
>>>> but releases it right before calling inspecting rproc->recovery_disabled and
>>>> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
>>>> lock and rptoc_trigger_recovery() should enforce it being taken before the
>>>> call.
>>>
>>> Yes. Nevertheless the driver should cancel the work too.
>>>
>>
>> Hi Dmitry & Bartosz,
>>
>> rproc_crash_handler_work() may call rproc_trigger_recovery() and
>> rproc_add() may call rproc_boot(), both the function have already
>> hold the lock. And the lock cannot protect resources like glink_subdev
>> in the patch.
>>
>> And there is a possible case for cancel_work, rproc_add tear down call
>> cancel work and wait for the work finished, the reboot run successfully,
>> and the tear down continued and the resources all released, including sysfs
>> and glink_subdev.
>>
>> Indeed recovery_disabled is kind of hacky.
>> The root cause for this issue is that for remoteproc with RPROC_OFFLINE
>> state, the rproc_start will be called asynchronously, but for the remoteproc
>> with RPROC_DETACHED, the attach function is called directly, the failure
>> in this path will cause the rproc_add() fail and the resource release.
>> I think the current patch can be dropped, we are thinking about make rproc_attach
>> called asynchronously to avoid this race.
> 
> Isn't this patch necessary for SoCCP bringup? If not, why did you
> include it into the series?
> 
yes, will squash to soccp patch in next versoin.

Thanks,
Jingyi
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Bjorn Andersson 2 days, 16 hours ago
On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote:
> 
> 
> On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote:
> > On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote:
> > > 
> > > 
> > > On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote:
> > > > On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
> > > > > On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
> > > > > <dmitry.baryshkov@oss.qualcomm.com> said:
> > > > > > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> > > > > > > 
> > > > > > > Ideally things like this would be passed to the rproc core in some kind of a
> > > > > > > config structure and only set when registration succeeds. This looks to me
> > > > > > > like papering over the real issue and I think it's still racy as there's no
> > > > > > > true synchronization.
> > > > > > > 
> > > > > > > Wouldn't it be better to take rproc->lock for the entire duration of
> > > > > > > rproc_add()? It's already initialized in rproc_alloc().
> > > > > > 
> > > > > > It would still be racy as rproc_trigger_recovery() is called outside of
> > > > > > the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
> > > > > > must explicitly call cancel_work_sync() on the crash_handler work (and
> > > > > > any other work items that can be scheduled).
> > > > > > 
> > > > > 
> > > > > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
> > > > > but releases it right before calling inspecting rproc->recovery_disabled and
> > > > > calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
> > > > > lock and rptoc_trigger_recovery() should enforce it being taken before the
> > > > > call.
> > > > 
> > > > Yes. Nevertheless the driver should cancel the work too.
> > > > 
> > > 
> > > Hi Dmitry & Bartosz,
> > > 
> > > rproc_crash_handler_work() may call rproc_trigger_recovery() and
> > > rproc_add() may call rproc_boot(), both the function have already
> > > hold the lock. And the lock cannot protect resources like glink_subdev
> > > in the patch.
> > > 
> > > And there is a possible case for cancel_work, rproc_add tear down call
> > > cancel work and wait for the work finished, the reboot run successfully,
> > > and the tear down continued and the resources all released, including sysfs
> > > and glink_subdev.
> > > 
> > > Indeed recovery_disabled is kind of hacky.
> > > The root cause for this issue is that for remoteproc with RPROC_OFFLINE
> > > state, the rproc_start will be called asynchronously, but for the remoteproc
> > > with RPROC_DETACHED, the attach function is called directly, the failure
> > > in this path will cause the rproc_add() fail and the resource release.
> > > I think the current patch can be dropped, we are thinking about make rproc_attach
> > > called asynchronously to avoid this race.
> > 
> > Isn't this patch necessary for SoCCP bringup? If not, why did you
> > include it into the series?
> > 
> yes, will squash to soccp patch in next versoin.

I'm sorry, but that doesn't make sense to me.

The SoCCP patch adds support for attaching SoCCP. This change tries to
address a generic problem shared across all remoteproc drivers (that
does attach?).

I think you should interpret Dmitry's comment as "why is this part of
this series, please fix this problem in a separate series".

Regards,
Bjorn
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Jingyi Wang 2 days, 4 hours ago

On 4/6/2026 11:04 PM, Bjorn Andersson wrote:
> On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote:
>>
>>
>> On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote:
>>> On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote:
>>>>
>>>>
>>>> On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote:
>>>>> On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote:
>>>>>> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov
>>>>>> <dmitry.baryshkov@oss.qualcomm.com> said:
>>>>>>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
>>>>>>>>
>>>>>>>> Ideally things like this would be passed to the rproc core in some kind of a
>>>>>>>> config structure and only set when registration succeeds. This looks to me
>>>>>>>> like papering over the real issue and I think it's still racy as there's no
>>>>>>>> true synchronization.
>>>>>>>>
>>>>>>>> Wouldn't it be better to take rproc->lock for the entire duration of
>>>>>>>> rproc_add()? It's already initialized in rproc_alloc().
>>>>>>>
>>>>>>> It would still be racy as rproc_trigger_recovery() is called outside of
>>>>>>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
>>>>>>> must explicitly call cancel_work_sync() on the crash_handler work (and
>>>>>>> any other work items that can be scheduled).
>>>>>>>
>>>>>>
>>>>>> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock,
>>>>>> but releases it right before calling inspecting rproc->recovery_disabled and
>>>>>> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the
>>>>>> lock and rptoc_trigger_recovery() should enforce it being taken before the
>>>>>> call.
>>>>>
>>>>> Yes. Nevertheless the driver should cancel the work too.
>>>>>
>>>>
>>>> Hi Dmitry & Bartosz,
>>>>
>>>> rproc_crash_handler_work() may call rproc_trigger_recovery() and
>>>> rproc_add() may call rproc_boot(), both the function have already
>>>> hold the lock. And the lock cannot protect resources like glink_subdev
>>>> in the patch.
>>>>
>>>> And there is a possible case for cancel_work, rproc_add tear down call
>>>> cancel work and wait for the work finished, the reboot run successfully,
>>>> and the tear down continued and the resources all released, including sysfs
>>>> and glink_subdev.
>>>>
>>>> Indeed recovery_disabled is kind of hacky.
>>>> The root cause for this issue is that for remoteproc with RPROC_OFFLINE
>>>> state, the rproc_start will be called asynchronously, but for the remoteproc
>>>> with RPROC_DETACHED, the attach function is called directly, the failure
>>>> in this path will cause the rproc_add() fail and the resource release.
>>>> I think the current patch can be dropped, we are thinking about make rproc_attach
>>>> called asynchronously to avoid this race.
>>>
>>> Isn't this patch necessary for SoCCP bringup? If not, why did you
>>> include it into the series?
>>>
>> yes, will squash to soccp patch in next versoin.
> 
> I'm sorry, but that doesn't make sense to me.
> 
> The SoCCP patch adds support for attaching SoCCP. This change tries to
> address a generic problem shared across all remoteproc drivers (that
> does attach?).
> 
> I think you should interpret Dmitry's comment as "why is this part of
> this series, please fix this problem in a separate series".
> 
> Regards,
> Bjorn

Sorry I might misunderstand this comment, this patch only address problem for
remoteproc that does attach, I will send a separate series to make rproc_attach
called asynchronously

Thanks,
Jingyi
Re: [PATCH v4 5/7] remoteproc: core: set recovery_disabled when doing rproc_add()
Posted by Dmitry Baryshkov 2 days, 11 hours ago
On Mon, Apr 06, 2026 at 10:04:03AM -0500, Bjorn Andersson wrote:
> On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote:
> > 
> > 
> > On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote:
> > > 
> > > Isn't this patch necessary for SoCCP bringup? If not, why did you
> > > include it into the series?
> > > 
> > yes, will squash to soccp patch in next versoin.
> 
> I'm sorry, but that doesn't make sense to me.
> 
> The SoCCP patch adds support for attaching SoCCP. This change tries to
> address a generic problem shared across all remoteproc drivers (that
> does attach?).
> 
> I think you should interpret Dmitry's comment as "why is this part of
> this series, please fix this problem in a separate series".

Exactly.


-- 
With best wishes
Dmitry