rproc_add() called by rproc probe function failure will tear down all
the resources including do device_del() and remove subdev etc. If
rproc_report_crash() is called in this path, the rproc_crash_handler_work
could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
be called with recovery enabled, which may cause NULL pointer dereference
if the resource has already been cleaned up.
[ 5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
[ 5.260499] Mem abort info:
[ 5.263384] ESR = 0x0000000096000006
[ 5.267248] EC = 0x25: DABT (current EL), IL = 32 bits
[ 5.272711] SET = 0, FnV = 0
[ 5.275865] EA = 0, S1PTW = 0
[ 5.279106] FSC = 0x06: level 2 translation fault
[ 5.284125] Data abort info:
[ 5.287101] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[ 5.292742] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 5.297939] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
[ 5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
[ 5.320917] Internal error: Oops: 0000000096000006 [#1] SMP
[ 5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
[ 5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
[ 5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
[ 5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
[ 5.425308] sp : ffff800080ffbc90
[ 5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
[ 5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
[ 5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
[ 5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
[ 5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
[ 5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
[ 5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
[ 5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
[ 5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
[ 5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
[ 5.502028] Call trace:
[ 5.504549] qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
[ 5.511344] glink_subdev_stop+0x1c/0x30 [qcom_common]
[ 5.516622] rproc_stop+0x58/0x17c
[ 5.520127] rproc_trigger_recovery+0xb0/0x150
[ 5.524693] rproc_crash_handler_work+0xa4/0xc4
[ 5.529346] process_scheduled_works+0x18c/0x2d8
[ 5.534092] worker_thread+0x144/0x280
[ 5.537952] kthread+0x124/0x138
[ 5.541280] ret_from_fork+0x10/0x20
[ 5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
[ 5.551224] ---[ end trace 0000000000000000 ]---
So set recovery_disabled during rproc_add().
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
drivers/remoteproc/remoteproc_core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index b087ed21858a..f66dde712cec 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
{
struct device *dev = &rproc->dev;
int ret;
+ bool rproc_recovery_save;
+ rproc_recovery_save = rproc->recovery_disabled;
+ rproc->recovery_disabled = true;
ret = rproc_validate(rproc);
if (ret < 0)
return ret;
@@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
list_add_rcu(&rproc->node, &rproc_list);
mutex_unlock(&rproc_list_mutex);
+ rproc->recovery_disabled = rproc_recovery_save;
return 0;
rproc_remove_dev:
--
2.25.1
On Tue, Mar 10, 2026 at 03:03:21AM -0700, Jingyi Wang wrote: > rproc_add() called by rproc probe function failure will tear down all > the resources including do device_del() and remove subdev etc. If > rproc_report_crash() is called in this path, the rproc_crash_handler_work > could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will > be called with recovery enabled, which may cause NULL pointer dereference > if the resource has already been cleaned up. > > [ 5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300 > [ 5.260499] Mem abort info: > [ 5.263384] ESR = 0x0000000096000006 > [ 5.267248] EC = 0x25: DABT (current EL), IL = 32 bits > [ 5.272711] SET = 0, FnV = 0 > [ 5.275865] EA = 0, S1PTW = 0 > [ 5.279106] FSC = 0x06: level 2 translation fault > [ 5.284125] Data abort info: > [ 5.287101] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 > [ 5.292742] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > [ 5.297939] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [ 5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000 > [ 5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000 > [ 5.320917] Internal error: Oops: 0000000096000006 [#1] SMP > [ 5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT) > [ 5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work > [ 5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) > [ 5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] > [ 5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common] > [ 5.425308] sp : ffff800080ffbc90 > [ 5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00 > [ 5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb > [ 5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000 > [ 5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010 > [ 5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8 > [ 5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0 > [ 5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818 > [ 5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001 > [ 5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000 > [ 5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000 Please trim your commit msg from irrelevant data, so this will be easier to read. Best regards, Krzysztof
On 3/11/2026 2:20 PM, Krzysztof Kozlowski wrote: > On Tue, Mar 10, 2026 at 03:03:21AM -0700, Jingyi Wang wrote: >> rproc_add() called by rproc probe function failure will tear down all >> the resources including do device_del() and remove subdev etc. If >> rproc_report_crash() is called in this path, the rproc_crash_handler_work >> could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will >> be called with recovery enabled, which may cause NULL pointer dereference >> if the resource has already been cleaned up. >> >> [ 5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300 >> [ 5.260499] Mem abort info: >> [ 5.263384] ESR = 0x0000000096000006 >> [ 5.267248] EC = 0x25: DABT (current EL), IL = 32 bits >> [ 5.272711] SET = 0, FnV = 0 >> [ 5.275865] EA = 0, S1PTW = 0 >> [ 5.279106] FSC = 0x06: level 2 translation fault >> [ 5.284125] Data abort info: >> [ 5.287101] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 >> [ 5.292742] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >> [ 5.297939] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >> [ 5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000 >> [ 5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000 >> [ 5.320917] Internal error: Oops: 0000000096000006 [#1] SMP >> [ 5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT) >> [ 5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work >> [ 5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) >> [ 5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] >> [ 5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common] >> [ 5.425308] sp : ffff800080ffbc90 >> [ 5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00 >> [ 5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb >> [ 5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000 >> [ 5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010 >> [ 5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8 >> [ 5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0 >> [ 5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818 >> [ 5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001 >> [ 5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000 >> [ 5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000 > > Please trim your commit msg from irrelevant data, so this will be easier > to read. > > Best regards, > Krzysztof > Well noted. Thanks, Jingyi
On Tue, 10 Mar 2026 11:03:21 +0100, Jingyi Wang
<jingyi.wang@oss.qualcomm.com> said:
> rproc_add() called by rproc probe function failure will tear down all
> the resources including do device_del() and remove subdev etc. If
> rproc_report_crash() is called in this path, the rproc_crash_handler_work
> could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> be called with recovery enabled, which may cause NULL pointer dereference
> if the resource has already been cleaned up.
>
> [ 5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> [ 5.260499] Mem abort info:
> [ 5.263384] ESR = 0x0000000096000006
> [ 5.267248] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 5.272711] SET = 0, FnV = 0
> [ 5.275865] EA = 0, S1PTW = 0
> [ 5.279106] FSC = 0x06: level 2 translation fault
> [ 5.284125] Data abort info:
> [ 5.287101] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> [ 5.292742] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 5.297939] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> [ 5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> [ 5.320917] Internal error: Oops: 0000000096000006 [#1] SMP
> [ 5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> [ 5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> [ 5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> [ 5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> [ 5.425308] sp : ffff800080ffbc90
> [ 5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
> [ 5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
> [ 5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
> [ 5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
> [ 5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
> [ 5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
> [ 5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
> [ 5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
> [ 5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
> [ 5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
> [ 5.502028] Call trace:
> [ 5.504549] qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> [ 5.511344] glink_subdev_stop+0x1c/0x30 [qcom_common]
> [ 5.516622] rproc_stop+0x58/0x17c
> [ 5.520127] rproc_trigger_recovery+0xb0/0x150
> [ 5.524693] rproc_crash_handler_work+0xa4/0xc4
> [ 5.529346] process_scheduled_works+0x18c/0x2d8
> [ 5.534092] worker_thread+0x144/0x280
> [ 5.537952] kthread+0x124/0x138
> [ 5.541280] ret_from_fork+0x10/0x20
> [ 5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> [ 5.551224] ---[ end trace 0000000000000000 ]---
>
> So set recovery_disabled during rproc_add().
>
> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> ---
> drivers/remoteproc/remoteproc_core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index b087ed21858a..f66dde712cec 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
> {
> struct device *dev = &rproc->dev;
> int ret;
> + bool rproc_recovery_save;
>
> + rproc_recovery_save = rproc->recovery_disabled;
> + rproc->recovery_disabled = true;
> ret = rproc_validate(rproc);
> if (ret < 0)
> return ret;
> @@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
> list_add_rcu(&rproc->node, &rproc_list);
> mutex_unlock(&rproc_list_mutex);
>
> + rproc->recovery_disabled = rproc_recovery_save;
> return 0;
>
> rproc_remove_dev:
>
> --
> 2.25.1
>
>
Ideally things like this would be passed to the rproc core in some kind of a
config structure and only set when registration succeeds. This looks to me
like papering over the real issue and I think it's still racy as there's no
true synchronization.
Wouldn't it be better to take rproc->lock for the entire duration of
rproc_add()? It's already initialized in rproc_alloc().
Bart
On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> On Tue, 10 Mar 2026 11:03:21 +0100, Jingyi Wang
> <jingyi.wang@oss.qualcomm.com> said:
> > rproc_add() called by rproc probe function failure will tear down all
> > the resources including do device_del() and remove subdev etc. If
> > rproc_report_crash() is called in this path, the rproc_crash_handler_work
> > could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> > be called with recovery enabled, which may cause NULL pointer dereference
> > if the resource has already been cleaned up.
> >
> > [ 5.251483] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> > [ 5.260499] Mem abort info:
> > [ 5.263384] ESR = 0x0000000096000006
> > [ 5.267248] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 5.272711] SET = 0, FnV = 0
> > [ 5.275865] EA = 0, S1PTW = 0
> > [ 5.279106] FSC = 0x06: level 2 translation fault
> > [ 5.284125] Data abort info:
> > [ 5.287101] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> > [ 5.292742] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [ 5.297939] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [ 5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> > [ 5.310022] [0000000000000300] pgd=080000089e087403, p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> > [ 5.320917] Internal error: Oops: 0000000096000006 [#1] SMP
> > [ 5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> > [ 5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> > [ 5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [ 5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> > [ 5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [ 5.425308] sp : ffff800080ffbc90
> > [ 5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: ffff000800059c00
> > [ 5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 61c8864680b583eb
> > [ 5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: ffff00081be83000
> > [ 5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 0000000000000010
> > [ 5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0008042684f8
> > [ 5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: ffffd37f69f967a0
> > [ 5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : ffffd37f69fee818
> > [ 5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
> > [ 5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 0000000000000000
> > [ 5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 0000000000000000
> > [ 5.502028] Call trace:
> > [ 5.504549] qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> > [ 5.511344] glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [ 5.516622] rproc_stop+0x58/0x17c
> > [ 5.520127] rproc_trigger_recovery+0xb0/0x150
> > [ 5.524693] rproc_crash_handler_work+0xa4/0xc4
> > [ 5.529346] process_scheduled_works+0x18c/0x2d8
> > [ 5.534092] worker_thread+0x144/0x280
> > [ 5.537952] kthread+0x124/0x138
> > [ 5.541280] ret_from_fork+0x10/0x20
> > [ 5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> > [ 5.551224] ---[ end trace 0000000000000000 ]---
> >
> > So set recovery_disabled during rproc_add().
> >
> > Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> > ---
> > drivers/remoteproc/remoteproc_core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index b087ed21858a..f66dde712cec 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
> > {
> > struct device *dev = &rproc->dev;
> > int ret;
> > + bool rproc_recovery_save;
> >
> > + rproc_recovery_save = rproc->recovery_disabled;
> > + rproc->recovery_disabled = true;
> > ret = rproc_validate(rproc);
> > if (ret < 0)
> > return ret;
> > @@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
> > list_add_rcu(&rproc->node, &rproc_list);
> > mutex_unlock(&rproc_list_mutex);
> >
> > + rproc->recovery_disabled = rproc_recovery_save;
> > return 0;
> >
> > rproc_remove_dev:
> >
> > --
> > 2.25.1
> >
> >
>
> Ideally things like this would be passed to the rproc core in some kind of a
> config structure and only set when registration succeeds. This looks to me
> like papering over the real issue and I think it's still racy as there's no
> true synchronization.
>
> Wouldn't it be better to take rproc->lock for the entire duration of
> rproc_add()? It's already initialized in rproc_alloc().
It would still be racy as rproc_trigger_recovery() is called outside of
the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
must explicitly call cancel_work_sync() on the crash_handler work (and
any other work items that can be scheduled).
>
> Bart
--
With best wishes
Dmitry
On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> said: > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: >> >> Ideally things like this would be passed to the rproc core in some kind of a >> config structure and only set when registration succeeds. This looks to me >> like papering over the real issue and I think it's still racy as there's no >> true synchronization. >> >> Wouldn't it be better to take rproc->lock for the entire duration of >> rproc_add()? It's already initialized in rproc_alloc(). > > It would still be racy as rproc_trigger_recovery() is called outside of > the lock. Instead the error cleanup path (and BTW, rproc_del() path too) > must explicitly call cancel_work_sync() on the crash_handler work (and > any other work items that can be scheduled). > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, but releases it right before calling inspecting rproc->recovery_disabled and calling rproc_trigger_recovery(). It looks wrong, I think it should keep the lock and rptoc_trigger_recovery() should enforce it being taken before the call. Bart
On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: > On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov > <dmitry.baryshkov@oss.qualcomm.com> said: > > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: > >> > >> Ideally things like this would be passed to the rproc core in some kind of a > >> config structure and only set when registration succeeds. This looks to me > >> like papering over the real issue and I think it's still racy as there's no > >> true synchronization. > >> > >> Wouldn't it be better to take rproc->lock for the entire duration of > >> rproc_add()? It's already initialized in rproc_alloc(). > > > > It would still be racy as rproc_trigger_recovery() is called outside of > > the lock. Instead the error cleanup path (and BTW, rproc_del() path too) > > must explicitly call cancel_work_sync() on the crash_handler work (and > > any other work items that can be scheduled). > > > > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, > but releases it right before calling inspecting rproc->recovery_disabled and > calling rproc_trigger_recovery(). It looks wrong, I think it should keep the > lock and rptoc_trigger_recovery() should enforce it being taken before the > call. Yes. Nevertheless the driver should cancel the work too. -- With best wishes Dmitry
On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote: > On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: >> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov >> <dmitry.baryshkov@oss.qualcomm.com> said: >>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: >>>> >>>> Ideally things like this would be passed to the rproc core in some kind of a >>>> config structure and only set when registration succeeds. This looks to me >>>> like papering over the real issue and I think it's still racy as there's no >>>> true synchronization. >>>> >>>> Wouldn't it be better to take rproc->lock for the entire duration of >>>> rproc_add()? It's already initialized in rproc_alloc(). >>> >>> It would still be racy as rproc_trigger_recovery() is called outside of >>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too) >>> must explicitly call cancel_work_sync() on the crash_handler work (and >>> any other work items that can be scheduled). >>> >> >> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, >> but releases it right before calling inspecting rproc->recovery_disabled and >> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the >> lock and rptoc_trigger_recovery() should enforce it being taken before the >> call. > > Yes. Nevertheless the driver should cancel the work too. > Hi Dmitry & Bartosz, rproc_crash_handler_work() may call rproc_trigger_recovery() and rproc_add() may call rproc_boot(), both the function have already hold the lock. And the lock cannot protect resources like glink_subdev in the patch. And there is a possible case for cancel_work, rproc_add tear down call cancel work and wait for the work finished, the reboot run successfully, and the tear down continued and the resources all released, including sysfs and glink_subdev. Indeed recovery_disabled is kind of hacky. The root cause for this issue is that for remoteproc with RPROC_OFFLINE state, the rproc_start will be called asynchronously, but for the remoteproc with RPROC_DETACHED, the attach function is called directly, the failure in this path will cause the rproc_add() fail and the resource release. I think the current patch can be dropped, we are thinking about make rproc_attach called asynchronously to avoid this race. Thanks, Jingyi
On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote: > > > On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote: > > On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: > > > On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov > > > <dmitry.baryshkov@oss.qualcomm.com> said: > > > > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: > > > > > > > > > > Ideally things like this would be passed to the rproc core in some kind of a > > > > > config structure and only set when registration succeeds. This looks to me > > > > > like papering over the real issue and I think it's still racy as there's no > > > > > true synchronization. > > > > > > > > > > Wouldn't it be better to take rproc->lock for the entire duration of > > > > > rproc_add()? It's already initialized in rproc_alloc(). > > > > > > > > It would still be racy as rproc_trigger_recovery() is called outside of > > > > the lock. Instead the error cleanup path (and BTW, rproc_del() path too) > > > > must explicitly call cancel_work_sync() on the crash_handler work (and > > > > any other work items that can be scheduled). > > > > > > > > > > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, > > > but releases it right before calling inspecting rproc->recovery_disabled and > > > calling rproc_trigger_recovery(). It looks wrong, I think it should keep the > > > lock and rptoc_trigger_recovery() should enforce it being taken before the > > > call. > > > > Yes. Nevertheless the driver should cancel the work too. > > > > Hi Dmitry & Bartosz, > > rproc_crash_handler_work() may call rproc_trigger_recovery() and > rproc_add() may call rproc_boot(), both the function have already > hold the lock. And the lock cannot protect resources like glink_subdev > in the patch. > > And there is a possible case for cancel_work, rproc_add tear down call > cancel work and wait for the work finished, the reboot run successfully, > and the tear down continued and the resources all released, including sysfs > and glink_subdev. > > Indeed recovery_disabled is kind of hacky. > The root cause for this issue is that for remoteproc with RPROC_OFFLINE > state, the rproc_start will be called asynchronously, but for the remoteproc > with RPROC_DETACHED, the attach function is called directly, the failure > in this path will cause the rproc_add() fail and the resource release. > I think the current patch can be dropped, we are thinking about make rproc_attach > called asynchronously to avoid this race. Isn't this patch necessary for SoCCP bringup? If not, why did you include it into the series? -- With best wishes Dmitry
On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote: > On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote: >> >> >> On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote: >>> On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: >>>> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov >>>> <dmitry.baryshkov@oss.qualcomm.com> said: >>>>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: >>>>>> >>>>>> Ideally things like this would be passed to the rproc core in some kind of a >>>>>> config structure and only set when registration succeeds. This looks to me >>>>>> like papering over the real issue and I think it's still racy as there's no >>>>>> true synchronization. >>>>>> >>>>>> Wouldn't it be better to take rproc->lock for the entire duration of >>>>>> rproc_add()? It's already initialized in rproc_alloc(). >>>>> >>>>> It would still be racy as rproc_trigger_recovery() is called outside of >>>>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too) >>>>> must explicitly call cancel_work_sync() on the crash_handler work (and >>>>> any other work items that can be scheduled). >>>>> >>>> >>>> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, >>>> but releases it right before calling inspecting rproc->recovery_disabled and >>>> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the >>>> lock and rptoc_trigger_recovery() should enforce it being taken before the >>>> call. >>> >>> Yes. Nevertheless the driver should cancel the work too. >>> >> >> Hi Dmitry & Bartosz, >> >> rproc_crash_handler_work() may call rproc_trigger_recovery() and >> rproc_add() may call rproc_boot(), both the function have already >> hold the lock. And the lock cannot protect resources like glink_subdev >> in the patch. >> >> And there is a possible case for cancel_work, rproc_add tear down call >> cancel work and wait for the work finished, the reboot run successfully, >> and the tear down continued and the resources all released, including sysfs >> and glink_subdev. >> >> Indeed recovery_disabled is kind of hacky. >> The root cause for this issue is that for remoteproc with RPROC_OFFLINE >> state, the rproc_start will be called asynchronously, but for the remoteproc >> with RPROC_DETACHED, the attach function is called directly, the failure >> in this path will cause the rproc_add() fail and the resource release. >> I think the current patch can be dropped, we are thinking about make rproc_attach >> called asynchronously to avoid this race. > > Isn't this patch necessary for SoCCP bringup? If not, why did you > include it into the series? > yes, will squash to soccp patch in next versoin. Thanks, Jingyi
On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote: > > > On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote: > > On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote: > > > > > > > > > On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote: > > > > On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: > > > > > On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov > > > > > <dmitry.baryshkov@oss.qualcomm.com> said: > > > > > > On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: > > > > > > > > > > > > > > Ideally things like this would be passed to the rproc core in some kind of a > > > > > > > config structure and only set when registration succeeds. This looks to me > > > > > > > like papering over the real issue and I think it's still racy as there's no > > > > > > > true synchronization. > > > > > > > > > > > > > > Wouldn't it be better to take rproc->lock for the entire duration of > > > > > > > rproc_add()? It's already initialized in rproc_alloc(). > > > > > > > > > > > > It would still be racy as rproc_trigger_recovery() is called outside of > > > > > > the lock. Instead the error cleanup path (and BTW, rproc_del() path too) > > > > > > must explicitly call cancel_work_sync() on the crash_handler work (and > > > > > > any other work items that can be scheduled). > > > > > > > > > > > > > > > > This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, > > > > > but releases it right before calling inspecting rproc->recovery_disabled and > > > > > calling rproc_trigger_recovery(). It looks wrong, I think it should keep the > > > > > lock and rptoc_trigger_recovery() should enforce it being taken before the > > > > > call. > > > > > > > > Yes. Nevertheless the driver should cancel the work too. > > > > > > > > > > Hi Dmitry & Bartosz, > > > > > > rproc_crash_handler_work() may call rproc_trigger_recovery() and > > > rproc_add() may call rproc_boot(), both the function have already > > > hold the lock. And the lock cannot protect resources like glink_subdev > > > in the patch. > > > > > > And there is a possible case for cancel_work, rproc_add tear down call > > > cancel work and wait for the work finished, the reboot run successfully, > > > and the tear down continued and the resources all released, including sysfs > > > and glink_subdev. > > > > > > Indeed recovery_disabled is kind of hacky. > > > The root cause for this issue is that for remoteproc with RPROC_OFFLINE > > > state, the rproc_start will be called asynchronously, but for the remoteproc > > > with RPROC_DETACHED, the attach function is called directly, the failure > > > in this path will cause the rproc_add() fail and the resource release. > > > I think the current patch can be dropped, we are thinking about make rproc_attach > > > called asynchronously to avoid this race. > > > > Isn't this patch necessary for SoCCP bringup? If not, why did you > > include it into the series? > > > yes, will squash to soccp patch in next versoin. I'm sorry, but that doesn't make sense to me. The SoCCP patch adds support for attaching SoCCP. This change tries to address a generic problem shared across all remoteproc drivers (that does attach?). I think you should interpret Dmitry's comment as "why is this part of this series, please fix this problem in a separate series". Regards, Bjorn
On 4/6/2026 11:04 PM, Bjorn Andersson wrote: > On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote: >> >> >> On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote: >>> On Thu, Mar 19, 2026 at 12:36:15PM +0800, Jingyi Wang wrote: >>>> >>>> >>>> On 3/13/2026 10:37 AM, Dmitry Baryshkov wrote: >>>>> On Wed, Mar 11, 2026 at 01:39:50AM -0700, Bartosz Golaszewski wrote: >>>>>> On Wed, 11 Mar 2026 03:11:42 +0100, Dmitry Baryshkov >>>>>> <dmitry.baryshkov@oss.qualcomm.com> said: >>>>>>> On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote: >>>>>>>> >>>>>>>> Ideally things like this would be passed to the rproc core in some kind of a >>>>>>>> config structure and only set when registration succeeds. This looks to me >>>>>>>> like papering over the real issue and I think it's still racy as there's no >>>>>>>> true synchronization. >>>>>>>> >>>>>>>> Wouldn't it be better to take rproc->lock for the entire duration of >>>>>>>> rproc_add()? It's already initialized in rproc_alloc(). >>>>>>> >>>>>>> It would still be racy as rproc_trigger_recovery() is called outside of >>>>>>> the lock. Instead the error cleanup path (and BTW, rproc_del() path too) >>>>>>> must explicitly call cancel_work_sync() on the crash_handler work (and >>>>>>> any other work items that can be scheduled). >>>>>>> >>>>>> >>>>>> This looks weird TBH. For example: rproc_crash_handler_work() takes the lock, >>>>>> but releases it right before calling inspecting rproc->recovery_disabled and >>>>>> calling rproc_trigger_recovery(). It looks wrong, I think it should keep the >>>>>> lock and rptoc_trigger_recovery() should enforce it being taken before the >>>>>> call. >>>>> >>>>> Yes. Nevertheless the driver should cancel the work too. >>>>> >>>> >>>> Hi Dmitry & Bartosz, >>>> >>>> rproc_crash_handler_work() may call rproc_trigger_recovery() and >>>> rproc_add() may call rproc_boot(), both the function have already >>>> hold the lock. And the lock cannot protect resources like glink_subdev >>>> in the patch. >>>> >>>> And there is a possible case for cancel_work, rproc_add tear down call >>>> cancel work and wait for the work finished, the reboot run successfully, >>>> and the tear down continued and the resources all released, including sysfs >>>> and glink_subdev. >>>> >>>> Indeed recovery_disabled is kind of hacky. >>>> The root cause for this issue is that for remoteproc with RPROC_OFFLINE >>>> state, the rproc_start will be called asynchronously, but for the remoteproc >>>> with RPROC_DETACHED, the attach function is called directly, the failure >>>> in this path will cause the rproc_add() fail and the resource release. >>>> I think the current patch can be dropped, we are thinking about make rproc_attach >>>> called asynchronously to avoid this race. >>> >>> Isn't this patch necessary for SoCCP bringup? If not, why did you >>> include it into the series? >>> >> yes, will squash to soccp patch in next versoin. > > I'm sorry, but that doesn't make sense to me. > > The SoCCP patch adds support for attaching SoCCP. This change tries to > address a generic problem shared across all remoteproc drivers (that > does attach?). > > I think you should interpret Dmitry's comment as "why is this part of > this series, please fix this problem in a separate series". > > Regards, > Bjorn Sorry I might misunderstand this comment, this patch only address problem for remoteproc that does attach, I will send a separate series to make rproc_attach called asynchronously Thanks, Jingyi
On Mon, Apr 06, 2026 at 10:04:03AM -0500, Bjorn Andersson wrote: > On Thu, Mar 19, 2026 at 01:44:48PM +0800, Jingyi Wang wrote: > > > > > > On 3/19/2026 1:23 PM, Dmitry Baryshkov wrote: > > > > > > Isn't this patch necessary for SoCCP bringup? If not, why did you > > > include it into the series? > > > > > yes, will squash to soccp patch in next versoin. > > I'm sorry, but that doesn't make sense to me. > > The SoCCP patch adds support for attaching SoCCP. This change tries to > address a generic problem shared across all remoteproc drivers (that > does attach?). > > I think you should interpret Dmitry's comment as "why is this part of > this series, please fix this problem in a separate series". Exactly. -- With best wishes Dmitry
© 2016 - 2026 Red Hat, Inc.