Multiple calls to glink_subdev_stop() for the same remoteproc can happen
if rproc_stop() fails in Process A, leaving the rproc in the
RPROC_CRASHED state. A later write to the recovery_store sysfs node from
user space in Process B then triggers rproc_trigger_recovery() for the
same remoteproc to recover it, which results in a NULL pointer
dereference in qcom_glink_smem_unregister().

Fix it by adding a NULL check in glink_subdev_stop().
   Process-A                              Process-B

   fatal error interrupt happens

   rproc_crash_handler_work()
     mutex_lock_interruptible(&rproc->lock);
     ...
     rproc->state = RPROC_CRASHED;
     ...
     mutex_unlock(&rproc->lock);

     rproc_trigger_recovery()
       mutex_lock_interruptible(&rproc->lock);
       adsp_stop()
         qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
         remoteproc remoteproc3: can't stop rproc: -22
       mutex_unlock(&rproc->lock);

                                          echo enabled > /sys/class/remoteproc/remoteprocX/recovery
                                          recovery_store()
                                            rproc_trigger_recovery()
                                              mutex_lock_interruptible(&rproc->lock);
                                              rproc_stop()
                                                glink_subdev_stop()
                                                  qcom_glink_smem_unregister() ==|
                                                                                 |
                                                                                 V
                                              Unable to handle kernel NULL pointer
                                              dereference at virtual address
                                              0000000000000358
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
- The NULL check could instead live in qcom_glink_smem_unregister(),
  since it is an exported function; however, glink_subdev_stop() is its
  only caller, so doing it with the current approach should also be fine.
drivers/remoteproc/qcom_common.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
index 8c8688f99f0a..52d6c9b99fdb 100644
--- a/drivers/remoteproc/qcom_common.c
+++ b/drivers/remoteproc/qcom_common.c
@@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
 {
 	struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
 
+	if (!glink->edge)
+		return;
+
 	qcom_glink_smem_unregister(glink->edge);
 	glink->edge = NULL;
 }
--
2.34.1
On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> Multiple calls to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails in Process A, leaving the rproc in the
> RPROC_CRASHED state. [...]
>
>        adsp_stop()
>          qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>          remoteproc remoteproc3: can't stop rproc: -22

I presume that at this point this remoteproc is in some undefined state
and the only way to recover is for the user to reboot the machine?

The check for glink->edge avoids one pitfall following this, but I'd
prefer to see a solution that avoids issues in this scenario in the
remoteproc core - rather than working around side effects of this in
different places.

Regards,
Bjorn
On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> I presume that at this point this remoteproc is in some undefined state
> and the only way to recover is for the user to reboot the machine?

Here, the ~50 retries (5s) of the scm shutdown call are failing during
decryption of the remote processor memory region, and I don't think it
has anything to do with the remote processor's state. As a best effort,
more retries could be attempted instead of 50, or we could wait for some
other recovery command like recovery_store() to let it retry again from
the beginning.

> The check for glink->edge avoids one pitfall following this, but I'd
> prefer to see a solution that avoids issues in this scenario in the
> remoteproc core - rather than working around side effects of this in
> different places.

Handling this in the remoteproc core means we may need another state,
something like "RPROC_UNKNOWN", which would be set after one failed
recovery attempt; a subsequent attempt would check it and return
immediately with some log message.

-Mukesh
On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> Here, the ~50 retries (5s) of the scm shutdown call are failing during
> decryption of the remote processor memory region [...] As a best
> effort, more retries could be attempted instead of 50, or we could wait
> for some other recovery command like recovery_store() to let it retry
> again from the beginning.

But are you saying that retrying a bit later would allow us to get out
of this problem? If we just didn't hit the NULL pointer(s)?

How long are we talking about here? Is the timeout too short?

> Handling this in the remoteproc core means we may need another state,
> something like "RPROC_UNKNOWN" [...]

Yes, if we are failing to shut down the remoteproc and there's no way
for us to reliably recover from that (e.g. we are not able to reclaim
the memory), it seems reasonable that we have to mark it using a new
state.

If that is the case, I'd call it RPROC_DEFUNCT (or something like that)
instead, because while in some "unknown" state, from the remoteproc
framework's point of view it's in a well known (broken) state.

Regards,
Bjorn
On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> But are you saying that retrying a bit later would allow us to get out
> of this problem? If we just didn't hit the NULL pointer(s)?

I am not sure whether adding more retries would solve the issue; I was
initially thinking that it is better to retry than to leave the
remoteproc in some unknown state. However, I do see that letting it
retry could lead to unnecessary patching all over the place, e.g. in SSR
notifier callbacks, which do not expect to be called twice as a
side effect of the remoteproc's unknown state.

> How long are we talking about here? Is the timeout too short?

5 seconds is already a very significant amount of time to wait for the
remote processor to recover; we should not change this.

> Yes, if we are failing to shut down the remoteproc and there's no way
> for us to reliably recover from that (e.g. we are not able to reclaim
> the memory), it seems reasonable that we have to mark it using a new
> state.
>
> If that is the case, I'd call it RPROC_DEFUNCT (or something like that)
> instead.

Ack.

-Mukesh
On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> I am not sure whether adding more retries would solve the issue; I was
> initially thinking that it is better to retry than to leave the
> remoteproc in some unknown state. [...]

That's not what I'm asking you. When the remote processor fails to stop,
can you recover the system by just trying a bit later, or is the
remoteproc dead until reboot?

> 5 seconds is already a very significant amount of time to wait for the
> remote processor to recover; we should not change this.

Okay, I'm just trying to understand the problem you're trying to solve.

Regards,
Bjorn
On Tue, Oct 01, 2024 at 02:15:52PM -0700, Bjorn Andersson wrote:
> That's not what I'm asking you. When the remote processor fails to stop,
> can you recover the system by just trying a bit later, or is the
> remoteproc dead until reboot?

I cannot say this with certainty. For the current issue, the remoteproc
fails to stop because it is running out of heap memory.

-Mukesh