firmware: stratix10-rsu: avoid RSU mutex deadlock on update

[PATCH] firmware: stratix10-rsu: avoid RSU mutex deadlock on update

Posted by tze.yee.ng@altera.com 1 month, 2 weeks ago

From: Tze Yee Ng <tze.yee.ng@altera.com>

Use mutex_trylock() in rsu_send_msg() instead of mutex_lock() so callers
that already hold priv->lock or race with an in-flight RSU operation get
-EAGAIN instead of blocking forever.

In reboot_image_store(), treat -EAGAIN as a no-op success so sysfs write
does not hang when the RSU path is busy.

Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
 drivers/firmware/stratix10-rsu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/stratix10-rsu.c b/drivers/firmware/stratix10-rsu.c
index e1912108a0fe..f3ee0b4e49a8 100644
--- a/drivers/firmware/stratix10-rsu.c
+++ b/drivers/firmware/stratix10-rsu.c
@@ -254,7 +254,13 @@ static void rsu_async_get_spt_table_callback(struct device *dev,
  * is necessary to get RSU boot log or set the address of bitstream to
  * boot after reboot.
  *
- * Returns 0 on success or -ETIMEDOUT on error.
+ * Return:
+ * * 0 on success
+ * * -EAGAIN if the driver mutex could not be acquired
+ * * a negative errno from stratix10_svc_send() on failure to send the request
+ * * -ETIMEDOUT if waiting for the SMC callback times out
+ * * a negative errno from wait_for_completion_interruptible_timeout() if
+ *   interrupted or otherwise signaled while waiting (for example -ERESTARTSYS)
  */
 static int rsu_send_msg(struct stratix10_rsu_priv *priv,
 			enum stratix10_svc_command_code command,
@@ -264,7 +270,8 @@ static int rsu_send_msg(struct stratix10_rsu_priv *priv,
 	struct stratix10_svc_client_msg msg;
 	int ret;
 
-	mutex_lock(&priv->lock);
+	if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
 	reinit_completion(&priv->completion);
 	priv->client.receive_cb = callback;
 
@@ -597,7 +604,9 @@ static ssize_t reboot_image_store(struct device *dev,
 
 	ret = rsu_send_msg(priv, COMMAND_RSU_UPDATE,
 			   address, rsu_command_callback);
-	if (ret) {
+	if (ret == -EAGAIN) {
+		return 0;
+	} else if (ret) {
 		dev_err(dev, "Error, RSU update returned %i\n", ret);
 		return ret;
 	}
-- 
2.43.7

Re: [PATCH] firmware: stratix10-rsu: avoid RSU mutex deadlock on update

Posted by Dinh Nguyen 1 month, 2 weeks ago

 From sashiko claude-sonnet-4-6,

On 5/4/26 21:51, tze.yee.ng@altera.com wrote:
> From: Tze Yee Ng <tze.yee.ng@altera.com>
> 
> Use mutex_trylock() in rsu_send_msg() instead of mutex_lock() so callers
> that already hold priv->lock or race with an in-flight RSU operation get
> -EAGAIN instead of blocking forever.
> 
> In reboot_image_store(), treat -EAGAIN as a no-op success so sysfs write
> does not hang when the RSU path is busy.
> 
> Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
> ---
>   drivers/firmware/stratix10-rsu.c | 15 ++++++++++++---
>   1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/firmware/stratix10-rsu.c b/drivers/firmware/stratix10-rsu.c
> index e1912108a0fe..f3ee0b4e49a8 100644
> --- a/drivers/firmware/stratix10-rsu.c
> +++ b/drivers/firmware/stratix10-rsu.c
> @@ -254,7 +254,13 @@ static void rsu_async_get_spt_table_callback(struct device *dev,
>    * is necessary to get RSU boot log or set the address of bitstream to
>    * boot after reboot.
>    *
> - * Returns 0 on success or -ETIMEDOUT on error.
> + * Return:
> + * * 0 on success
> + * * -EAGAIN if the driver mutex could not be acquired
> + * * a negative errno from stratix10_svc_send() on failure to send the request
> + * * -ETIMEDOUT if waiting for the SMC callback times out
> + * * a negative errno from wait_for_completion_interruptible_timeout() if
> + *   interrupted or otherwise signaled while waiting (for example -ERESTARTSYS)
>    */
>   static int rsu_send_msg(struct stratix10_rsu_priv *priv,
>   			enum stratix10_svc_command_code command,
> @@ -264,7 +270,8 @@ static int rsu_send_msg(struct stratix10_rsu_priv *priv,
>   	struct stratix10_svc_client_msg msg;
>   	int ret;
>   
> -	mutex_lock(&priv->lock);
> +	if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;

The commit message says this prevents deadlocks from "callers that
already hold priv->lock." Does such a caller actually exist?

Looking at the driver, priv->lock is initialized in
stratix10_rsu_probe() and is acquired only inside rsu_send_msg()
itself. The completion callbacks (rsu_command_callback and friends)
call complete() without taking priv->lock, so there is no reentrant
acquisition path through callbacks either.

Two concurrent sysfs writes would serialize correctly under the
original mutex_lock() — the second writer would block until the first
completed. That is the expected behavior for a one-shot firmware
configuration command, not a deadlock.

>   	reinit_completion(&priv->completion);
>   	priv->client.receive_cb = callback;
>   
> @@ -597,7 +604,9 @@ static ssize_t reboot_image_store(struct device *dev,
>   
>   	ret = rsu_send_msg(priv, COMMAND_RSU_UPDATE,
>   			   address, rsu_command_callback);
> -	if (ret) {
> +	if (ret == -EAGAIN) {
> +		return 0;

Two concerns here.

First, does returning 0 violate the sysfs store callback contract?
Sysfs store handlers are expected to return the number of bytes
consumed on success, or a negative errno on error. Returning 0 tells
kernfs that zero bytes were consumed, which is a short write. A
POSIX-compliant write loop seeing 0 bytes returned on a non-zero
write will retry indefinitely. The intended silent no-op should
return count, not 0.

Second, when mutex_trylock fails, COMMAND_RSU_UPDATE is never sent
to the firmware, yet reboot_image_store() returns without an error.
The operator's request to change the persistent boot image is
silently dropped with no log message and no error to userspace.

On the next reboot the device will boot the old image rather than
the one that was written to the attribute.

For a one-shot persistent firmware state change, is silently
discarding the request the right tradeoff? Blocking in a sysfs
store callback is permitted, so keeping the original mutex_lock()
would let concurrent writers serialize correctly without any data
loss.

Dinh

Re: [PATCH] firmware: stratix10-rsu: avoid RSU mutex deadlock on update

Posted by NG, TZE YEE 1 month, 1 week ago

On 6/5/2026 7:40 am, Dinh Nguyen wrote:
>  From sashiko claude-sonnet-4-6,
> 
> On 5/4/26 21:51, tze.yee.ng@altera.com wrote:
>> From: Tze Yee Ng <tze.yee.ng@altera.com>
>>
>> Use mutex_trylock() in rsu_send_msg() instead of mutex_lock() so callers
>> that already hold priv->lock or race with an in-flight RSU operation get
>> -EAGAIN instead of blocking forever.
>>
>> In reboot_image_store(), treat -EAGAIN as a no-op success so sysfs write
>> does not hang when the RSU path is busy.
>>
>> Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
>> ---
>>   drivers/firmware/stratix10-rsu.c | 15 ++++++++++++---
>>   1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/firmware/stratix10-rsu.c b/drivers/firmware/ 
>> stratix10-rsu.c
>> index e1912108a0fe..f3ee0b4e49a8 100644
>> --- a/drivers/firmware/stratix10-rsu.c
>> +++ b/drivers/firmware/stratix10-rsu.c
>> @@ -254,7 +254,13 @@ static void 
>> rsu_async_get_spt_table_callback(struct device *dev,
>>    * is necessary to get RSU boot log or set the address of bitstream to
>>    * boot after reboot.
>>    *
>> - * Returns 0 on success or -ETIMEDOUT on error.
>> + * Return:
>> + * * 0 on success
>> + * * -EAGAIN if the driver mutex could not be acquired
>> + * * a negative errno from stratix10_svc_send() on failure to send 
>> the request
>> + * * -ETIMEDOUT if waiting for the SMC callback times out
>> + * * a negative errno from 
>> wait_for_completion_interruptible_timeout() if
>> + *   interrupted or otherwise signaled while waiting (for example - 
>> ERESTARTSYS)
>>    */
>>   static int rsu_send_msg(struct stratix10_rsu_priv *priv,
>>               enum stratix10_svc_command_code command,
>> @@ -264,7 +270,8 @@ static int rsu_send_msg(struct stratix10_rsu_priv 
>> *priv,
>>       struct stratix10_svc_client_msg msg;
>>       int ret;
>> -    mutex_lock(&priv->lock);
>> +    if (!mutex_trylock(&priv->lock))
>> +        return -EAGAIN;
> 
> The commit message says this prevents deadlocks from "callers that
> already hold priv->lock." Does such a caller actually exist?
> 
> Looking at the driver, priv->lock is initialized in
> stratix10_rsu_probe() and is acquired only inside rsu_send_msg()
> itself. The completion callbacks (rsu_command_callback and friends)
> call complete() without taking priv->lock, so there is no reentrant
> acquisition path through callbacks either.
> 
> Two concurrent sysfs writes would serialize correctly under the
> original mutex_lock() — the second writer would block until the first
> completed. That is the expected behavior for a one-shot firmware
> configuration command, not a deadlock.
> 
> 
You are right, there is no recursive deadlock path from a caller already 
holding priv->lock, and that wording in my commit message is inaccurate.
The issue I am trying to address is a potential indefinite or blocking 
wait in practice when an RSU transaction does not complete as expected, 
which is observed as a sysfs write hang. In that situation, subsequent 
sysfs writes queue behind mutex_lock() and appear hung from user space.
So the intent is to make reboot_image_store() non-blocking when RSU is 
already busy by returning early on -EAGAIN, not to fix lock recursion.
I will correct the commit wording in v2 patch.>>       
reinit_completion(&priv->completion);
>>       priv->client.receive_cb = callback;
>> @@ -597,7 +604,9 @@ static ssize_t reboot_image_store(struct device *dev,
>>       ret = rsu_send_msg(priv, COMMAND_RSU_UPDATE,
>>                  address, rsu_command_callback);
>> -    if (ret) {
>> +    if (ret == -EAGAIN) {
>> +        return 0;
> 
> Two concerns here.
> 
> First, does returning 0 violate the sysfs store callback contract?
> Sysfs store handlers are expected to return the number of bytes
> consumed on success, or a negative errno on error. Returning 0 tells
> kernfs that zero bytes were consumed, which is a short write. A
> POSIX-compliant write loop seeing 0 bytes returned on a non-zero
> write will retry indefinitely. The intended silent no-op should
> return count, not 0.
> 
> Second, when mutex_trylock fails, COMMAND_RSU_UPDATE is never sent
> to the firmware, yet reboot_image_store() returns without an error.
> The operator's request to change the persistent boot image is
> silently dropped with no log message and no error to userspace.
> 
> On the next reboot the device will boot the old image rather than
> the one that was written to the attribute.
> 
> For a one-shot persistent firmware state change, is silently
> discarding the request the right tradeoff? Blocking in a sysfs
> store callback is permitted, so keeping the original mutex_lock()
> would let concurrent writers serialize correctly without any data
> loss.
> 
> Dinh
I agree with both points.
Returning 0 from a sysfs store callback is incorrect here.
A successful sysfs store should return count; returning 0 can be treated 
as a short write and may cause retry loops in userspace.
I also agree that silently dropping COMMAND_RSU_UPDATE when 
mutex_trylock() fails is not acceptable for a one-shot persistent 
boot-image update.
I will drop the reboot_image_store() special-case no-op success on
-EAGAIN in v2.

Thanks,
Regards,
Tze Yee