[v1] RDMA/rxe: Fix race in do_task() when draining

[PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Gui-Dong Han 4 months, 3 weeks ago

When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
the TASK_STATE_DRAINING state that may have been concurrently set by
rxe_cleanup_task() or rxe_disable_task().

This race condition breaks the cleanup and disable logic, which expects
the task to stop processing new work. The cleanup code may proceed while
do_task() reschedules itself, leading to a potential use-after-free.

This bug was introduced during the migration from tasklets to workqueues,
where the special handling for the draining case was lost.

Fix this by restoring the original behavior. If the state is
TASK_STATE_DRAINING when iterations are exhausted, continue the loop by
setting cont to 1. This allows new iterations to finish the remaining
work and reach the switch statement, which properly transitions the
state to TASK_STATE_DRAINED and stops the task as intended.

Fixes: 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_task.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 6f8f353e9583..f522820b950c 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -132,8 +132,12 @@ static void do_task(struct rxe_task *task)
 		 * yield the cpu and reschedule the task
 		 */
 		if (!ret) {
-			task->state = TASK_STATE_IDLE;
-			resched = 1;
+			if (task->state != TASK_STATE_DRAINING) {
+				task->state = TASK_STATE_IDLE;
+				resched = 1;
+			} else {
+				cont = 1;
+			}
 			goto exit;
 		}
 
-- 
2.25.1

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by yanjun.zhu 4 months, 3 weeks ago

On 9/17/25 3:06 AM, Gui-Dong Han wrote:
> When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally

 From the source code, it will check ret value, then set it to 
TASK_STATE_IDLE, not unconditionally.

> sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
> the TASK_STATE_DRAINING state that may have been concurrently set by
> rxe_cleanup_task() or rxe_disable_task().

 From the source code, there is a spin lock to protect the state. It 
will not make race condition.

> 
> This race condition breaks the cleanup and disable logic, which expects
> the task to stop processing new work. The cleanup code may proceed while
> do_task() reschedules itself, leading to a potential use-after-free.
> 

Can you post the call trace when this problem occurred?

Hi, Jason && Leon

Please comment on this problem.

Thanks a lot.
Yanjun.Zhu

> This bug was introduced during the migration from tasklets to workqueues,
> where the special handling for the draining case was lost.
> 
> Fix this by restoring the original behavior. If the state is
> TASK_STATE_DRAINING when iterations are exhausted, continue the loop by
> setting cont to 1. This allows new iterations to finish the remaining
> work and reach the switch statement, which properly transitions the
> state to TASK_STATE_DRAINED and stops the task as intended.
> 
> Fixes: 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
> Cc: stable@vger.kernel.org
> Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_task.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> index 6f8f353e9583..f522820b950c 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -132,8 +132,12 @@ static void do_task(struct rxe_task *task)
>   		 * yield the cpu and reschedule the task
>   		 */
>   		if (!ret) {
> -			task->state = TASK_STATE_IDLE;
> -			resched = 1;
> +			if (task->state != TASK_STATE_DRAINING) {
> +				task->state = TASK_STATE_IDLE;
> +				resched = 1;
> +			} else {
> +				cont = 1;
> +			}
>   			goto exit;
>   		}
>

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Leon Romanovsky 4 months, 3 weeks ago

On Wed, Sep 17, 2025 at 12:30:56PM -0700, yanjun.zhu wrote:
> On 9/17/25 3:06 AM, Gui-Dong Han wrote:
> > When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
> 
> From the source code, it will check ret value, then set it to
> TASK_STATE_IDLE, not unconditionally.
> 
> > sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
> > the TASK_STATE_DRAINING state that may have been concurrently set by
> > rxe_cleanup_task() or rxe_disable_task().
> 
> From the source code, there is a spin lock to protect the state. It will not
> make race condition.
> 
> > 
> > This race condition breaks the cleanup and disable logic, which expects
> > the task to stop processing new work. The cleanup code may proceed while
> > do_task() reschedules itself, leading to a potential use-after-free.
> > 
> 
> Can you post the call trace when this problem occurred?
> 
> Hi, Jason && Leon
> 
> Please comment on this problem.

The idea to recheck task->state looks correct to me, otherwise we overwrite it unconditionally.
However I would write this patch slightly different (without cont = 1):

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 6f8f353e95838..2ff5d7cc0a933 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -132,8 +132,10 @@ static void do_task(struct rxe_task *task)
                 * yield the cpu and reschedule the task
                 */
                if (!ret) {
-                       task->state = TASK_STATE_IDLE;
-                       resched = 1;
+                       if (task->state != TASK_STATE_DRAINING) {
+                               task->state = TASK_STATE_IDLE;
+                               resched = 1;
+                       }
                        goto exit;
                }

@@ -151,7 +153,6 @@ static void do_task(struct rxe_task *task)
                        break;

                case TASK_STATE_DRAINING:
-                       task->state = TASK_STATE_DRAINED;
                        break;

                default:
(END)



> 
> Thanks a lot.
> Yanjun.Zhu
> 
> > This bug was introduced during the migration from tasklets to workqueues,
> > where the special handling for the draining case was lost.
> > 
> > Fix this by restoring the original behavior. If the state is
> > TASK_STATE_DRAINING when iterations are exhausted, continue the loop by
> > setting cont to 1. This allows new iterations to finish the remaining
> > work and reach the switch statement, which properly transitions the
> > state to TASK_STATE_DRAINED and stops the task as intended.
> > 
> > Fixes: 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
> > ---
> >   drivers/infiniband/sw/rxe/rxe_task.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> > index 6f8f353e9583..f522820b950c 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_task.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> > @@ -132,8 +132,12 @@ static void do_task(struct rxe_task *task)
> >   		 * yield the cpu and reschedule the task
> >   		 */
> >   		if (!ret) {
> > -			task->state = TASK_STATE_IDLE;
> > -			resched = 1;
> > +			if (task->state != TASK_STATE_DRAINING) {
> > +				task->state = TASK_STATE_IDLE;
> > +				resched = 1;
> > +			} else {
> > +				cont = 1;
> > +			}
> >   			goto exit;
> >   		}
>

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Gui-Dong Han 4 months, 3 weeks ago

On Thu, Sep 18, 2025 at 5:58 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Sep 17, 2025 at 12:30:56PM -0700, yanjun.zhu wrote:
> > On 9/17/25 3:06 AM, Gui-Dong Han wrote:
> > > When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
> >
> > From the source code, it will check ret value, then set it to
> > TASK_STATE_IDLE, not unconditionally.
> >
> > > sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
> > > the TASK_STATE_DRAINING state that may have been concurrently set by
> > > rxe_cleanup_task() or rxe_disable_task().
> >
> > From the source code, there is a spin lock to protect the state. It will not
> > make race condition.
> >
> > >
> > > This race condition breaks the cleanup and disable logic, which expects
> > > the task to stop processing new work. The cleanup code may proceed while
> > > do_task() reschedules itself, leading to a potential use-after-free.
> > >
> >
> > Can you post the call trace when this problem occurred?
> >
> > Hi, Jason && Leon
> >
> > Please comment on this problem.
>
> The idea to recheck task->state looks correct to me, otherwise we overwrite it unconditionally.
> However I would write this patch slightly different (without cont = 1):
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> index 6f8f353e95838..2ff5d7cc0a933 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -132,8 +132,10 @@ static void do_task(struct rxe_task *task)
>                  * yield the cpu and reschedule the task
>                  */
>                 if (!ret) {
> -                       task->state = TASK_STATE_IDLE;
> -                       resched = 1;
> +                       if (task->state != TASK_STATE_DRAINING) {
> +                               task->state = TASK_STATE_IDLE;
> +                               resched = 1;
> +                       }
>                         goto exit;
>                 }
>
> @@ -151,7 +153,6 @@ static void do_task(struct rxe_task *task)
>                         break;
>
>                 case TASK_STATE_DRAINING:
> -                       task->state = TASK_STATE_DRAINED;
>                         break;
>
>                 default:
> (END)

Hi Leon,

Thanks for your review and for confirming the need for a fix.

Regarding your suggested patch, I believe removing the transition to
TASK_STATE_DRAINED would cause an issue. As seen in the code and comments
for rxe_cleanup_task() and is_done(), the cleanup process waits for the
final TASK_STATE_DRAINED state. If the task remains stuck in DRAINING,
the cleanup loop will never terminate.

My use of cont = 1 was intended as a minimal change. Since this
regression was introduced during the migration from tasklets, restoring
the pre-migration logic seemed like a reasonable approach. An alternative
could be to set the state to TASK_STATE_DRAINED directly inside the
if (!ret) block, and I am open to discussing the best fix.

Regards,
Han

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Leon Romanovsky 4 months, 3 weeks ago

On Thu, Sep 18, 2025 at 08:02:04PM +0800, Gui-Dong Han wrote:
> On Thu, Sep 18, 2025 at 5:58 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Sep 17, 2025 at 12:30:56PM -0700, yanjun.zhu wrote:
> > > On 9/17/25 3:06 AM, Gui-Dong Han wrote:
> > > > When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
> > >
> > > From the source code, it will check ret value, then set it to
> > > TASK_STATE_IDLE, not unconditionally.
> > >
> > > > sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
> > > > the TASK_STATE_DRAINING state that may have been concurrently set by
> > > > rxe_cleanup_task() or rxe_disable_task().
> > >
> > > From the source code, there is a spin lock to protect the state. It will not
> > > make race condition.
> > >
> > > >
> > > > This race condition breaks the cleanup and disable logic, which expects
> > > > the task to stop processing new work. The cleanup code may proceed while
> > > > do_task() reschedules itself, leading to a potential use-after-free.
> > > >
> > >
> > > Can you post the call trace when this problem occurred?
> > >
> > > Hi, Jason && Leon
> > >
> > > Please comment on this problem.
> >
> > The idea to recheck task->state looks correct to me, otherwise we overwrite it unconditionally.
> > However I would write this patch slightly different (without cont = 1):
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> > index 6f8f353e95838..2ff5d7cc0a933 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_task.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> > @@ -132,8 +132,10 @@ static void do_task(struct rxe_task *task)
> >                  * yield the cpu and reschedule the task
> >                  */
> >                 if (!ret) {
> > -                       task->state = TASK_STATE_IDLE;
> > -                       resched = 1;
> > +                       if (task->state != TASK_STATE_DRAINING) {
> > +                               task->state = TASK_STATE_IDLE;
> > +                               resched = 1;
> > +                       }
> >                         goto exit;
> >                 }
> >
> > @@ -151,7 +153,6 @@ static void do_task(struct rxe_task *task)
> >                         break;
> >
> >                 case TASK_STATE_DRAINING:
> > -                       task->state = TASK_STATE_DRAINED;
> >                         break;
> >
> >                 default:
> > (END)
> 
> Hi Leon,
> 
> Thanks for your review and for confirming the need for a fix.
> 
> Regarding your suggested patch, I believe removing the transition to
> TASK_STATE_DRAINED would cause an issue. As seen in the code and comments
> for rxe_cleanup_task() and is_done(), the cleanup process waits for the
> final TASK_STATE_DRAINED state. If the task remains stuck in DRAINING,
> the cleanup loop will never terminate.
> 
> My use of cont = 1 was intended as a minimal change. Since this
> regression was introduced during the migration from tasklets, restoring
> the pre-migration logic seemed like a reasonable approach. An alternative
> could be to set the state to TASK_STATE_DRAINED directly inside the
> if (!ret) block, and I am open to discussing the best fix.

Ahh, sorry, I misread that it is TASK_STATE_DRAINED and not TASK_STATE_DRAINING.

Thanks

> 
> Regards,
> Han

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Gui-Dong Han 4 months, 3 weeks ago

On Thu, Sep 18, 2025 at 3:31 AM yanjun.zhu <yanjun.zhu@linux.dev> wrote:
>
> On 9/17/25 3:06 AM, Gui-Dong Han wrote:
> > When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
>
>  From the source code, it will check ret value, then set it to
> TASK_STATE_IDLE, not unconditionally.

Hi Yanjun,

Thanks for your review. Let me clarify a few points.

You are correct that the code checks the ret value. The if (!ret)
branch specifically handles the case where the RXE_MAX_ITERATIONS
limit is reached while work still remains. My use of "unconditionally"
refers to the action inside this branch, which sets the state to
TASK_STATE_IDLE without a secondary check on task->state. The original
tasklet implementation effectively checked both conditions in this
scenario.

>
> > sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
> > the TASK_STATE_DRAINING state that may have been concurrently set by
> > rxe_cleanup_task() or rxe_disable_task().
>
>  From the source code, there is a spin lock to protect the state. It
> will not make race condition.

While a spinlock protects state changes, rxe_cleanup_task() and
rxe_disable_task() do not hold it for its entire duration. It acquires
the lock to set TASK_STATE_DRAINING, but then releases it to wait in
the while (!is_done(task)) loop. The race window exists when do_task()
acquires the lock during this wait period, allowing it to overwrite
the TASK_STATE_DRAINING state.

>
> >
> > This race condition breaks the cleanup and disable logic, which expects
> > the task to stop processing new work. The cleanup code may proceed while
> > do_task() reschedules itself, leading to a potential use-after-free.
> >
>
> Can you post the call trace when this problem occurred?

This issue was identified through code inspection and a static
analysis tool we are developing to detect TOCTOU bugs in the kernel,
so I do not have a runtime call trace. The bug is confirmed by
inspecting the Fixes commit (9b4b7c1f9f54), which lost the special
handling for the draining case during the migration from tasklets to
workqueues.

Regards,
Han

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Yanjun.Zhu 4 months, 3 weeks ago

On 9/17/25 7:21 PM, Gui-Dong Han wrote:
> On Thu, Sep 18, 2025 at 3:31 AM yanjun.zhu <yanjun.zhu@linux.dev> wrote:
>> On 9/17/25 3:06 AM, Gui-Dong Han wrote:
>>> When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
>>   From the source code, it will check ret value, then set it to
>> TASK_STATE_IDLE, not unconditionally.
> Hi Yanjun,
>
> Thanks for your review. Let me clarify a few points.
>
> You are correct that the code checks the ret value. The if (!ret)
> branch specifically handles the case where the RXE_MAX_ITERATIONS
> limit is reached while work still remains. My use of "unconditionally"
> refers to the action inside this branch, which sets the state to
> TASK_STATE_IDLE without a secondary check on task->state. The original
> tasklet implementation effectively checked both conditions in this
> scenario.
>
>>> sets the task state to TASK_STATE_IDLE to reschedule. This overwrites
>>> the TASK_STATE_DRAINING state that may have been concurrently set by
>>> rxe_cleanup_task() or rxe_disable_task().
>>   From the source code, there is a spin lock to protect the state. It
>> will not make race condition.
> While a spinlock protects state changes, rxe_cleanup_task() and
> rxe_disable_task() do not hold it for its entire duration. It acquires
> the lock to set TASK_STATE_DRAINING, but then releases it to wait in
> the while (!is_done(task)) loop. The race window exists when do_task()
> acquires the lock during this wait period, allowing it to overwrite
> the TASK_STATE_DRAINING state.
>
>>> This race condition breaks the cleanup and disable logic, which expects
>>> the task to stop processing new work. The cleanup code may proceed while
>>> do_task() reschedules itself, leading to a potential use-after-free.
>>>
>> Can you post the call trace when this problem occurred?
> This issue was identified through code inspection and a static
> analysis tool we are developing to detect TOCTOU bugs in the kernel,
> so I do not have a runtime call trace. The bug is confirmed by
> inspecting the Fixes commit (9b4b7c1f9f54), which lost the special
> handling for the draining case during the migration from tasklets to
> workqueues.
Thanks a lot for your detailed explanations. Could you update the commit 
logs to reflect the points you explained above?

The current commit logs are a bit confusing, but your explanation makes 
everything clear. If you rewrite the logs with that context, other 
reviewers will be able to understand your intent directly from the 
commit message, without needing the extra explanation. That would make 
the commit much clearer.

Any way,

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>
> Regards,
> Han

Re: [PATCH] RDMA/rxe: Fix race in do_task() when draining

Posted by Gui-Dong Han 4 months, 3 weeks ago

> Thanks a lot for your detailed explanations. Could you update the commit
> logs to reflect the points you explained above?

Hi Yanjun,

Fixed in the patch v2.
Thanks!

Regards,
Han