[PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain_endio

zhangshida posted 3 patches 2 weeks, 1 day ago
There is a newer version of this series
[PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by zhangshida 2 weeks, 1 day ago
From: Shida Zhang <zhangshida@kylinos.cn>

Andreas point out that multiple completions can race setting
bi_status.

If __bio_chain_endio() is called concurrently from multiple threads
accessing the same parent bio, it should use WRITE_ONCE()/READ_ONCE()
to access parent->bi_status and avoid data races.

On x86 and ARM, these macros compile to the same instruction as a
normal write, but they may be required on other architectures to
prevent tearing, and to ensure the compiler does not add or remove
memory accesses under the assumption that the values are not accessed
concurrently.

Adopting a cmpxchg approach, as used in other code paths, resolves all
these issues, as suggested by Christoph.

Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
 block/bio.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cfb751dfcf5..51b57f9d8bd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
 
-	if (bio->bi_status && !parent->bi_status)
-		parent->bi_status = bio->bi_status;
+	if (bio->bi_status)
+		cmpxchg(&parent->bi_status, 0, bio->bi_status);
+
 	bio_put(bio);
 	return parent;
 }
-- 
2.34.1
Re: [PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Andreas Gruenbacher 2 weeks, 1 day ago
On Thu, Dec 4, 2025 at 3:48 AM zhangshida <starzhangzsd@gmail.com> wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Andreas point out that multiple completions can race setting
> bi_status.

What I've actually  pointed out is that the '!parent->bi_status' check
in this statement is an unnecessary optimization that can be removed.
But this is not what this discussion is mainly about anymore.

In the current code, multiple completions can race setting bi_status,
but that is fine as long as bi_status is never set to 0 during bio
completion. The effect is that when there are multiple errors, the
bi_error field of the final bio will eventually be set to an error
code, but we don't know which error code will win. This all works
correctly today, and there is no race to fix because the race is
intentional.

> If __bio_chain_endio() is called concurrently from multiple threads
> accessing the same parent bio, it should use WRITE_ONCE()/READ_ONCE()
> to access parent->bi_status and avoid data races.
>
> On x86 and ARM, these macros compile to the same instruction as a
> normal write, but they may be required on other architectures to
> prevent tearing, and to ensure the compiler does not add or remove
> memory accesses under the assumption that the values are not accessed
> concurrently.

WRITE_ONCE() and READ_ONCE() also prevent the compiler from reordering
operations. Even when the compiler doesn't seem to do anything nasty
at the moment, it would probably still be worthwhile to use
WRITE_ONCE() for setting bi_status throughout the code. But that's
beyond the scope of this patch, and it calls for more than a global
search and replace job.

> Adopting a cmpxchg approach, as used in other code paths, resolves all
> these issues, as suggested by Christoph.

No, the cmpxchg() doesn't actually achieve anything, it only makes
things worse. For example, when there is an A -> B chain, we can end
up with the following sequence of events:

  - A fails, sets A->bi_status, and calls bio_endio(A).
  - B->status is still 0, so bio_endio(A) sets B->bi_status to A->bi_status.
  - B fails and sets B->bi_status, OVERRIDING the value of A->bi_status.
  - bio_endio(B) calls B->bi_end_io().

Things get worse in an A -> B -> C chain, but I've already mentioned
that earlier in this thread.

So again, the cmpxchg() is unnecessary, but it is also harmless
because it suggests that there is some form of synchronization that
doesn't exist. The btrfs code that the cmpxchg() was taken from seems
to implement actual first-failure-wins semantics, but this patch does
not.

The underlying question here is whether we want to change things so
that bi_status is set to the first error that occurs (probably first
in time, not first in the chain). If that is the goal, then we should
be explicit about it. Right now, I don't see the need.

Thanks,
Andreas

> Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> ---
>  block/bio.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index cfb751dfcf5..51b57f9d8bd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>  {
>         struct bio *parent = bio->bi_private;
>
> -       if (bio->bi_status && !parent->bi_status)
> -               parent->bi_status = bio->bi_status;
> +       if (bio->bi_status)
> +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> +
>         bio_put(bio);
>         return parent;
>  }
> --
> 2.34.1
>
Re: [PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Stephen Zhang 2 weeks ago
Andreas Gruenbacher <agruenba@redhat.com> 于2025年12月4日周四 20:01写道:
>
> On Thu, Dec 4, 2025 at 3:48 AM zhangshida <starzhangzsd@gmail.com> wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Andreas point out that multiple completions can race setting
> > bi_status.
>
> What I've actually  pointed out is that the '!parent->bi_status' check
> in this statement is an unnecessary optimization that can be removed.
> But this is not what this discussion is mainly about anymore.
>
> In the current code, multiple completions can race setting bi_status,
> but that is fine as long as bi_status is never set to 0 during bio
> completion. The effect is that when there are multiple errors, the
> bi_error field of the final bio will eventually be set to an error
> code, but we don't know which error code will win. This all works
> correctly today, and there is no race to fix because the race is
> intentional.
>
> > If __bio_chain_endio() is called concurrently from multiple threads
> > accessing the same parent bio, it should use WRITE_ONCE()/READ_ONCE()
> > to access parent->bi_status and avoid data races.
> >
> > On x86 and ARM, these macros compile to the same instruction as a
> > normal write, but they may be required on other architectures to
> > prevent tearing, and to ensure the compiler does not add or remove
> > memory accesses under the assumption that the values are not accessed
> > concurrently.
>
> WRITE_ONCE() and READ_ONCE() also prevent the compiler from reordering
> operations. Even when the compiler doesn't seem to do anything nasty
> at the moment, it would probably still be worthwhile to use
> WRITE_ONCE() for setting bi_status throughout the code. But that's
> beyond the scope of this patch, and it calls for more than a global
> search and replace job.
>
> > Adopting a cmpxchg approach, as used in other code paths, resolves all
> > these issues, as suggested by Christoph.
>
> No, the cmpxchg() doesn't actually achieve anything, it only makes
> things worse. For example, when there is an A -> B chain, we can end
> up with the following sequence of events:
>
>   - A fails, sets A->bi_status, and calls bio_endio(A).
>   - B->status is still 0, so bio_endio(A) sets B->bi_status to A->bi_status.
>   - B fails and sets B->bi_status, OVERRIDING the value of A->bi_status.
>   - bio_endio(B) calls B->bi_end_io().
>
> Things get worse in an A -> B -> C chain, but I've already mentioned
> that earlier in this thread.
>
> So again, the cmpxchg() is unnecessary, but it is also harmless
> because it suggests that there is some form of synchronization that
> doesn't exist. The btrfs code that the cmpxchg() was taken from seems
> to implement actual first-failure-wins semantics, but this patch does
> not.
>
> The underlying question here is whether we want to change things so
> that bi_status is set to the first error that occurs (probably first
> in time, not first in the chain). If that is the goal, then we should
> be explicit about it. Right now, I don't see the need.
>

Thank you for the thorough discussion on this matter. Based on my
understanding, there appear to be two main viewpoints:

Side 1 (S1) argues:
A. Using cmpxchg() addresses all the issues that
READ_ONCE()/WRITE_ONCE() can solve. That said, cmpxchg()
appears to provide a comprehensive solution.
B. While cmpxchg() introduces some additional overhead, this may not
be significant since it occurs in a slow error-handling path.
Or nondeterministic order of overriding, Which is not a problem.

Side 2 (S2) argues:
A. While READ_ONCE()/WRITE_ONCE()-style protections may be
needed, addressing them in this specific patch may not be appropriate.
B. The cmpxchg() approach adds unnecessary complexity without clear
benefits.

The trade-off between S1B and S2B seems somewhat balanced, with
no clear winner.
However, regarding S1A versus S2A, I wonder:

If we agree that READ_ONCE()/WRITE_ONCE() protections are required
here, is there a specific reason we couldn't address just this code snippet
in this patch? We could then follow the S1 approach and use cmpxchg().

However, If this change remains controversial, I'd be happy to drop it and
resend v6 without this modification.

Thanks,
Shida





> Thanks,
> Andreas
>
> > Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
> > Suggested-by: Christoph Hellwig <hch@infradead.org>
> > Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> > ---
> >  block/bio.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index cfb751dfcf5..51b57f9d8bd 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >  {
> >         struct bio *parent = bio->bi_private;
> >
> > -       if (bio->bi_status && !parent->bi_status)
> > -               parent->bi_status = bio->bi_status;
> > +       if (bio->bi_status)
> > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > +
> >         bio_put(bio);
> >         return parent;
> >  }
> > --
> > 2.34.1
> >
>