[PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio

zhangshida posted 3 patches 2 months, 1 week ago
There is a newer version of this series
[PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by zhangshida 2 months, 1 week ago
From: Shida Zhang <zhangshida@kylinos.cn>

Andreas point out that multiple completions can race setting
bi_status.

The check (parent->bi_status) and the subsequent write are not an
atomic operation. The value of parent->bi_status could have changed
between the time you read it for the if check and the time you write
to it. So we use cmpxchg to fix the race, as suggested by Christoph.

Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
 block/bio.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1b5e4577f4c..097c1cd2054 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
 
-	if (bio->bi_status && !parent->bi_status)
-		parent->bi_status = bio->bi_status;
+	if (bio->bi_status)
+		cmpxchg(&parent->bi_status, 0, bio->bi_status);
+
 	bio_put(bio);
 	return parent;
 }
-- 
2.34.1
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Andreas Gruenbacher 2 months, 1 week ago
On Mon, Dec 1, 2025 at 10:05 AM zhangshida <starzhangzsd@gmail.com> wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Andreas point out that multiple completions can race setting
> bi_status.
>
> The check (parent->bi_status) and the subsequent write are not an
> atomic operation. The value of parent->bi_status could have changed
> between the time you read it for the if check and the time you write
> to it. So we use cmpxchg to fix the race, as suggested by Christoph.
>
> Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> ---
>  block/bio.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 1b5e4577f4c..097c1cd2054 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>  {
>         struct bio *parent = bio->bi_private;
>
> -       if (bio->bi_status && !parent->bi_status)
> -               parent->bi_status = bio->bi_status;
> +       if (bio->bi_status)
> +               cmpxchg(&parent->bi_status, 0, bio->bi_status);

Hmm. I don't think cmpxchg() actually is of any value here: for all
the chained bios, bi_status is initialized to 0, and it is only set
again (to a non-0 value) when a failure occurs. When there are
multiple failures, we only need to make sure that one of those
failures is eventually reported, but for that, a simple assignment is
enough here. The cmpxchg() won't guarantee that a specific error value
will survive; it all still depends on the timing. The cmpxchg() only
makes it look like something special is happening here with respect to
ordering.

> +
>         bio_put(bio);
>         return parent;
>  }
> --
> 2.34.1
>

Thanks,
Andreas
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Christoph Hellwig 2 months, 1 week ago
On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > -       if (bio->bi_status && !parent->bi_status)
> > -               parent->bi_status = bio->bi_status;
> > +       if (bio->bi_status)
> > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> 
> Hmm. I don't think cmpxchg() actually is of any value here: for all
> the chained bios, bi_status is initialized to 0, and it is only set
> again (to a non-0 value) when a failure occurs. When there are
> multiple failures, we only need to make sure that one of those
> failures is eventually reported, but for that, a simple assignment is
> enough here.

A simple assignment doesn't guarantee atomicy.  It also overrides
earlier with later status codes, which might not be desirable.
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Andreas Gruenbacher 2 months, 1 week ago
On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > -       if (bio->bi_status && !parent->bi_status)
> > > -               parent->bi_status = bio->bi_status;
> > > +       if (bio->bi_status)
> > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> >
> > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > the chained bios, bi_status is initialized to 0, and it is only set
> > again (to a non-0 value) when a failure occurs. When there are
> > multiple failures, we only need to make sure that one of those
> > failures is eventually reported, but for that, a simple assignment is
> > enough here.
>
> A simple assignment doesn't guarantee atomicy.

Well, we've already discussed that bi_status is a single byte and so
tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
enough here.

>  It also overrides earlier with later status codes, which might not be desirable.

In an A -> B bio chain, we have two competing bi_status writers:

  (1) when the A fails, B->bi_status will be updated using cmpxchg(),
  (2) when B fails, bi_status will be assigned a non-0 value.

In that scenario, B failures will always win over A failures.

In addition, when we have an A -> B -> C bio chain, we still won't get
"first failure wins" semantics because a failure of A will only be
propagated to C once B completes as well. To "fix" that, we'd have to
"chain" all bios to the same parent instead. But I don't think any of
that is really needed.

Andreas
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Christoph Hellwig 2 months, 1 week ago
On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > -       if (bio->bi_status && !parent->bi_status)
> > > > -               parent->bi_status = bio->bi_status;
> > > > +       if (bio->bi_status)
> > > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > >
> > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > the chained bios, bi_status is initialized to 0, and it is only set
> > > again (to a non-0 value) when a failure occurs. When there are
> > > multiple failures, we only need to make sure that one of those
> > > failures is eventually reported, but for that, a simple assignment is
> > > enough here.
> >
> > A simple assignment doesn't guarantee atomicy.
> 
> Well, we've already discussed that bi_status is a single byte and so
> tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> enough here.

No.  At least older alpha can tear byte updates as they need a
read-modify-write cycle.  But even on normal x86 the check and the update
would be racy.  The cmpxchg makes the intentions very clear, works
everywhere and given it only happens in the error path does not create
any fast path overhead.

Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Andreas Gruenbacher 2 months, 1 week ago
On Tue, Dec 2, 2025 at 6:48 AM Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> > On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > > -       if (bio->bi_status && !parent->bi_status)
> > > > > -               parent->bi_status = bio->bi_status;
> > > > > +       if (bio->bi_status)
> > > > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > > >
> > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > again (to a non-0 value) when a failure occurs. When there are
> > > > multiple failures, we only need to make sure that one of those
> > > > failures is eventually reported, but for that, a simple assignment is
> > > > enough here.
> > >
> > > A simple assignment doesn't guarantee atomicy.
> >
> > Well, we've already discussed that bi_status is a single byte and so
> > tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> > enough here.
>
> No.  At least older alpha can tear byte updates as they need a
> read-modify-write cycle.

I know this used to be a thing in the past, but to see that none of
that is relevant anymore today, have a look at where [*] quotes the
C11 standard:

        memory location
                either an object of scalar type, or a maximal sequence
                of adjacent bit-fields all having nonzero width

                NOTE 1: Two threads of execution can update and access
                separate memory locations without interfering with
                each other.

                NOTE 2: A bit-field and an adjacent non-bit-field member
                are in separate memory locations. The same applies
                to two bit-fields, if one is declared inside a nested
                structure declaration and the other is not, or if the two
                are separated by a zero-length bit-field declaration,
                or if they are separated by a non-bit-field member
                declaration. It is not safe to concurrently update two
                bit-fields in the same structure if all members declared
                between them are also bit-fields, no matter what the
                sizes of those intervening bit-fields happen to be.

[*] Documentation/memory-barriers.txt

> But even on normal x86 the check and the update would be racy.

There is no check and update (RMW), though. Quoting what I wrote
earlier in this thread:

On Mon, Dec 1, 2025 at 11:22 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> Hmm. I don't think cmpxchg() actually is of any value here: for all
> the chained bios, bi_status is initialized to 0, and it is only set
> again (to a non-0 value) when a failure occurs. When there are
> multiple failures, we only need to make sure that one of those
> failures is eventually reported, but for that, a simple assignment is
> enough here. The cmpxchg() won't guarantee that a specific error value
> will survive; it all still depends on the timing. The cmpxchg() only
> makes it look like something special is happening here with respect to
> ordering.

So with or without the cmpxchg(), if there are multiple errors, we
won't know which bi_status code will survive, but we do know that we
will end up with one of those error codes.

Andreas
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Stephen Zhang 2 months, 1 week ago
Andreas Gruenbacher <agruenba@redhat.com> 于2025年12月3日周三 05:15写道:
>
> On Tue, Dec 2, 2025 at 6:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> > > On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > > > -       if (bio->bi_status && !parent->bi_status)
> > > > > > -               parent->bi_status = bio->bi_status;
> > > > > > +       if (bio->bi_status)
> > > > > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > > > >
> > > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > > again (to a non-0 value) when a failure occurs. When there are
> > > > > multiple failures, we only need to make sure that one of those
> > > > > failures is eventually reported, but for that, a simple assignment is
> > > > > enough here.
> > > >
> > > > A simple assignment doesn't guarantee atomicy.
> > >
> > > Well, we've already discussed that bi_status is a single byte and so
> > > tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> > > enough here.
> >
> > No.  At least older alpha can tear byte updates as they need a
> > read-modify-write cycle.
>
> I know this used to be a thing in the past, but to see that none of
> that is relevant anymore today, have a look at where [*] quotes the
> C11 standard:
>
>         memory location
>                 either an object of scalar type, or a maximal sequence
>                 of adjacent bit-fields all having nonzero width
>
>                 NOTE 1: Two threads of execution can update and access
>                 separate memory locations without interfering with
>                 each other.
>
>                 NOTE 2: A bit-field and an adjacent non-bit-field member
>                 are in separate memory locations. The same applies
>                 to two bit-fields, if one is declared inside a nested
>                 structure declaration and the other is not, or if the two
>                 are separated by a zero-length bit-field declaration,
>                 or if they are separated by a non-bit-field member
>                 declaration. It is not safe to concurrently update two
>                 bit-fields in the same structure if all members declared
>                 between them are also bit-fields, no matter what the
>                 sizes of those intervening bit-fields happen to be.
>
> [*] Documentation/memory-barriers.txt
>
> > But even on normal x86 the check and the update would be racy.
>
> There is no check and update (RMW), though. Quoting what I wrote
> earlier in this thread:
>
> On Mon, Dec 1, 2025 at 11:22 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > the chained bios, bi_status is initialized to 0, and it is only set
> > again (to a non-0 value) when a failure occurs. When there are
> > multiple failures, we only need to make sure that one of those
> > failures is eventually reported, but for that, a simple assignment is
> > enough here. The cmpxchg() won't guarantee that a specific error value
> > will survive; it all still depends on the timing. The cmpxchg() only
> > makes it look like something special is happening here with respect to
> > ordering.
>
> So with or without the cmpxchg(), if there are multiple errors, we
> won't know which bi_status code will survive, but we do know that we
> will end up with one of those error codes.
>

Thank you for sharing your insights—I found the discussion very enlightening.

While I agree with Andreas’s perspective, I also very much appreciate
the clarity
and precision offered by the cmpxchg() approach. That’s why when Christoph
suggested it, I was happy to incorporate it into the code.

But a cmpxchg is a little bit redundant here.
so we will change it to the simple assignment:

-       if (bio->bi_status && !parent->bi_status)
                 parent->bi_status = bio->bi_status;
+       if (bio->bi_status)
                 parent->bi_status = bio->bi_status;

I will integrate this discussion into the commit message, it is very insightful.

Thanks,
Shida

> Andreas
>
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Stephen Zhang 2 months, 1 week ago
Stephen Zhang <starzhangzsd@gmail.com> 于2025年12月3日周三 09:51写道:
>
> Andreas Gruenbacher <agruenba@redhat.com> 于2025年12月3日周三 05:15写道:
> >
> > On Tue, Dec 2, 2025 at 6:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> > > > On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > > > > -       if (bio->bi_status && !parent->bi_status)
> > > > > > > -               parent->bi_status = bio->bi_status;
> > > > > > > +       if (bio->bi_status)
> > > > > > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > > > > >
> > > > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > > > again (to a non-0 value) when a failure occurs. When there are
> > > > > > multiple failures, we only need to make sure that one of those
> > > > > > failures is eventually reported, but for that, a simple assignment is
> > > > > > enough here.
> > > > >
> > > > > A simple assignment doesn't guarantee atomicy.
> > > >
> > > > Well, we've already discussed that bi_status is a single byte and so
> > > > tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> > > > enough here.
> > >
> > > No.  At least older alpha can tear byte updates as they need a
> > > read-modify-write cycle.
> >
> > I know this used to be a thing in the past, but to see that none of
> > that is relevant anymore today, have a look at where [*] quotes the
> > C11 standard:
> >
> >         memory location
> >                 either an object of scalar type, or a maximal sequence
> >                 of adjacent bit-fields all having nonzero width
> >
> >                 NOTE 1: Two threads of execution can update and access
> >                 separate memory locations without interfering with
> >                 each other.
> >
> >                 NOTE 2: A bit-field and an adjacent non-bit-field member
> >                 are in separate memory locations. The same applies
> >                 to two bit-fields, if one is declared inside a nested
> >                 structure declaration and the other is not, or if the two
> >                 are separated by a zero-length bit-field declaration,
> >                 or if they are separated by a non-bit-field member
> >                 declaration. It is not safe to concurrently update two
> >                 bit-fields in the same structure if all members declared
> >                 between them are also bit-fields, no matter what the
> >                 sizes of those intervening bit-fields happen to be.
> >
> > [*] Documentation/memory-barriers.txt
> >
> > > But even on normal x86 the check and the update would be racy.
> >
> > There is no check and update (RMW), though. Quoting what I wrote
> > earlier in this thread:
> >
> > On Mon, Dec 1, 2025 at 11:22 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > the chained bios, bi_status is initialized to 0, and it is only set
> > > again (to a non-0 value) when a failure occurs. When there are
> > > multiple failures, we only need to make sure that one of those
> > > failures is eventually reported, but for that, a simple assignment is
> > > enough here. The cmpxchg() won't guarantee that a specific error value
> > > will survive; it all still depends on the timing. The cmpxchg() only
> > > makes it look like something special is happening here with respect to
> > > ordering.
> >
> > So with or without the cmpxchg(), if there are multiple errors, we
> > won't know which bi_status code will survive, but we do know that we
> > will end up with one of those error codes.
> >
>
> Thank you for sharing your insights—I found the discussion very enlightening.
>
> While I agree with Andreas’s perspective, I also very much appreciate
> the clarity
> and precision offered by the cmpxchg() approach. That’s why when Christoph
> suggested it, I was happy to incorporate it into the code.
>
> But a cmpxchg is a little bit redundant here.
> so we will change it to the simple assignment:
>
> -       if (bio->bi_status && !parent->bi_status)
>                  parent->bi_status = bio->bi_status;
> +       if (bio->bi_status)
>                  parent->bi_status = bio->bi_status;
>
> I will integrate this discussion into the commit message, it is very insightful.
>

Hi,

I’ve been reconsidering the two approaches for the upcoming patch revision.
Essentially, we’re comparing two methods:
A:
        if (bio->bi_status)
                   parent->bi_status = bio->bi_status;
B:
        if (bio->bi_status)
                cmpxchg(&parent->bi_status, 0, bio->bi_status);

Both appear correct, but B seems a little bit redundant here.
Upon further reflection, I’ve noticed a subtle difference:
A unconditionally writes to parent->bi_status when bio->bi_status is non-zero,
regardless of the current value of parent->bi_status.
B uses cmpxchg to only update parent->bi_status if it is still zero.

Thus, B could avoid unnecessary writes in cases where parent->bi_status has
already been set to a non-zero value.

Do you think this optimization would be beneficial in practice, or is
the difference
negligible?

Thanks,
Shida

> Thanks,
> Shida
>
> > Andreas
> >
Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Christoph Hellwig 2 months, 1 week ago
On Wed, Dec 03, 2025 at 11:09:36AM +0800, Stephen Zhang wrote:
> 
> I’ve been reconsidering the two approaches for the upcoming patch revision.
> Essentially, we’re comparing two methods:
> A:
>         if (bio->bi_status)
>                    parent->bi_status = bio->bi_status;
> B:
>         if (bio->bi_status)
>                 cmpxchg(&parent->bi_status, 0, bio->bi_status);
> 
> Both appear correct, but B seems a little bit redundant here.

A is not correct.  You at least needs READ_ONCE/WRITE_ONCE here.

B solves all these issues.

> Upon further reflection, I’ve noticed a subtle difference:
> A unconditionally writes to parent->bi_status when bio->bi_status is non-zero,
> regardless of the current value of parent->bi_status.
> B uses cmpxchg to only update parent->bi_status if it is still zero.
> 
> Thus, B could avoid unnecessary writes in cases where parent->bi_status has
> already been set to a non-zero value.

The unnecessary writes don't really matter, we're in an error slow path
here.

Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
Posted by Caleb Sander Mateos 2 months, 1 week ago
On Tue, Dec 2, 2025 at 7:10 PM Stephen Zhang <starzhangzsd@gmail.com> wrote:
>
> Stephen Zhang <starzhangzsd@gmail.com> 于2025年12月3日周三 09:51写道:
> >
> > Andreas Gruenbacher <agruenba@redhat.com> 于2025年12月3日周三 05:15写道:
> > >
> > > On Tue, Dec 2, 2025 at 6:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> > > > > On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > > > > > -       if (bio->bi_status && !parent->bi_status)
> > > > > > > > -               parent->bi_status = bio->bi_status;
> > > > > > > > +       if (bio->bi_status)
> > > > > > > > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > > > > > >
> > > > > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > > > > again (to a non-0 value) when a failure occurs. When there are
> > > > > > > multiple failures, we only need to make sure that one of those
> > > > > > > failures is eventually reported, but for that, a simple assignment is
> > > > > > > enough here.
> > > > > >
> > > > > > A simple assignment doesn't guarantee atomicy.
> > > > >
> > > > > Well, we've already discussed that bi_status is a single byte and so
> > > > > tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> > > > > enough here.
> > > >
> > > > No.  At least older alpha can tear byte updates as they need a
> > > > read-modify-write cycle.
> > >
> > > I know this used to be a thing in the past, but to see that none of
> > > that is relevant anymore today, have a look at where [*] quotes the
> > > C11 standard:
> > >
> > >         memory location
> > >                 either an object of scalar type, or a maximal sequence
> > >                 of adjacent bit-fields all having nonzero width
> > >
> > >                 NOTE 1: Two threads of execution can update and access
> > >                 separate memory locations without interfering with
> > >                 each other.
> > >
> > >                 NOTE 2: A bit-field and an adjacent non-bit-field member
> > >                 are in separate memory locations. The same applies
> > >                 to two bit-fields, if one is declared inside a nested
> > >                 structure declaration and the other is not, or if the two
> > >                 are separated by a zero-length bit-field declaration,
> > >                 or if they are separated by a non-bit-field member
> > >                 declaration. It is not safe to concurrently update two
> > >                 bit-fields in the same structure if all members declared
> > >                 between them are also bit-fields, no matter what the
> > >                 sizes of those intervening bit-fields happen to be.
> > >
> > > [*] Documentation/memory-barriers.txt
> > >
> > > > But even on normal x86 the check and the update would be racy.
> > >
> > > There is no check and update (RMW), though. Quoting what I wrote
> > > earlier in this thread:
> > >
> > > On Mon, Dec 1, 2025 at 11:22 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > again (to a non-0 value) when a failure occurs. When there are
> > > > multiple failures, we only need to make sure that one of those
> > > > failures is eventually reported, but for that, a simple assignment is
> > > > enough here. The cmpxchg() won't guarantee that a specific error value
> > > > will survive; it all still depends on the timing. The cmpxchg() only
> > > > makes it look like something special is happening here with respect to
> > > > ordering.
> > >
> > > So with or without the cmpxchg(), if there are multiple errors, we
> > > won't know which bi_status code will survive, but we do know that we
> > > will end up with one of those error codes.
> > >
> >
> > Thank you for sharing your insights—I found the discussion very enlightening.
> >
> > While I agree with Andreas’s perspective, I also very much appreciate
> > the clarity
> > and precision offered by the cmpxchg() approach. That’s why when Christoph
> > suggested it, I was happy to incorporate it into the code.
> >
> > But a cmpxchg is a little bit redundant here.
> > so we will change it to the simple assignment:
> >
> > -       if (bio->bi_status && !parent->bi_status)
> >                  parent->bi_status = bio->bi_status;
> > +       if (bio->bi_status)
> >                  parent->bi_status = bio->bi_status;
> >
> > I will integrate this discussion into the commit message, it is very insightful.
> >
>
> Hi,
>
> I’ve been reconsidering the two approaches for the upcoming patch revision.
> Essentially, we’re comparing two methods:
> A:
>         if (bio->bi_status)
>                    parent->bi_status = bio->bi_status;
> B:
>         if (bio->bi_status)
>                 cmpxchg(&parent->bi_status, 0, bio->bi_status);
>
> Both appear correct, but B seems a little bit redundant here.
> Upon further reflection, I’ve noticed a subtle difference:
> A unconditionally writes to parent->bi_status when bio->bi_status is non-zero,
> regardless of the current value of parent->bi_status.
> B uses cmpxchg to only update parent->bi_status if it is still zero.
>
> Thus, B could avoid unnecessary writes in cases where parent->bi_status has
> already been set to a non-zero value.
>
> Do you think this optimization would be beneficial in practice, or is
> the difference
> negligible?

cmpxchg() is much more expensive than a plain write, as it compiles
into an atomic instruction (or load-linked/store-conditional pair on
architectures that don't provide such an instruction). This requires
the processor to take exclusive access of the cache line for the
duration of the operation (or check whether the cache line has been
invalidated during the operation). On x86, cmpxchg() marks the cache
line as modified regardless of whether the compare succeeds and the
value is updated. So it doesn't actually avoid the cost of a write.
However, the original code's check of parent->bi_status before writing
to it *does* avoid the write if parent->bi_status is already nonzero.
So the optimization you're looking for is already implemented :)
If __bio_chain_endio() can be called concurrently from multiple
threads accessing the same parent bio, it should be using WRITE_ONCE()
(and READ_ONCE(), if applicable) to access parent->bi_status to avoid
the data race. On x86 and ARM these macros produce the same
instruction as a normal write, but they may be required on other
architectures to avoid tearing, as well as to prevent the compiler
from adding or removing memory accesses based on the assumption that
the values aren't being accessed concurrently by other threads.

Best,
Caleb