Add proper bio_split() error handling. For any error, call
raid_end_bio_io() and return;
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
drivers/md/raid1.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6c9d24203f39..c561e2d185e2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      gfp, &conf->bio_split);
+		if (IS_ERR(split)) {
+			raid_end_bio_io(r1_bio);
+			return;
+		}
 		bio_chain(split, bio);
 		submit_bio_noacct(bio);
 		bio = split;
@@ -1576,6 +1580,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      GFP_NOIO, &conf->bio_split);
+		if (IS_ERR(split)) {
+			raid_end_bio_io(r1_bio);
+			return;
+		}
 		bio_chain(split, bio);
 		submit_bio_noacct(bio);
 		bio = split;
--
2.31.1
Hi,

On 2024/09/19 17:23, John Garry wrote:
> Add proper bio_split() error handling. For any error, call
> raid_end_bio_io() and return;
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  drivers/md/raid1.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 6c9d24203f39..c561e2d185e2 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>  	if (max_sectors < bio_sectors(bio)) {
>  		struct bio *split = bio_split(bio, max_sectors,
>  					      gfp, &conf->bio_split);
> +		if (IS_ERR(split)) {
> +			raid_end_bio_io(r1_bio);
> +			return;
> +		}

This way, BLK_STS_IOERR will always be returned; perhaps what you want
is to return the error code from bio_split()?

Thanks,
Kuai

>  		bio_chain(split, bio);
>  		submit_bio_noacct(bio);
>  		bio = split;
> @@ -1576,6 +1580,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  	if (max_sectors < bio_sectors(bio)) {
>  		struct bio *split = bio_split(bio, max_sectors,
>  					      GFP_NOIO, &conf->bio_split);
> +		if (IS_ERR(split)) {
> +			raid_end_bio_io(r1_bio);
> +			return;
> +		}
>  		bio_chain(split, bio);
>  		submit_bio_noacct(bio);
>  		bio = split;
On 20/09/2024 07:58, Yu Kuai wrote:
> Hi,
>
> On 2024/09/19 17:23, John Garry wrote:
>> Add proper bio_split() error handling. For any error, call
>> raid_end_bio_io() and return;
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>  drivers/md/raid1.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 6c9d24203f39..c561e2d185e2 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>  	if (max_sectors < bio_sectors(bio)) {
>>  		struct bio *split = bio_split(bio, max_sectors,
>>  					      gfp, &conf->bio_split);
>> +		if (IS_ERR(split)) {
>> +			raid_end_bio_io(r1_bio);
>> +			return;
>> +		}
>
> This way, BLK_STS_IOERR will always be returned, perhaps what you want
> is to return the error code from bio_split()?

Yeah, I would like to return that error code, so maybe I can encode it
in the master_bio directly or pass it as an arg to raid_end_bio_io().

Thanks,
John
On 2024/09/20 18:04, John Garry wrote:
> On 20/09/2024 07:58, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/09/19 17:23, John Garry wrote:
>>> Add proper bio_split() error handling. For any error, call
>>> raid_end_bio_io() and return;
>>>
>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>> ---
>>>  drivers/md/raid1.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 6c9d24203f39..c561e2d185e2 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>>  	if (max_sectors < bio_sectors(bio)) {
>>>  		struct bio *split = bio_split(bio, max_sectors,
>>>  					      gfp, &conf->bio_split);
>>> +		if (IS_ERR(split)) {
>>> +			raid_end_bio_io(r1_bio);
>>> +			return;
>>> +		}
>>
>> This way, BLK_STS_IOERR will always be returned, perhaps what you want
>> is to return the error code from bio_split()?
>
> Yeah, I would like to return that error code, so maybe I can encode it
> in the master_bio directly or pass as an arg to raid_end_bio_io().

That's fine. However, I think the change can introduce problems in some
corner cases, for example an rdev with badblocks and a slow rdev with a
full copy. Currently raid1_read_request() will split this bio to read
part of it from the fast rdev, and read the badblocks region from the
slow rdev.

We need a new branch in read_balance() to choose an rdev with a full
copy.

Thanks,
Kuai

> Thanks,
> John
On 23/09/2024 07:15, Yu Kuai wrote:

Hi Kuai,

>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 6c9d24203f39..c561e2d185e2 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>  	if (max_sectors < bio_sectors(bio)) {
>>  		struct bio *split = bio_split(bio, max_sectors,
>>  					      gfp, &conf->bio_split);
>> +		if (IS_ERR(split)) {
>> +			raid_end_bio_io(r1_bio);
>> +			return;
>> +		}
>
> This way, BLK_STS_IOERR will always be returned, perhaps what you want
> is to return the error code from bio_split()?

I am not sure of the best way to pass the bio_split() error code to
bio->bi_status.

I could just have this pattern:

	bio->bi_status = errno_to_blk_status(err);
	set_bit(R1BIO_Uptodate, &r1_bio->state);
	raid_end_bio_io(r1_bio);

Is there a neater way to do this?

Thanks,
John
On 2024/10/23 19:21, John Garry wrote:
> On 23/09/2024 07:15, Yu Kuai wrote:
>
> Hi Kuai,
>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 6c9d24203f39..c561e2d185e2 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1383,6 +1383,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>>  	if (max_sectors < bio_sectors(bio)) {
>>>  		struct bio *split = bio_split(bio, max_sectors,
>>>  					      gfp, &conf->bio_split);
>>> +		if (IS_ERR(split)) {
>>> +			raid_end_bio_io(r1_bio);
>>> +			return;
>>> +		}
>>
>> This way, BLK_STS_IOERR will always be returned, perhaps what you want
>> is to return the error code from bio_split()?
>
> I am not sure on the best way to pass the bio_split() error code to
> bio->bi_status.
>
> I could just have this pattern:
>
> 	bio->bi_status = errno_to_blk_status(err);
> 	set_bit(R1BIO_Uptodate, &r1_bio->state);
> 	raid_end_bio_io(r1_bio);

I can live with this. :)

> Is there a neater way to do this?

Perhaps add a new field 'status' in r1bio, initialized to
BLK_STS_IOERR?

Then replace:
	set_bit(R1BIO_Uptodate, &r1_bio->state);
with:
	r1_bio->status = BLK_STS_OK;

and change call_bio_endio:
	bio->bi_status = r1_bio->status;

and finally, here:
	r1_bio->status = errno_to_blk_status(err);
	raid_end_bio_io(r1_bio);

Thanks,
Kuai

> Thanks,
> John
On 24/10/2024 04:08, Yu Kuai wrote:
>>
>> I could just have this pattern:

Hi Kuai,

>>
>> 	bio->bi_status = errno_to_blk_status(err);
>> 	set_bit(R1BIO_Uptodate, &r1_bio->state);
>> 	raid_end_bio_io(r1_bio);
>>
> I can live with this. 🙂
>
>> Is there a neater way to do this?
>
> Perhaps add a new filed 'status' in r1bio? And initialize it to
> BLK_STS_IOERR;
>
> Then replace:
> 	set_bit(R1BIO_Uptodate, &r1_bio->state);
> to:
> 	r1_bio->status = BLK_STS_OK;

So are you saying that R1BIO_Uptodate could be dropped then?

>
> and change call_bio_endio:
> 	bio->bi_status = r1_bio->status;
>
> finially here:
> 	r1_bio->status = errno_to_blk_status(err);
> 	raid_end_bio_io(r1_bio);

Why not just set bio->bi_status directly?

Cheers,
John
Hi,

On 2024/10/24 21:51, John Garry wrote:
> On 24/10/2024 04:08, Yu Kuai wrote:
>>>
>>> I could just have this pattern:
>
> Hi Kuai,
>
>>>
>>> 	bio->bi_status = errno_to_blk_status(err);
>>> 	set_bit(R1BIO_Uptodate, &r1_bio->state);
>>> 	raid_end_bio_io(r1_bio);
>>>
>> I can live with this. 🙂
>>
>>> Is there a neater way to do this?
>>
>> Perhaps add a new filed 'status' in r1bio? And initialize it to
>> BLK_STS_IOERR;
>>
>> Then replace:
>> 	set_bit(R1BIO_Uptodate, &r1_bio->state);
>> to:
>> 	r1_bio->status = BLK_STS_OK;
>
> So are you saying that R1BIO_Uptodate could be dropped then?
>
>>
>> and change call_bio_endio:
>> 	bio->bi_status = r1_bio->status;
>>
>> finially here:
>> 	r1_bio->status = errno_to_blk_status(err);
>> 	raid_end_bio_io(r1_bio);
>
> Why not just set bio->bi_status directly?

Because you would have to set R1BIO_Uptodate in that case, and that is
not what the flag means. Like I said, I can live with this; it's up to
you. :)

Thanks,
Kuai

> Cheers,
> John
On 23/09/2024 07:15, Yu Kuai wrote:
>>>
>>> This way, BLK_STS_IOERR will always be returned, perhaps what you want
>>> is to return the error code from bio_split()?
>>
>> Yeah, I would like to return that error code, so maybe I can encode it
>> in the master_bio directly or pass as an arg to raid_end_bio_io().
>
> That's fine, however, I think the change can introduce problems in some
> corner cases, for example there is a rdev with badblocks and a slow rdev
> with full copy. Currently raid1_read_request() will split this bio to
> read some from fast rdev, and read the badblocks region from slow rdev.
>
> We need a new branch in read_balance() to choose a rdev with full copy.

Sure, I do realize that the mirroring personalities need more
sophisticated error handling changes (than what I presented).

However, in raid1_read_request() we do the read_balance() and then the
bio_split() attempt. So what are you suggesting we do for the
bio_split() error? Is it to retry without the bio_split()?

To me, bio_split() should not fail. If it does, it is likely ENOMEM or
some other bug being exposed, so I am not sure that retrying while
skipping bio_split() is the right approach (if that is what you are
suggesting).

Thanks,
John
Hi,

On 2024/09/23 15:44, John Garry wrote:
> On 23/09/2024 07:15, Yu Kuai wrote:
>>>>
>>>> This way, BLK_STS_IOERR will always be returned, perhaps what you want
>>>> is to return the error code from bio_split()?
>>>
>>> Yeah, I would like to return that error code, so maybe I can encode
>>> it in the master_bio directly or pass as an arg to raid_end_bio_io().
>>
>> That's fine, however, I think the change can introduce problems in some
>> corner cases, for example there is a rdev with badblocks and a slow rdev
>> with full copy. Currently raid1_read_request() will split this bio to
>> read some from fast rdev, and read the badblocks region from slow rdev.
>>
>> We need a new branch in read_balance() to choose a rdev with full copy.
>
> Sure, I do realize that the mirror'ing personalities need more
> sophisticated error handling changes (than what I presented).
>
> However, in raid1_read_request() we do the read_balance() and then the
> bio_split() attempt. So what are you suggesting we do for the
> bio_split() error? Is it to retry without the bio_split()?
>
> To me bio_split() should not fail. If it does, it is likely ENOMEM or
> some other bug being exposed, so I am not sure that retrying with
> skipping bio_split() is the right approach (if that is what you are
> suggesting).

bio_split_to_limits() is already called from md_submit_bio(), so here
the bio should only be split because of badblocks or resync. We have to
return an error for resync; however, for badblocks, we can still try to
find an rdev without badblocks so that bio_split() is not needed. And
we need to retry and inform read_balance() to skip rdevs with badblocks
in this case.

This can only happen if the full copy exists only on slow disks. This
really is a corner case, and it is not related to your new error path
for atomic writes. I don't mind this version for now; it is just
something I noticed given that bio_split() can fail.

Thanks,
Kuai
On 23/09/2024 09:18, Yu Kuai wrote:
>>>
>>> We need a new branch in read_balance() to choose a rdev with full copy.
>>
>> Sure, I do realize that the mirror'ing personalities need more
>> sophisticated error handling changes (than what I presented).
>>
>> However, in raid1_read_request() we do the read_balance() and then the
>> bio_split() attempt. So what are you suggesting we do for the
>> bio_split() error? Is it to retry without the bio_split()?
>>
>> To me bio_split() should not fail. If it does, it is likely ENOMEM or
>> some other bug being exposed, so I am not sure that retrying with
>> skipping bio_split() is the right approach (if that is what you are
>> suggesting).
>
> bio_split_to_limits() is already called from md_submit_bio(), so here
> bio should only be splitted because of badblocks or resync. We have to
> return error for resync, however, for badblocks, we can still try to
> find a rdev without badblocks so bio_split() is not needed. And we need
> to retry and inform read_balance() to skip rdev with badblocks in this
> case.
>
> This can only happen if the full copy only exist in slow disks. This
> really is corner case, and this is not related to your new error path by
> atomic write. I don't mind this version for now, just something
> I noticed if bio_spilit() can fail.

Are you saying that some improvement needs to be made to the current
code for badblocks handling, like initially trying to skip bio_split()?

Apart from that, what about the change in raid10_write_request() w.r.t.
error handling?

There, for an error in bio_split(), I think that we need to do some
tidy-up if bio_split() fails, i.e. undo the increase in
rdev->nr_pending when looping conf->copies.

BTW, feel free to comment in patch 6/6 for that.

Thanks,
John
Hi,

On 2024/09/23 17:21, John Garry wrote:
> On 23/09/2024 09:18, Yu Kuai wrote:
>>>>
>>>> We need a new branch in read_balance() to choose a rdev with full copy.
>>>
>>> Sure, I do realize that the mirror'ing personalities need more
>>> sophisticated error handling changes (than what I presented).
>>>
>>> However, in raid1_read_request() we do the read_balance() and then
>>> the bio_split() attempt. So what are you suggesting we do for the
>>> bio_split() error? Is it to retry without the bio_split()?
>>>
>>> To me bio_split() should not fail. If it does, it is likely ENOMEM or
>>> some other bug being exposed, so I am not sure that retrying with
>>> skipping bio_split() is the right approach (if that is what you are
>>> suggesting).
>>
>> bio_split_to_limits() is already called from md_submit_bio(), so here
>> bio should only be splitted because of badblocks or resync. We have to
>> return error for resync, however, for badblocks, we can still try to
>> find a rdev without badblocks so bio_split() is not needed. And we need
>> to retry and inform read_balance() to skip rdev with badblocks in this
>> case.
>>
>> This can only happen if the full copy only exist in slow disks. This
>> really is corner case, and this is not related to your new error path by
>> atomic write. I don't mind this version for now, just something
>> I noticed if bio_spilit() can fail.
>
> Are you saying that some improvement needs to be made to the current
> code for badblocks handling, like initially try to skip bio_split()?
>
> Apart from that, what about the change in raid10_write_request(), w.r.t
> error handling?
>
> There, for an error in bio_split(), I think that we need to do some
> tidy-up if bio_split() fails, i.e. undo increase in rdev->nr_pending
> when looping conf->copies
>
> BTW, feel free to comment in patch 6/6 for that.

Yes, raid1/raid10 write are the same. If you want to enable atomic
writes for raid1/raid10, you must add a new branch to handle badblocks
now; otherwise, as long as one copy contains any badblocks, an atomic
write will fail, while theoretically I think it could work.

Thanks,
Kuai

> Thanks,
> John
On 23/09/2024 10:38, Yu Kuai wrote:
>>>>>
>>>>> We need a new branch in read_balance() to choose a rdev with full
>>>>> copy.
>>>>
>>>> Sure, I do realize that the mirror'ing personalities need more
>>>> sophisticated error handling changes (than what I presented).
>>>>
>>>> However, in raid1_read_request() we do the read_balance() and then
>>>> the bio_split() attempt. So what are you suggesting we do for the
>>>> bio_split() error? Is it to retry without the bio_split()?
>>>>
>>>> To me bio_split() should not fail. If it does, it is likely ENOMEM
>>>> or some other bug being exposed, so I am not sure that retrying with
>>>> skipping bio_split() is the right approach (if that is what you are
>>>> suggesting).
>>>
>>> bio_split_to_limits() is already called from md_submit_bio(), so here
>>> bio should only be splitted because of badblocks or resync. We have to
>>> return error for resync, however, for badblocks, we can still try to
>>> find a rdev without badblocks so bio_split() is not needed. And we need
>>> to retry and inform read_balance() to skip rdev with badblocks in this
>>> case.
>>>
>>> This can only happen if the full copy only exist in slow disks. This
>>> really is corner case, and this is not related to your new error path by
>>> atomic write. I don't mind this version for now, just something
>>> I noticed if bio_spilit() can fail.

Hi Kuai,

I am just coming back to this topic now.

Previously I was saying that we should error and end the bio if we need
to split for an atomic write due to BBs. Continued below..

>> Are you saying that some improvement needs to be made to the current
>> code for badblocks handling, like initially try to skip bio_split()?
>>
>> Apart from that, what about the change in raid10_write_request(),
>> w.r.t error handling?
>>
>> There, for an error in bio_split(), I think that we need to do some
>> tidy-up if bio_split() fails, i.e. undo increase in rdev->nr_pending
>> when looping conf->copies
>>
>> BTW, feel free to comment in patch 6/6 for that.
>
> Yes, raid1/raid10 write are the same. If you want to enable atomic write
> for raid1/raid10, you must add a new branch to handle badblocks now,
> otherwise, as long as one copy contain any badblocks, atomic write will
> fail while theoretically I think it can work.

Can you please expand on what you mean by the last sentence, "I think
it can work"?

Indeed, IMO, the chance of encountering a device with BBs and
supporting atomic writes is low, so there is no need to try to make it
work (if it were possible) - I think that we just report EIO.

Thanks,
John
On 23/10/2024 12:16, John Garry wrote:
> On 23/09/2024 10:38, Yu Kuai wrote:
>>>>>> We need a new branch in read_balance() to choose a rdev with full
>>>>>> copy.
>>>>> Sure, I do realize that the mirror'ing personalities need more
>>>>> sophisticated error handling changes (than what I presented).
>>>>>
>>>>> However, in raid1_read_request() we do the read_balance() and then
>>>>> the bio_split() attempt. So what are you suggesting we do for the
>>>>> bio_split() error? Is it to retry without the bio_split()?
>>>>>
>>>>> To me bio_split() should not fail. If it does, it is likely ENOMEM
>>>>> or some other bug being exposed, so I am not sure that retrying with
>>>>> skipping bio_split() is the right approach (if that is what you are
>>>>> suggesting).
>>>> bio_split_to_limits() is already called from md_submit_bio(), so here
>>>> bio should only be splitted because of badblocks or resync. We have to
>>>> return error for resync, however, for badblocks, we can still try to
>>>> find a rdev without badblocks so bio_split() is not needed. And we need
>>>> to retry and inform read_balance() to skip rdev with badblocks in this
>>>> case.
>>>>
>>>> This can only happen if the full copy only exist in slow disks. This
>>>> really is corner case, and this is not related to your new error path by
>>>> atomic write. I don't mind this version for now, just something
>>>> I noticed if bio_spilit() can fail.
>
> Hi Kuai,
>
> I am just coming back to this topic now.
>
> Previously I was saying that we should error and end the bio if we need
> to split for an atomic write due to BB. Continued below..
>
>>> Are you saying that some improvement needs to be made to the current
>>> code for badblocks handling, like initially try to skip bio_split()?
>>>
>>> Apart from that, what about the change in raid10_write_request(),
>>> w.r.t error handling?
>>>
>>> There, for an error in bio_split(), I think that we need to do some
>>> tidy-up if bio_split() fails, i.e. undo increase in rdev->nr_pending
>>> when looping conf->copies
>>>
>>> BTW, feel free to comment in patch 6/6 for that.
>> Yes, raid1/raid10 write are the same. If you want to enable atomic write
>> for raid1/raid10, you must add a new branch to handle badblocks now,
>> otherwise, as long as one copy contain any badblocks, atomic write will
>> fail while theoretically I think it can work.
> Can you please expand on what you mean by this last sentence, "I think
> it can work".
>
> Indeed, IMO, chance of encountering a device with BBs and supporting
> atomic writes is low, so no need to try to make it work (if it were
> possible) - I think that we just report EIO.
>
> Thanks,
> John

Hi all,

Looking at this from a different angle: what does the bad blocks system
actually gain in modern environments? All the physical storage devices
I can think of (including all HDDs and SSDs, NVMe or otherwise) have
internal mechanisms for remapping faulty blocks, and therefore
unrecoverable blocks don't become visible at the Linux kernel level
until after the physical storage device has exhausted its internal
supply of replacement blocks. At that point the physical device is
already catastrophically failing, and in the case of SSDs will likely
have already transitioned to a read-only state. Using bad-blocks at the
kernel level to map around additional faulty blocks at this point does
not seem to me to have any benefit, and the device is unlikely to be
even marginally usable for any useful length of time at that point
anyway.

It seems to me that the bad-blocks capability is a legacy from the
distant past, when HDDs did not do internal block remapping and hence
the kernel could usefully keep a disk usable by mapping out individual
blocks in software.

If this is the case and there isn't some other way that bad-blocks is
still beneficial, might it be better to drop it altogether rather than
implementing complex code to work around its effects?

Of course I'm happy to be corrected if there's still a real benefit to
having it; just because I can't see one doesn't mean there isn't one.

Regards,
Geoff.
On 23/10/2024 12:46, Geoff Back wrote:
>>> Yes, raid1/raid10 write are the same. If you want to enable atomic write
>>> for raid1/raid10, you must add a new branch to handle badblocks now,
>>> otherwise, as long as one copy contain any badblocks, atomic write will
>>> fail while theoretically I think it can work.
>> Can you please expand on what you mean by this last sentence, "I think
>> it can work".
>>
>> Indeed, IMO, chance of encountering a device with BBs and supporting
>> atomic writes is low, so no need to try to make it work (if it were
>> possible) - I think that we just report EIO.
>>
>> Thanks,
>> John
>>
> Hi all,
>
> Looking at this from a different angle: what does the bad blocks system
> actually gain in modern environments? All the physical storage devices
> I can think of (including all HDDs and SSDs, NVME or otherwise) have
> internal mechanisms for remapping faulty blocks, and therefore
> unrecoverable blocks don't become visible to the Linux kernel level
> until after the physical storage device has exhausted its internal
> supply of replacement blocks. At that point the physical device is
> already catastrophically failing, and in the case of SSDs will likely
> have already transitioned to a read-only state. Using bad-blocks at the
> kernel level to map around additional faulty blocks at this point does
> not seem to me to have any benefit, and the device is unlikely to be
> even marginally usable for any useful length of time at that point anyway.
>
> It seems to me that the bad-blocks capability is a legacy from the
> distant past when HDDs did not do internal block remapping and hence the
> kernel could usefully keep a disk usable by mapping out individual
> blocks in software.
> If this is the case and there isn't some other way that bad-blocks is
> still beneficial, might it be better to drop it altogether rather than
> implementing complex code to work around its effects?

I am not proposing to drop it. That is another topic.

I am just saying that I don't expect BBs on a device which supports
atomic writes. As such, the solution for that case is simple: for an
atomic write which covers BBs in any rdev, just error that write.

>
> Of course I'm happy to be corrected if there's still a real benefit to
> having it, just because I can't see one doesn't mean there isn't one.

I don't know if there is really a BB support benefit for modern devices
at all.

Thanks,
John
Hi,

On 2024/10/23 20:11, John Garry wrote:
> On 23/10/2024 12:46, Geoff Back wrote:
>>>> Yes, raid1/raid10 write are the same. If you want to enable atomic
>>>> write
>>>> for raid1/raid10, you must add a new branch to handle badblocks now,
>>>> otherwise, as long as one copy contain any badblocks, atomic write will
>>>> fail while theoretically I think it can work.
>>> Can you please expand on what you mean by this last sentence, "I think
>>> it can work".

I mean that in this case, for the write IO, there is no need to split
the IO for the underlying disks that don't have BBs, hence an atomic
write can still work. The current solution is to split the IO to the
range where none of the underlying disks has BBs.

>>>
>>> Indeed, IMO, chance of encountering a device with BBs and supporting
>>> atomic writes is low, so no need to try to make it work (if it were
>>> possible) - I think that we just report EIO.

If you want this, then make sure raid sets failfast together with
atomic writes. This way the disk will just become faulty on IO error
instead of being marked with BBs, which ensures there are no BBs.

>>>
>>> Thanks,
>>> John
>>>
>> Hi all,
>>
>> Looking at this from a different angle: what does the bad blocks system
>> actually gain in modern environments? All the physical storage devices
>> I can think of (including all HDDs and SSDs, NVME or otherwise) have
>> internal mechanisms for remapping faulty blocks, and therefore
>> unrecoverable blocks don't become visible to the Linux kernel level
>> until after the physical storage device has exhausted its internal
>> supply of replacement blocks. At that point the physical device is
>> already catastrophically failing, and in the case of SSDs will likely
>> have already transitioned to a read-only state. Using bad-blocks at the
>> kernel level to map around additional faulty blocks at this point does
>> not seem to me to have any benefit, and the device is unlikely to be
>> even marginally usable for any useful length of time at that point
>> anyway.
>>
>> It seems to me that the bad-blocks capability is a legacy from the
>> distant past when HDDs did not do internal block remapping and hence the
>> kernel could usefully keep a disk usable by mapping out individual
>> blocks in software.
>> If this is the case and there isn't some other way that bad-blocks is
>> still beneficial, might it be better to drop it altogether rather than
>> implementing complex code to work around its effects?

No, we can't just kill it, unless all disks behave like this: never
return an IO error while the disk is still accessible, and once an IO
error is returned, the disk is totally unusable. (This is what failfast
means in raid.)

Thanks,
Kuai

> I am not proposing to drop it. That is another topic.
>
> I am just saying that I don't expect BBs for a device which supports
> atomic writes. As such, the solution for that case is simple - for an
> atomic write which cover BBs in any rdev, then just error that write.
>
>> Of course I'm happy to be corrected if there's still a real benefit to
>> having it, just because I can't see one doesn't mean there isn't one.
>
> I don't know if there is really a BB support benefit for modern devices
> at all.
>
> Thanks,
> John
On 24/10/2024 03:10, Yu Kuai wrote:
>> On 23/10/2024 12:46, Geoff Back wrote:
>>>>> Yes, raid1/raid10 write are the same. If you want to enable atomic
>>>>> write
>>>>> for raid1/raid10, you must add a new branch to handle badblocks now,
>>>>> otherwise, as long as one copy contain any badblocks, atomic write
>>>>> will
>>>>> fail while theoretically I think it can work.
>>>> Can you please expand on what you mean by this last sentence, "I think
>>>> it can work".
>
> I mean in this case, for the write IO, there is no need to split this IO
> for the underlying disks that doesn't have BB, hence atomic write can
> still work. Currently solution is to split the IO to the range that all
> underlying disks doesn't have BB.

ok, right.

>
>>>>
>>>> Indeed, IMO, chance of encountering a device with BBs and supporting
>>>> atomic writes is low, so no need to try to make it work (if it were
>>>> possible) - I think that we just report EIO.
>
> If you want this, then make sure raid will set fail fast together with
> atomic write. This way disk will just faulty with IO error instead of
> marking with BB, hence make sure there are no BBs.

To be clear, you mean to set the r1/r10 bio failfast flag, right? There
are rdev and also r1/r10 bio failfast flags.

Thanks,
John
Hi,

On 2024/10/24 16:57, John Garry wrote:
> On 24/10/2024 03:10, Yu Kuai wrote:
>>> On 23/10/2024 12:46, Geoff Back wrote:
>>>>>> Yes, raid1/raid10 write are the same. If you want to enable atomic
>>>>>> write
>>>>>> for raid1/raid10, you must add a new branch to handle badblocks now,
>>>>>> otherwise, as long as one copy contain any badblocks, atomic write
>>>>>> will
>>>>>> fail while theoretically I think it can work.
>>>>> Can you please expand on what you mean by this last sentence, "I think
>>>>> it can work".
>>
>> I mean in this case, for the write IO, there is no need to split this IO
>> for the underlying disks that doesn't have BB, hence atomic write can
>> still work. Currently solution is to split the IO to the range that all
>> underlying disks doesn't have BB.
>
> ok, right.
>
>>
>>>>>
>>>>> Indeed, IMO, chance of encountering a device with BBs and supporting
>>>>> atomic writes is low, so no need to try to make it work (if it were
>>>>> possible) - I think that we just report EIO.
>>
>> If you want this, then make sure raid will set fail fast together with
>> atomic write. This way disk will just faulty with IO error instead of
>> marking with BB, hence make sure there are no BBs.
>
> To be clear, you mean to set the r1/r10 bio failfast flag, right? There
> are rdev and also r1/r10 bio failfast flags.

I mean the rdev flag: all underlying disks should set FailFast, so that
no BBs will be present. The rdev will just become faulty in the IO
error case.

The r1/r10 bio failfast flags are for internal usage to handle IO
errors.

Thanks,
Kuai

> Thanks,
> John
On 24/10/2024 10:12, Yu Kuai wrote:
>>>
>>>>>>
>>>>>> Indeed, IMO, chance of encountering a device with BBs and supporting
>>>>>> atomic writes is low, so no need to try to make it work (if it were
>>>>>> possible) - I think that we just report EIO.
>>>
>>> If you want this, then make sure raid will set fail fast together with
>>> atomic write. This way disk will just faulty with IO error instead of
>>> marking with BB, hence make sure there are no BBs.
>>
>> To be clear, you mean to set the r1/r10 bio failfast flag, right?
>> There are rdev and also r1/r10 bio failfast flags.
>
> I mean the rdev flag, all underlying disks should set FailFast, so that
> no BB will be present. rdev will just become faulty for the case IO
> error.
>
> r1/r10 bio failfast flags is for internal usage to handle IO error.

I am not familiar with all the consequences of FailFast for an rdev,
but it seems a bit drastic to set it just because the rdev supports
atomic writes. If we support atomic writes, then not all writes will
necessarily be atomic.

Thanks,
John
Hi,

On 2024/10/24 17:56, John Garry wrote:
> On 24/10/2024 10:12, Yu Kuai wrote:
>>>>
>>>>>>>
>>>>>>> Indeed, IMO, chance of encountering a device with BBs and supporting
>>>>>>> atomic writes is low, so no need to try to make it work (if it were
>>>>>>> possible) - I think that we just report EIO.
>>>>
>>>> If you want this, then make sure raid will set fail fast together with
>>>> atomic write. This way disk will just faulty with IO error instead of
>>>> marking with BB, hence make sure there are no BBs.
>>>
>>> To be clear, you mean to set the r1/r10 bio failfast flag, right?
>>> There are rdev and also r1/r10 bio failfast flags.
>>
>> I mean the rdev flag, all underlying disks should set FailFast, so that
>> no BB will be present. rdev will just become faulty for the case IO
>> error.
>>
>> r1/r10 bio failfast flags is for internal usage to handle IO error.
>
> I am not familiar with all consequences of FailFast for an rdev, but it
> seems a bit drastic to set it just because the rdev supports atomic
> writes. If we support atomic writes, then not all writes will
> necessarily be atomic.

I don't see any other option for now:

1) set failfast and make sure no BBs will be present;
2) handle BBs, and don't split the IO for the good disks for atomic
   writes.

Thanks,
Kuai

> Thanks,
> John
On 23/09/2024 10:38, Yu Kuai wrote:
>>
>> Are you saying that some improvement needs to be made to the current
>> code for badblocks handling, like initially try to skip bio_split()?
>>
>> Apart from that, what about the change in raid10_write_request(),
>> w.r.t error handling?
>>
>> There, for an error in bio_split(), I think that we need to do some
>> tidy-up if bio_split() fails, i.e. undo increase in rdev->nr_pending
>> when looping conf->copies
>>
>> BTW, feel free to comment in patch 6/6 for that.
>
> Yes, raid1/raid10 write are the same. If you want to enable atomic write
> for raid1/raid10, you must add a new branch to handle badblocks now,
> otherwise, as long as one copy contain any badblocks, atomic write will
> fail while theoretically I think it can work.

ok, I'll check the badblocks code further to understand this.

The point really for atomic write support is that we should just not be
attempting to split such a bio, and we should handle an attempt to
split an atomic write bio like any other bio_split() failure, i.e. if
it does happen we either have a software bug or are out of resources
(-ENOMEM). Properly stacked atomic write queue limits should ensure
that we are never in the situation where we do need to split, and the
new check in bio_split() is just an insurance policy.

Thanks,
John