Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a super large delay.
We now split the space left to discard based the block discard limit
in preparation of introduction of cancellation signals handling.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a5966324607d..9c1ddf13659e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
}
if (bytes_left) {
- ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
- bytes_left >> SECTOR_SHIFT,
- GFP_NOFS);
- if (!ret)
- *discarded_bytes += bytes_left;
+ u64 bytes_to_discard;
+ struct bio *bio = NULL;
+ sector_t sector = start >> SECTOR_SHIFT;
+ sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
+
+ while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects,
+ GFP_NOFS))) {
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+
+ if (!ret)
+ bytes_to_discard = bio->bi_iter.bi_size;
+ else if (ret != -EOPNOTSUPP)
+ return ret;
+
+ start += bytes_to_discard;
+ bytes_left -= bytes_to_discard;
+ }
}
+
return ret;
}
--
2.46.0
Hi Luca, kernel test robot noticed the following build errors: [auto build test ERROR on v6.11-rc6] [also build test ERROR on linus/master] [cannot apply to kdave/for-next next-20240906] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Luca-Stefani/btrfs-Always-update-fstrim_range-on-failure/20240903-191851 base: v6.11-rc6 patch link: https://lore.kernel.org/r/20240903071625.957275-3-luca.stefani.ge1%40gmail.com patch subject: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202409070311.SB9wmgSx-lkp@intel.com/ All errors (new ones prefixed by >>, old ones prefixed by <<): >> ERROR: modpost: "blk_alloc_discard_bio" [fs/btrfs/btrfs.ko] undefined! -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
On Tue, Sep 03, 2024 at 09:16:11AM +0200, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
>
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
> fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> }
>
> if (bytes_left) {
> - ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> - bytes_left >> SECTOR_SHIFT,
> - GFP_NOFS);
> - if (!ret)
> - *discarded_bytes += bytes_left;
> + u64 bytes_to_discard;
> + struct bio *bio = NULL;
> + sector_t sector = start >> SECTOR_SHIFT;
> + sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects,
> + GFP_NOFS))) {
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> +
> + if (!ret)
> + bytes_to_discard = bio->bi_iter.bi_size;
> + else if (ret != -EOPNOTSUPP)
> + return ret;
> +
> + start += bytes_to_discard;
> + bytes_left -= bytes_to_discard;
> + }
This is not what I anticipated, we only wanted to know the optimal
request size but now it's reimplementing the bio submission and compared
to blkdev_issue_discard() it lacks blk_start_plug/blk_finish_plug.
As we won't get the bio_discard_limit() export for some reason I suggest
to go back to setting the maximum chunk limit in our code and set it to
something like 8G.
On 03/09/24 09:16, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
>
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
> fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> }
>
> if (bytes_left) {
> - ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> - bytes_left >> SECTOR_SHIFT,
> - GFP_NOFS);
> - if (!ret)
> - *discarded_bytes += bytes_left;
I removed this by mistake, will be re-added.
> + u64 bytes_to_discard;
> + struct bio *bio = NULL;
> + sector_t sector = start >> SECTOR_SHIFT;
> + sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects,
> + GFP_NOFS))) {
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> +
> + if (!ret)
> + bytes_to_discard = bio->bi_iter.bi_size;
> + else if (ret != -EOPNOTSUPP)
> + return ret;
I think I got the logic wrong, we probably want to `continue` in case
ret is set, but it's not -EOPNOTSUPP, otherwise bytes_to_discard might
be left uninitialized.
bio->bi_iter.bi_size can be used directly for all those cases, so I'll
remove bytes_to_discard as a whole.
> +
> + start += bytes_to_discard;
> + bytes_left -= bytes_to_discard;
> + }
> }
> +
> return ret;
> }
>
I'll fix those up for v4, but I'll wait for more comments before doing so.
You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
在 2024/9/3 16:57, Christoph Hellwig 写道: > You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio. Just curious, shouldn't blk_ioctl_discard() also check frozen status to prevent block device trim from block suspension? And if that's the case, would it make more sense to move the signal and freezing checks into blk_alloc_discard_bio()? Thanks, Qu
On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote: > > > 在 2024/9/3 16:57, Christoph Hellwig 写道: > > You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio. > > Just curious, shouldn't blk_ioctl_discard() also check frozen status to > prevent block device trim from block suspension? Someone needs to explain what 'block suspension' is for that first. > And if that's the case, would it make more sense to move the signal and > freezing checks into blk_alloc_discard_bio()? No. Look at that the recent history of the dicard support. We tried that and it badly breaks callers not expecting the signal handling.
在 2024/9/4 13:43, Christoph Hellwig 写道:
> On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/3 16:57, Christoph Hellwig 写道:
>>> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
>>
>> Just curious, shouldn't blk_ioctl_discard() also check frozen status to
>> prevent block device trim from block suspension?
>
> Someone needs to explain what 'block suspension' is for that first.
E.g. a long running full device trim preventing PM suspension/hibernation.
The original report shows the fstrim of btrfs is preventing the system
entering suspension/hibernation:
PM: suspend entry (deep)
Filesystems sync: 0.060 seconds
Freezing user space processes
Freezing user space processes failed after 20.007 seconds (1 tasks
refusing to freeze, wq_busy=0):
task:fstrim state:D stack:0 pid:15564 tgid:15564 ppid:1
flags:0x00004006
Call Trace:
<TASK>
__schedule+0x381/0x1540
schedule+0x24/0xb0
schedule_timeout+0x1ea/0x2a0
io_schedule_timeout+0x19/0x50
wait_for_completion_io+0x78/0x140
submit_bio_wait+0xaa/0xc0
blkdev_issue_discard+0x65/0xb0
btrfs_issue_discard+0xcf/0x160 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
btrfs_discard_extent+0x120/0x2a0 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
do_trimming+0xd4/0x220 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
trim_bitmaps+0x418/0x520 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
btrfs_trim_block_group+0xcb/0x130 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
btrfs_trim_fs+0x119/0x460 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
btrfs_ioctl_fitrim+0xfb/0x160 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
btrfs_ioctl+0x11cc/0x29f0 [btrfs
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
__x64_sys_ioctl+0x92/0xd0
do_syscall_64+0x5b/0x80
entry_SYSCALL_64_after_hwframe+0x7c/0xe6
RIP: 0033:0x7f5f3b529f9b
RSP: 002b:00007fff279ebc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fff279ebd60 RCX: 00007f5f3b529f9b
RDX: 00007fff279ebc90 RSI: 00000000c0185879 RDI: 0000000000000003
RBP: 000055748718b2d0 R08: 00005574871899e8 R09: 00007fff279eb010
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 000055748718ac40 R14: 000055748718b290 R15: 000055748718b290
</TASK>
OOM killer enabled.
Restarting tasks ... done.
random: crng reseeded on system resumption
PM: suspend exit
PM: suspend entry (s2idle)
Filesystems sync: 0.047 seconds
Considering blkdev_issue_discard() is already doing cond_sched() and
btrfs is also doing cond_sched() for each block group, it looks like PM
freezing needs its own freezing() checks, just like what ext4 is doing.
>
>> And if that's the case, would it make more sense to move the signal and
>> freezing checks into blk_alloc_discard_bio()?
>
> No. Look at that the recent history of the dicard support. We tried
> that and it badly breaks callers not expecting the signal handling.
OK, got it, at least for btrfs we would go the blk_alloc_discard_bio()
and do signal and freezing checks.
Thanks,
Qu
The fs trim loops over ranges and sends discard requests, some ranges
can be large so it's all transparently handled by blkdev_issue_discard()
and processed in smaller chunks.
To support cancellation (or suspend) requests we need to insert checks
into the the loop, exporting the symbol allows to reimplement
such loop with the desired behavior.
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
block/blk-lib.c | 1 +
include/linux/blkdev.h | 2 ++
2 files changed, 3 insertions(+)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..ebaef47d8ce7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
cond_resched();
return bio;
}
+EXPORT_SYMBOL_GPL(blk_alloc_discard_bio);
int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b7664d593486..f3631044d905 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void)
extern void blk_io_schedule(void);
+struct bio *blk_alloc_discard_bio(struct block_device *bdev,
+ sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask);
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask);
int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
--
2.46.0
On 9/3/24 1:39 AM, Luca Stefani wrote: > The fs trim loops over ranges and sends discard requests, some ranges > can be large so it's all transparently handled by blkdev_issue_discard() > and processed in smaller chunks. > > To support cancellation (or suspend) requests we need to insert checks > into the the loop, exporting the symbol allows to reimplement > such loop with the desired behavior. > > Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> > --- > block/blk-lib.c | 1 + > include/linux/blkdev.h | 2 ++ > 2 files changed, 3 insertions(+) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index 4c9f20a689f7..ebaef47d8ce7 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev, > cond_resched(); > return bio; > } > +EXPORT_SYMBOL_GPL(blk_alloc_discard_bio); > > int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, > sector_t nr_sects, gfp_t gfp_mask, struct bio **biop) > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index b7664d593486..f3631044d905 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void) > > extern void blk_io_schedule(void); > > +struct bio *blk_alloc_discard_bio(struct block_device *bdev, > + sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask); > int blkdev_issue_discard(struct block_device *bdev, sector_t sector, > sector_t nr_sects, gfp_t gfp_mask); > int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, Since blk_alloc_discard_bio() is already defined in a header (otherwise it would've been static and your export symbol above would have failed miserably), why add it to another header? -- Jens Axboe
On 03/09/24 17:49, Jens Axboe wrote: > On 9/3/24 1:39 AM, Luca Stefani wrote: >> The fs trim loops over ranges and sends discard requests, some ranges >> can be large so it's all transparently handled by blkdev_issue_discard() >> and processed in smaller chunks. >> >> To support cancellation (or suspend) requests we need to insert checks >> into the the loop, exporting the symbol allows to reimplement >> such loop with the desired behavior. >> >> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> >> --- >> block/blk-lib.c | 1 + >> include/linux/blkdev.h | 2 ++ >> 2 files changed, 3 insertions(+) >> >> diff --git a/block/blk-lib.c b/block/blk-lib.c >> index 4c9f20a689f7..ebaef47d8ce7 100644 >> --- a/block/blk-lib.c >> +++ b/block/blk-lib.c >> @@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev, >> cond_resched(); >> return bio; >> } >> +EXPORT_SYMBOL_GPL(blk_alloc_discard_bio); >> >> int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, >> sector_t nr_sects, gfp_t gfp_mask, struct bio **biop) >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >> index b7664d593486..f3631044d905 100644 >> --- a/include/linux/blkdev.h >> +++ b/include/linux/blkdev.h >> @@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void) >> >> extern void blk_io_schedule(void); >> >> +struct bio *blk_alloc_discard_bio(struct block_device *bdev, >> + sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask); >> int blkdev_issue_discard(struct block_device *bdev, sector_t sector, >> sector_t nr_sects, gfp_t gfp_mask); >> int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, > > Since blk_alloc_discard_bio() is already defined in a header (otherwise > it would've been static and your export symbol above would have failed > miserably), why add it to another header? > ACK, will remove the header change in v4, Thanks.
© 2016 - 2025 Red Hat, Inc.