[v3] btrfs: Don't block suspend during fstrim

[PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Luca Stefani 1 year, 5 months ago

Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a super large delay.

We now split the space left to discard based the block discard limit
in preparation of introduction of cancellation signals handling.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
 fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a5966324607d..9c1ddf13659e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 	}
 
 	if (bytes_left) {
-		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
-					   bytes_left >> SECTOR_SHIFT,
-					   GFP_NOFS);
-		if (!ret)
-			*discarded_bytes += bytes_left;
+		u64 bytes_to_discard;
+		struct bio *bio = NULL;
+		sector_t sector = start >> SECTOR_SHIFT;
+		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
+
+		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
+				GFP_NOFS))) {
+			ret = submit_bio_wait(bio);
+			bio_put(bio);
+
+			if (!ret)
+				bytes_to_discard = bio->bi_iter.bi_size;
+			else if (ret != -EOPNOTSUPP)
+				return ret;
+
+			start += bytes_to_discard;
+			bytes_left -= bytes_to_discard;
+		}
 	}
+
 	return ret;
 }
 
-- 
2.46.0

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by kernel test robot 1 year, 5 months ago

Hi Luca,

kernel test robot noticed the following build errors:

[auto build test ERROR on v6.11-rc6]
[also build test ERROR on linus/master]
[cannot apply to kdave/for-next next-20240906]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luca-Stefani/btrfs-Always-update-fstrim_range-on-failure/20240903-191851
base:   v6.11-rc6
patch link:    https://lore.kernel.org/r/20240903071625.957275-3-luca.stefani.ge1%40gmail.com
patch subject: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409070311.SB9wmgSx-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "blk_alloc_discard_bio" [fs/btrfs/btrfs.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by David Sterba 1 year, 5 months ago

On Tue, Sep 03, 2024 at 09:16:11AM +0200, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
> 
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
>  fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  	}
>  
>  	if (bytes_left) {
> -		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> -					   bytes_left >> SECTOR_SHIFT,
> -					   GFP_NOFS);
> -		if (!ret)
> -			*discarded_bytes += bytes_left;
> +		u64 bytes_to_discard;
> +		struct bio *bio = NULL;
> +		sector_t sector = start >> SECTOR_SHIFT;
> +		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> +		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +				GFP_NOFS))) {
> +			ret = submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (!ret)
> +				bytes_to_discard = bio->bi_iter.bi_size;
> +			else if (ret != -EOPNOTSUPP)
> +				return ret;
> +
> +			start += bytes_to_discard;
> +			bytes_left -= bytes_to_discard;
> +		}

This is not what I anticipated, we only wanted to know the optimal
request size but now it's reimplementing the bio submission and compared
to blkdev_issue_discard() it lacks blk_start_plug/blk_finish_plug.

As we won't get the bio_discard_limit() export for some reason I suggest
to go back to setting the maximum chunk limit in our code and set it to
something like 8G.

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Luca Stefani 1 year, 5 months ago


On 03/09/24 09:16, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
> 
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
>   fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
>   1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>   	}
>   
>   	if (bytes_left) {
> -		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> -					   bytes_left >> SECTOR_SHIFT,
> -					   GFP_NOFS);
> -		if (!ret)
> -			*discarded_bytes += bytes_left;
I removed this by mistake, will be re-added.
> +		u64 bytes_to_discard;
> +		struct bio *bio = NULL;
> +		sector_t sector = start >> SECTOR_SHIFT;
> +		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> +		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +				GFP_NOFS))) {
> +			ret = submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (!ret)
> +				bytes_to_discard = bio->bi_iter.bi_size;
> +			else if (ret != -EOPNOTSUPP)
> +				return ret;
I think I got the logic wrong, we probably want to `continue` in case 
ret is set, but it's not -EOPNOTSUPP, otherwise bytes_to_discard might 
be left uninitialized.
bio->bi_iter.bi_size can be used directly for all those cases, so I'll 
remove bytes_to_discard as a whole.
> +
> +			start += bytes_to_discard;
> +			bytes_left -= bytes_to_discard;
> +		}
>   	}
> +
>   	return ret;
>   }
>   

I'll fix those up for v4, but I'll wait for more comments before doing so.

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Christoph Hellwig 1 year, 5 months ago

You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Qu Wenruo 1 year, 5 months ago


在 2024/9/3 16:57, Christoph Hellwig 写道:
> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.

Just curious, shouldn't blk_ioctl_discard() also check frozen status to
prevent block device trim from block suspension?

And if that's the case, would it make more sense to move the signal and
freezing checks into blk_alloc_discard_bio()?

Thanks,
Qu

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Christoph Hellwig 1 year, 5 months ago

On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/9/3 16:57, Christoph Hellwig 写道:
> > You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
> 
> Just curious, shouldn't blk_ioctl_discard() also check frozen status to
> prevent block device trim from block suspension?

Someone needs to explain what 'block suspension' is for that first.

> And if that's the case, would it make more sense to move the signal and
> freezing checks into blk_alloc_discard_bio()?

No.  Look at that the recent history of the dicard support.  We tried
that and it badly breaks callers not expecting the signal handling.

Re: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks

Posted by Qu Wenruo 1 year, 5 months ago


在 2024/9/4 13:43, Christoph Hellwig 写道:
> On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/3 16:57, Christoph Hellwig 写道:
>>> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
>>
>> Just curious, shouldn't blk_ioctl_discard() also check frozen status to
>> prevent block device trim from block suspension?
> 
> Someone needs to explain what 'block suspension' is for that first.

E.g. a long running full device trim preventing PM suspension/hibernation.

The original report shows the fstrim of btrfs is preventing the system 
entering suspension/hibernation:

  PM: suspend entry (deep)
  Filesystems sync: 0.060 seconds
  Freezing user space processes
  Freezing user space processes failed after 20.007 seconds (1 tasks 
refusing to freeze, wq_busy=0):
  task:fstrim          state:D stack:0     pid:15564 tgid:15564 ppid:1 
    flags:0x00004006
  Call Trace:
   <TASK>
   __schedule+0x381/0x1540
   schedule+0x24/0xb0
   schedule_timeout+0x1ea/0x2a0
   io_schedule_timeout+0x19/0x50
   wait_for_completion_io+0x78/0x140
   submit_bio_wait+0xaa/0xc0
   blkdev_issue_discard+0x65/0xb0
   btrfs_issue_discard+0xcf/0x160 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_discard_extent+0x120/0x2a0 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   do_trimming+0xd4/0x220 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   trim_bitmaps+0x418/0x520 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_trim_block_group+0xcb/0x130 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_trim_fs+0x119/0x460 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_ioctl_fitrim+0xfb/0x160 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_ioctl+0x11cc/0x29f0 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   __x64_sys_ioctl+0x92/0xd0
   do_syscall_64+0x5b/0x80
   entry_SYSCALL_64_after_hwframe+0x7c/0xe6
  RIP: 0033:0x7f5f3b529f9b
  RSP: 002b:00007fff279ebc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fff279ebd60 RCX: 00007f5f3b529f9b
  RDX: 00007fff279ebc90 RSI: 00000000c0185879 RDI: 0000000000000003
  RBP: 000055748718b2d0 R08: 00005574871899e8 R09: 00007fff279eb010
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
  R13: 000055748718ac40 R14: 000055748718b290 R15: 000055748718b290
   </TASK>
  OOM killer enabled.
  Restarting tasks ... done.
  random: crng reseeded on system resumption
  PM: suspend exit
  PM: suspend entry (s2idle)
  Filesystems sync: 0.047 seconds

Considering blkdev_issue_discard() is already doing cond_sched() and 
btrfs is also doing cond_sched() for each block group, it looks like PM 
freezing needs its own freezing() checks, just like what ext4 is doing.

> 
>> And if that's the case, would it make more sense to move the signal and
>> freezing checks into blk_alloc_discard_bio()?
> 
> No.  Look at that the recent history of the dicard support.  We tried
> that and it badly breaks callers not expecting the signal handling.

OK, got it, at least for btrfs we would go the blk_alloc_discard_bio() 
and do signal and freezing checks.

Thanks,
Qu

[PATCH] block: Export blk_alloc_discard_bio

Posted by Luca Stefani 1 year, 5 months ago

The fs trim loops over ranges and sends discard requests, some ranges
can be large so it's all transparently handled by blkdev_issue_discard()
and processed in smaller chunks.

To support cancellation (or suspend) requests we need to insert checks
into the the loop, exporting the symbol allows to reimplement
such loop with the desired behavior.

Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
 block/blk-lib.c        | 1 +
 include/linux/blkdev.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..ebaef47d8ce7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
 	cond_resched();
 	return bio;
 }
+EXPORT_SYMBOL_GPL(blk_alloc_discard_bio);
 
 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b7664d593486..f3631044d905 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void)
 
 extern void blk_io_schedule(void);
 
+struct bio *blk_alloc_discard_bio(struct block_device *bdev,
+		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask);
 int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask);
 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
-- 
2.46.0

Re: [PATCH] block: Export blk_alloc_discard_bio

Posted by Jens Axboe 1 year, 5 months ago

On 9/3/24 1:39 AM, Luca Stefani wrote:
> The fs trim loops over ranges and sends discard requests, some ranges
> can be large so it's all transparently handled by blkdev_issue_discard()
> and processed in smaller chunks.
> 
> To support cancellation (or suspend) requests we need to insert checks
> into the the loop, exporting the symbol allows to reimplement
> such loop with the desired behavior.
> 
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
>  block/blk-lib.c        | 1 +
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 4c9f20a689f7..ebaef47d8ce7 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
>  	cond_resched();
>  	return bio;
>  }
> +EXPORT_SYMBOL_GPL(blk_alloc_discard_bio);
>  
>  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b7664d593486..f3631044d905 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void)
>  
>  extern void blk_io_schedule(void);
>  
> +struct bio *blk_alloc_discard_bio(struct block_device *bdev,
> +		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask);
>  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask);
>  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,

Since blk_alloc_discard_bio() is already defined in a header (otherwise
it would've been static and your export symbol above would have failed
miserably), why add it to another header?

-- 
Jens Axboe

Re: [PATCH] block: Export blk_alloc_discard_bio

Posted by Luca Stefani 1 year, 5 months ago


On 03/09/24 17:49, Jens Axboe wrote:
> On 9/3/24 1:39 AM, Luca Stefani wrote:
>> The fs trim loops over ranges and sends discard requests, some ranges
>> can be large so it's all transparently handled by blkdev_issue_discard()
>> and processed in smaller chunks.
>>
>> To support cancellation (or suspend) requests we need to insert checks
>> into the the loop, exporting the symbol allows to reimplement
>> such loop with the desired behavior.
>>
>> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
>> ---
>>   block/blk-lib.c        | 1 +
>>   include/linux/blkdev.h | 2 ++
>>   2 files changed, 3 insertions(+)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index 4c9f20a689f7..ebaef47d8ce7 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -59,6 +59,7 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
>>   	cond_resched();
>>   	return bio;
>>   }
>> +EXPORT_SYMBOL_GPL(blk_alloc_discard_bio);
>>   
>>   int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>   		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index b7664d593486..f3631044d905 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1088,6 +1088,8 @@ static inline long nr_blockdev_pages(void)
>>   
>>   extern void blk_io_schedule(void);
>>   
>> +struct bio *blk_alloc_discard_bio(struct block_device *bdev,
>> +		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask);
>>   int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>   		sector_t nr_sects, gfp_t gfp_mask);
>>   int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> 
> Since blk_alloc_discard_bio() is already defined in a header (otherwise
> it would've been static and your export symbol above would have failed
> miserably), why add it to another header?
> 

ACK, will remove the header change in v4,

Thanks.