Per Qu Wenruo: on a very large disk, e.g. an 8TiB device that is mostly
empty, we do split the discard according to the super block locations,
but the last super block ends at 256GiB, so we can still submit a huge
discard for the range [256GiB, 8TiB), causing a very large delay.

We now split the space left to discard into chunks based on the maximum
data chunk size (10GiB) to solve the problem.
Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Closes: https://lore.kernel.org/lkml/2e15214b-7e95-4e64-a899-725de12c9037@gmail.com/T/#mdfef1d8b36334a15c54cd009f6aadf49e260e105
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index feec49e6f9c8..6ad92876bca0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1239,7 +1239,7 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 			       u64 *discarded_bytes)
 {
 	int j, ret = 0;
-	u64 bytes_left, end;
+	u64 bytes_left, bytes_to_discard, end;
 	u64 aligned_start = ALIGN(start, 1 << SECTOR_SHIFT);
 
 	/* Adjust the range to be aligned to 512B sectors if necessary. */
@@ -1300,13 +1300,25 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 		bytes_left = end - start;
 	}
 
-	if (bytes_left) {
+	while (bytes_left) {
+		if (bytes_left > BTRFS_MAX_DATA_CHUNK_SIZE)
+			bytes_to_discard = BTRFS_MAX_DATA_CHUNK_SIZE;
+		else
+			bytes_to_discard = bytes_left;
+
 		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
-					   bytes_left >> SECTOR_SHIFT,
+					   bytes_to_discard >> SECTOR_SHIFT,
 					   GFP_NOFS);
+
 		if (!ret)
-			*discarded_bytes += bytes_left;
+			*discarded_bytes += bytes_to_discard;
+		else if (ret != -EOPNOTSUPP)
+			return ret;
+
+		start += bytes_to_discard;
+		bytes_left -= bytes_to_discard;
 	}
+
 	return ret;
 }
 
--
2.46.0
On Mon, Sep 02, 2024 at 01:43:00PM +0200, Luca Stefani wrote:
> Per Qu Wenruo: on a very large disk, e.g. an 8TiB device that is mostly
> empty, we do split the discard according to the super block locations,
> but the last super block ends at 256GiB, so we can still submit a huge
> discard for the range [256GiB, 8TiB), causing a very large delay.
I'm not sure that this will be different than what we already do, or
have the large delays been observed in practice? The range passed to
blkdev_issue_discard() might be large but internally it's still split to
smaller sizes depending on the queue limits, IOW the device.
Bio is allocated and limited by bio_discard_limit(bdev, *sector);
https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L38
struct bio *blk_alloc_discard_bio(struct block_device *bdev,
		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask)
{
	sector_t bio_sects = min(*nr_sects, bio_discard_limit(bdev, *sector));
	struct bio *bio;

	if (!bio_sects)
		return NULL;

	bio = bio_alloc(bdev, 0, REQ_OP_DISCARD, gfp_mask);
	...
Then used in __blkdev_issue_discard()
https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L63
int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
{
	struct bio *bio;

	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
			gfp_mask)))
		*biop = bio_chain_and_submit(*biop, bio);
	return 0;
}
This is basically just a loop, chopping the input range as needed. The
btrfs code does effectively the same, with only the superblock handling,
progress accounting and error handling added on top.

As the maximum size of a single discard request depends on the device,
we don't need to artificially limit it; that would require more IO
requests and can be slower.
On 02/09/24 22:11, David Sterba wrote:
> On Mon, Sep 02, 2024 at 01:43:00PM +0200, Luca Stefani wrote:
>> Per Qu Wenruo: on a very large disk, e.g. an 8TiB device that is mostly
>> empty, we do split the discard according to the super block locations,
>> but the last super block ends at 256GiB, so we can still submit a huge
>> discard for the range [256GiB, 8TiB), causing a very large delay.
>
> I'm not sure that this will be different than what we already do, or
> have the large delays been observed in practice? The range passed to
> blkdev_issue_discard() might be large but internally it's still split to
> smaller sizes depending on the queue limits, IOW the device.
>
> Bio is allocated and limited by bio_discard_limit(bdev, *sector);
> https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L38
>
> struct bio *blk_alloc_discard_bio(struct block_device *bdev,
> 		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask)
> {
> 	sector_t bio_sects = min(*nr_sects, bio_discard_limit(bdev, *sector));
> 	struct bio *bio;
>
> 	if (!bio_sects)
> 		return NULL;
>
> 	bio = bio_alloc(bdev, 0, REQ_OP_DISCARD, gfp_mask);
> ...
>
>
> Then used in __blkdev_issue_discard()
> https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L63
>
> int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> {
> 	struct bio *bio;
>
> 	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> 			gfp_mask)))
> 		*biop = bio_chain_and_submit(*biop, bio);
> 	return 0;
> }
>
> This is basically just a loop, chopping the input range as needed. The
> btrfs code does effectively the same, with only the superblock handling,
> progress accounting and error handling added on top.
>
> As the maximum size of a single discard request depends on the device,
> we don't need to artificially limit it; that would require more IO
> requests and can be slower.
Thanks for taking a look. This change was prompted by issues I have been
seeing where the discard kthread blocks a userspace process, preventing
the device from suspending.
https://lore.kernel.org/lkml/20240822164908.4957-1-luca.stefani.ge1@gmail.com/
is the proposed solution, but Qu mentioned that there is another place
where it could happen that I didn't cover, and I think the change here
(unless it's the wrong place) lets me add similar
`btrfs_trim_interrupted` checks to stop early.
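Concretely, the chunked loop gives a natural place for such a cancellation
point. A rough, untested sketch follows; btrfs_trim_interrupted() is the
helper proposed in the linked series (an assumption here, not existing
mainline code), built on fatal_signal_pending() from <linux/sched/signal.h>
and freezing() from <linux/freezer.h>:

/*
 * Assumed helper from the linked series: report whether the trim/discard
 * work should stop, e.g. because the task got a fatal signal or the
 * system is freezing for suspend.
 */
static inline bool btrfs_trim_interrupted(void)
{
	return fatal_signal_pending(current) || freezing(current);
}

	/* Possible placement inside the chunked loop introduced above: */
	while (bytes_left) {
		if (btrfs_trim_interrupted())
			break;	/* leave the rest of the range untrimmed */

		if (bytes_left > BTRFS_MAX_DATA_CHUNK_SIZE)
			bytes_to_discard = BTRFS_MAX_DATA_CHUNK_SIZE;
		else
			bytes_to_discard = bytes_left;

		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
					   bytes_to_discard >> SECTOR_SHIFT,
					   GFP_NOFS);
		if (!ret)
			*discarded_bytes += bytes_to_discard;
		else if (ret != -EOPNOTSUPP)
			return ret;

		start += bytes_to_discard;
		bytes_left -= bytes_to_discard;
	}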
Please let me know if that makes sense to you; if it does, I guess it
would make sense to send the two patches together?
Luca.
On Mon, Sep 02, 2024 at 10:17:37PM +0200, Luca Stefani wrote:
> > 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> > {
> > 	struct bio *bio;
> >
> > 	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> > 			gfp_mask)))
> > 		*biop = bio_chain_and_submit(*biop, bio);
> > 	return 0;
> > }
> >
> > This is basically just a loop, chopping the input range as needed. The
> > btrfs code does effectively the same, with only the superblock handling,
> > progress accounting and error handling added on top.
> >
> > As the maximum size of a single discard request depends on the device,
> > we don't need to artificially limit it; that would require more IO
> > requests and can be slower.
>
> Thanks for taking a look. This change was prompted by issues I have been
> seeing where the discard kthread blocks a userspace process, preventing
> the device from suspending.
> https://lore.kernel.org/lkml/20240822164908.4957-1-luca.stefani.ge1@gmail.com/
> is the proposed solution, but Qu mentioned that there is another place
> where it could happen that I didn't cover, and I think the change here
> (unless it's the wrong place) lets me add similar
> `btrfs_trim_interrupted` checks to stop early.
>
> Please let me know if that makes sense to you; if it does, I guess it
> would make sense to send the two patches together?
Yeah, for inserting the cancellation points it would make sense to do
the chunking. I'd suggest using the same logic as blk_alloc_discard_bio()
and capping at the block device's discard request limit rather than a
fixed constant.
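For illustration, a rough, untested sketch of that suggestion.
bdev_max_discard_sectors() is the generic block-layer helper that reports
the device's discard limit; the fallback to BTRFS_MAX_DATA_CHUNK_SIZE is
only an assumption to keep the sketch self-contained when no limit is
reported:

	/*
	 * Sketch: cap each chunk at the device's own discard limit instead
	 * of a fixed constant.  If the device reports no limit, fall back
	 * to the patch's constant so the loop still makes progress.
	 */
	u64 max_discard_bytes =
		(u64)bdev_max_discard_sectors(bdev) << SECTOR_SHIFT;

	if (!max_discard_bytes)
		max_discard_bytes = BTRFS_MAX_DATA_CHUNK_SIZE;

	while (bytes_left) {
		u64 bytes_to_discard = min(bytes_left, max_discard_bytes);

		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
					   bytes_to_discard >> SECTOR_SHIFT,
					   GFP_NOFS);
		if (!ret)
			*discarded_bytes += bytes_to_discard;
		else if (ret != -EOPNOTSUPP)
			return ret;

		start += bytes_to_discard;
		bytes_left -= bytes_to_discard;
	}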