Use mm_huge_zero_folio in __blkdev_issue_zero_pages(). Fallback to
ZERO_PAGE if mm_huge_zero_folio is not available.
On systems that allocates mm_huge_zero_folio, we will end up sending larger
bvecs instead of multiple small ones.
Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
gains might be bigger if the device supports bigger MDTS.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
block/blk-lib.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..0fd55e028170 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -4,6 +4,7 @@
*/
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>
@@ -196,6 +197,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned int flags)
{
+ struct folio *zero_folio;
+
+ zero_folio = mm_get_huge_zero_folio(NULL);
+ if (!zero_folio)
+ zero_folio = page_folio(ZERO_PAGE(0));
+
while (nr_sects) {
unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
struct bio *bio;
@@ -208,11 +215,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
break;
do {
- unsigned int len, added;
+ unsigned int len, added = 0;
- len = min_t(sector_t,
- PAGE_SIZE, nr_sects << SECTOR_SHIFT);
- added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
+ len = min_t(sector_t, folio_size(zero_folio),
+ nr_sects << SECTOR_SHIFT);
+ if (bio_add_folio(bio, zero_folio, len, 0))
+ added = len;
if (added < len)
break;
nr_sects -= added >> SECTOR_SHIFT;
--
2.47.2
On Tue, May 27, 2025 at 07:04:52AM +0200, Pankaj Raghav wrote: > Noticed a 4% increase in performance on a commercial NVMe SSD which does > not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance > gains might be bigger if the device supports bigger MDTS. Impressive gain on the one hand - on the other hand what is the macro workload that does a lot of zeroing on an SSD, because avoiding that should yield even better result while reducing wear.. > + unsigned int len, added = 0; > > + len = min_t(sector_t, folio_size(zero_folio), > + nr_sects << SECTOR_SHIFT); > + if (bio_add_folio(bio, zero_folio, len, 0)) > + added = len; > if (added < len) > break; > nr_sects -= added >> SECTOR_SHIFT; Unless I'm missing something the added variable can go away now, and the code using it can simply use len.
On 6/2/25 07:05, Christoph Hellwig wrote:
> On Tue, May 27, 2025 at 07:04:52AM +0200, Pankaj Raghav wrote:
>> Noticed a 4% increase in performance on a commercial NVMe SSD which does
>> not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
>> gains might be bigger if the device supports bigger MDTS.
>
> Impressive gain on the one hand - on the other hand what is the macro
> workload that does a lot of zeroing on an SSD, because avoiding that
> should yield even better result while reducing wear..
>
Absolutely. I think it is better to use either WRITE_ZEROES or DISCARD. But I wanted
to have some measurable workload to show the benefits of using a huge page to zero out.
Interestingly, I have seen many client SSDs not implementing WRITE_ZEROES.
>> + unsigned int len, added = 0;
>>
>> + len = min_t(sector_t, folio_size(zero_folio),
>> + nr_sects << SECTOR_SHIFT);
>> + if (bio_add_folio(bio, zero_folio, len, 0))
>> + added = len;
>> if (added < len)
>> break;
>> nr_sects -= added >> SECTOR_SHIFT;
>
> Unless I'm missing something the added variable can go away now, and
> the code using it can simply use len.
>
Yes. This should do it.
if (!bio_add_folio(bio, zero_folio, len, 0))
break;
nr_sects -= len >> SECTOR_SHIFT;
--
Pankaj
© 2016 - 2025 Red Hat, Inc.