[PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes

Leon Romanovsky posted 2 patches 2 weeks ago
Posted by Leon Romanovsky 2 weeks ago
From: Leon Romanovsky <leonro@nvidia.com>

Change the length variables from unsigned int to size_t. Using size_t
ensures that we can handle larger sizes, as it is always at least as
wide as the previously used u32.

Originally, u32 was used because the blk-mq-dma code evolved from a
scatter-gather implementation, which uses unsigned int to describe lengths.
This change also allows reusing the existing struct phys_vec in places
that don't need scatter-gather.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 block/blk-mq-dma.c      | 8 ++++++--
 drivers/nvme/host/pci.c | 4 ++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index e9108ccaf4b0..e7d9b54c3eed 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -8,7 +8,7 @@
 
 struct phys_vec {
 	phys_addr_t	paddr;
-	u32		len;
+	size_t		len;
 };
 
 static bool __blk_map_iter_next(struct blk_map_iter *iter)
@@ -112,8 +112,8 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 		struct phys_vec *vec)
 {
 	enum dma_data_direction dir = rq_dma_dir(req);
-	unsigned int mapped = 0;
 	unsigned int attrs = 0;
+	size_t mapped = 0;
 	int error;
 
 	iter->addr = state->addr;
@@ -296,6 +296,8 @@ int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist,
 	blk_rq_map_iter_init(rq, &iter);
 	while (blk_map_iter_next(rq, &iter, &vec)) {
 		*last_sg = blk_next_sg(last_sg, sglist);
+
+		WARN_ON_ONCE(overflows_type(vec.len, unsigned int));
 		sg_set_page(*last_sg, phys_to_page(vec.paddr), vec.len,
 				offset_in_page(vec.paddr));
 		nsegs++;
@@ -416,6 +418,8 @@ int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist)
 
 	while (blk_map_iter_next(rq, &iter, &vec)) {
 		sg = blk_next_sg(&sg, sglist);
+
+		WARN_ON_ONCE(overflows_type(vec.len, unsigned int));
 		sg_set_page(sg, phys_to_page(vec.paddr), vec.len,
 				offset_in_page(vec.paddr));
 		segments++;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5ca8301bb8b..b61ec62b0ec6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -290,14 +290,14 @@ struct nvme_iod {
 	u8 flags;
 	u8 nr_descriptors;
 
-	unsigned int total_len;
+	size_t total_len;
 	struct dma_iova_state dma_state;
 	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
 	struct nvme_dma_vec *dma_vecs;
 	unsigned int nr_dma_vecs;
 
 	dma_addr_t meta_dma;
-	unsigned int meta_total_len;
+	size_t meta_total_len;
 	struct dma_iova_state meta_dma_state;
 	struct nvme_sgl_desc *meta_descriptor;
 };

-- 
2.51.1
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Christoph Hellwig 1 week, 6 days ago
On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> This patch changes the length variables from unsigned int to size_t.
> Using size_t ensures that we can handle larger sizes, as size_t is
> always equal to or larger than the previously used u32 type.
> 
> Originally, u32 was used because blk-mq-dma code evolved from
> scatter-gather implementation, which uses unsigned int to describe length.
> This change will also allow us to reuse the existing struct phys_vec in places
> that don't need scatter-gather.
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  block/blk-mq-dma.c      | 8 ++++++--
>  drivers/nvme/host/pci.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> index e9108ccaf4b0..e7d9b54c3eed 100644
> --- a/block/blk-mq-dma.c
> +++ b/block/blk-mq-dma.c
> @@ -8,7 +8,7 @@
>  
>  struct phys_vec {
>  	phys_addr_t	paddr;
> -	u32		len;
> +	size_t		len;
>  };

So we're now going to increase memory usage by 50% again after just
reducing it by removing the scatterlist?
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Leon Romanovsky 1 week, 5 days ago
On Tue, Nov 18, 2025 at 06:03:11AM +0100, Christoph Hellwig wrote:
> On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > This patch changes the length variables from unsigned int to size_t.
> > Using size_t ensures that we can handle larger sizes, as size_t is
> > always equal to or larger than the previously used u32 type.
> > 
> > Originally, u32 was used because blk-mq-dma code evolved from
> > scatter-gather implementation, which uses unsigned int to describe length.
> > This change will also allow us to reuse the existing struct phys_vec in places
> > that don't need scatter-gather.
> > 
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > ---
> >  block/blk-mq-dma.c      | 8 ++++++--
> >  drivers/nvme/host/pci.c | 4 ++--
> >  2 files changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> > index e9108ccaf4b0..e7d9b54c3eed 100644
> > --- a/block/blk-mq-dma.c
> > +++ b/block/blk-mq-dma.c
> > @@ -8,7 +8,7 @@
> >  
> >  struct phys_vec {
> >  	phys_addr_t	paddr;
> > -	u32		len;
> > +	size_t		len;
> >  };
> 
> So we're now going to increase memory usage by 50% again after just
> reducing it by removing the scatterlist?

It is slightly less.

Before this change: 96 bits.
After this change (on a 64-bit system): 128 bits.

That is a 33% increase per structure.

So what is the resolution? Should I drop this patch or not?

Thanks
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by David Laight 1 week, 5 days ago
On Wed, 19 Nov 2025 11:55:16 +0200
Leon Romanovsky <leon@kernel.org> wrote:

> On Tue, Nov 18, 2025 at 06:03:11AM +0100, Christoph Hellwig wrote:
> > On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:  
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > This patch changes the length variables from unsigned int to size_t.
> > > Using size_t ensures that we can handle larger sizes, as size_t is
> > > always equal to or larger than the previously used u32 type.
> > > 
> > > Originally, u32 was used because blk-mq-dma code evolved from
> > > scatter-gather implementation, which uses unsigned int to describe length.
> > > This change will also allow us to reuse the existing struct phys_vec in places
> > > that don't need scatter-gather.
> > > 
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > ---
> > >  block/blk-mq-dma.c      | 8 ++++++--
> > >  drivers/nvme/host/pci.c | 4 ++--
> > >  2 files changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> > > index e9108ccaf4b0..e7d9b54c3eed 100644
> > > --- a/block/blk-mq-dma.c
> > > +++ b/block/blk-mq-dma.c
> > > @@ -8,7 +8,7 @@
> > >  
> > >  struct phys_vec {
> > >  	phys_addr_t	paddr;
> > > -	u32		len;
> > > +	size_t		len;
> > >  };  
> > 
> > So we're now going to increase memory usage by 50% again after just
> > reducing it by removing the scatterlist?  
> 
> It is slightly less.
> 
> Before this change: 96 bits

Did you actually look?
There will normally be 4 bytes of padding at the end of the structure.

About the only place where it will be 12 bytes is a 32-bit system with
a 64-bit phys_addr_t that aligns 64-bit items on 32-bit boundaries - so x86.

	David

> After this change (on 64bits system): 128 bits.
> 
> It is 33% increase per-structure.
> 
> So what is the resolution? Should I drop this patch or not?
> 
> Thanks 
>
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Leon Romanovsky 1 week, 5 days ago
On Wed, Nov 19, 2025 at 01:36:15PM +0000, David Laight wrote:
> On Wed, 19 Nov 2025 11:55:16 +0200
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > On Tue, Nov 18, 2025 at 06:03:11AM +0100, Christoph Hellwig wrote:
> > > On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:  
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > This patch changes the length variables from unsigned int to size_t.
> > > > Using size_t ensures that we can handle larger sizes, as size_t is
> > > > always equal to or larger than the previously used u32 type.
> > > > 
> > > > Originally, u32 was used because blk-mq-dma code evolved from
> > > > scatter-gather implementation, which uses unsigned int to describe length.
> > > > This change will also allow us to reuse the existing struct phys_vec in places
> > > > that don't need scatter-gather.
> > > > 
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > ---
> > > >  block/blk-mq-dma.c      | 8 ++++++--
> > > >  drivers/nvme/host/pci.c | 4 ++--
> > > >  2 files changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> > > > index e9108ccaf4b0..e7d9b54c3eed 100644
> > > > --- a/block/blk-mq-dma.c
> > > > +++ b/block/blk-mq-dma.c
> > > > @@ -8,7 +8,7 @@
> > > >  
> > > >  struct phys_vec {
> > > >  	phys_addr_t	paddr;
> > > > -	u32		len;
> > > > +	size_t		len;
> > > >  };  
> > > 
> > > So we're now going to increase memory usage by 50% again after just
> > > reducing it by removing the scatterlist?  
> > 
> > It is slightly less.
> > 
> > Before this change: 96 bits
> 
> Did you actually look?

No, I simply performed sizeof(phys_addr_t) + sizeof(size_t).

> There will normally be 4 bytes of padding at the end of the structure.
> 
> About the only place where it will be 12 bytes is a 32bit system with
> 64bit phyaddr that aligns 64bit items on 32bit boundaries - so x86.

So does that mean Christoph's comment about the size increase is not correct?

Thanks

> 
> 	David
> 
> > After this change (on 64bits system): 128 bits.
> > 
> > It is 33% increase per-structure.
> > 
> > So what is the resolution? Should I drop this patch or not?
> > 
> > Thanks 
> > 
>
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by David Laight 1 week, 5 days ago
On Wed, 19 Nov 2025 15:58:21 +0200
Leon Romanovsky <leon@kernel.org> wrote:

> On Wed, Nov 19, 2025 at 01:36:15PM +0000, David Laight wrote:
> > On Wed, 19 Nov 2025 11:55:16 +0200
> > Leon Romanovsky <leon@kernel.org> wrote:
> >   
> > > On Tue, Nov 18, 2025 at 06:03:11AM +0100, Christoph Hellwig wrote:  
> > > > On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:    
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > 
> > > > > This patch changes the length variables from unsigned int to size_t.
> > > > > Using size_t ensures that we can handle larger sizes, as size_t is
> > > > > always equal to or larger than the previously used u32 type.
> > > > > 
> > > > > Originally, u32 was used because blk-mq-dma code evolved from
> > > > > scatter-gather implementation, which uses unsigned int to describe length.
> > > > > This change will also allow us to reuse the existing struct phys_vec in places
> > > > > that don't need scatter-gather.
> > > > > 
> > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > ---
> > > > >  block/blk-mq-dma.c      | 8 ++++++--
> > > > >  drivers/nvme/host/pci.c | 4 ++--
> > > > >  2 files changed, 8 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> > > > > index e9108ccaf4b0..e7d9b54c3eed 100644
> > > > > --- a/block/blk-mq-dma.c
> > > > > +++ b/block/blk-mq-dma.c
> > > > > @@ -8,7 +8,7 @@
> > > > >  
> > > > >  struct phys_vec {
> > > > >  	phys_addr_t	paddr;
> > > > > -	u32		len;
> > > > > +	size_t		len;
> > > > >  };    
> > > > 
> > > > So we're now going to increase memory usage by 50% again after just
> > > > reducing it by removing the scatterlist?    
> > > 
> > > It is slightly less.
> > > 
> > > Before this change: 96 bits  
> > 
> > Did you actually look?  
> 
> No, I simply performed sizeof(phys_addr_t) + sizeof(size_t).
> 
> > There will normally be 4 bytes of padding at the end of the structure.
> > 
> > About the only place where it will be 12 bytes is a 32bit system with
> > 64bit phyaddr that aligns 64bit items on 32bit boundaries - so x86.  
> 
> So does it mean that Christoph's comment about size increase is not correct?

Correct - i.e. there is no size increase.

> 
> Thanks
> 
> > 
> > 	David
> >   
> > > After this change (on 64bits system): 128 bits.
> > > 
> > > It is 33% increase per-structure.
> > > 
> > > So what is the resolution? Should I drop this patch or not?
> > > 
> > > Thanks 
> > >   
> >
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Christoph Hellwig 1 week, 5 days ago
On Wed, Nov 19, 2025 at 11:55:16AM +0200, Leon Romanovsky wrote:
> So what is the resolution? Should I drop this patch or not?

I think it should be dropped.  I don't think having to use one 96-bit
structure per 4GB worth of memory should be a deal breaker.
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Leon Romanovsky 1 week, 5 days ago
On Wed, Nov 19, 2025 at 11:10:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 19, 2025 at 11:55:16AM +0200, Leon Romanovsky wrote:
> > So what is the resolution? Should I drop this patch or not?
> 
> I think it should be dropped.  I don't think having to use one 96-bit
> structure per 4GB worth of memory should be a deal breaker.

Christoph,

It seems like there will be no size change before and after my change.
https://lore.kernel.org/linux-nvme/20251117-nvme-phys-types-v2-0-c75a60a2c468@nvidia.com/T/#ma575c050517e91e7630683cf193e39d812338fa4

Thanks
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Leon Romanovsky 1 week, 5 days ago
On Wed, Nov 19, 2025 at 11:10:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 19, 2025 at 11:55:16AM +0200, Leon Romanovsky wrote:
> > So what is the resolution? Should I drop this patch or not?
> 
> I think it should be dropped.  I don't think having to use one 96-bit
> structure per 4GB worth of memory should be a deal breaker.

No problem, will drop.
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Keith Busch 2 weeks ago
On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index e5ca8301bb8b..b61ec62b0ec6 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -290,14 +290,14 @@ struct nvme_iod {
>  	u8 flags;
>  	u8 nr_descriptors;
>  
> -	unsigned int total_len;
> +	size_t total_len;

Changing the generic phys_vec sounds fine, but the nvme driver has an 8MB
limitation on how large an IO can be, so I don't think the driver's
length needs to match the phys_vec type.
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Christoph Hellwig 1 week, 6 days ago
On Mon, Nov 17, 2025 at 12:35:40PM -0700, Keith Busch wrote:
> > +	size_t total_len;
> 
> Changing the generic phys_vec sounds fine, but the nvme driver has a 8MB
> limitation on how large an IO can be, so I don't think the driver's
> length needs to match the phys_vec type.

With the new dma mapping interface we could lift that limit for
SGL-based controllers, as we basically only have a nr_segments limit now.
Not that I'm trying to argue for multi-GB I/O..
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Keith Busch 1 week, 6 days ago
On Tue, Nov 18, 2025 at 06:18:23AM +0100, Christoph Hellwig wrote:
> On Mon, Nov 17, 2025 at 12:35:40PM -0700, Keith Busch wrote:
> > > +	size_t total_len;
> > 
> > Changing the generic phys_vec sounds fine, but the nvme driver has a 8MB
> > limitation on how large an IO can be, so I don't think the driver's
> > length needs to match the phys_vec type.
> 
> With the new dma mapping interface we could lift that limits for
> SGL-based controllers as we basically only have a nr_segments limit now.
> Not that I'm trying to argue for multi-GB I/O..

It's not a bad idea. The tricky part is the timeout handling. If
we allow very large IO, I think we need a dynamic timeout value to
account for the link's throughput. We can already trigger blk-mq
timeouts if you saturate enough queues with max-sized IO, despite
everything else working as designed.
Re: [PATCH v2 1/2] nvme-pci: Use size_t for length fields to handle larger sizes
Posted by Leon Romanovsky 2 weeks ago
On Mon, Nov 17, 2025 at 12:35:40PM -0700, Keith Busch wrote:
> On Mon, Nov 17, 2025 at 09:22:43PM +0200, Leon Romanovsky wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index e5ca8301bb8b..b61ec62b0ec6 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -290,14 +290,14 @@ struct nvme_iod {
> >  	u8 flags;
> >  	u8 nr_descriptors;
> >  
> > -	unsigned int total_len;
> > +	size_t total_len;
> 
> Changing the generic phys_vec sounds fine, but the nvme driver has a 8MB
> limitation on how large an IO can be, so I don't think the driver's
> length needs to match the phys_vec type.

I'm a big fan of keeping the same types in all places, but I can drop the
nvme changes if you think that is the right thing to do.

Thanks