[v1] nvmet: Make blksize_shift configurable

[RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Richard Weinberger 9 months, 3 weeks ago

Currently, the block size is automatically configured, and for
file-backed namespaces it is likely to be 4K.
While this is a reasonable default for modern storage, it can
cause confusion if someone wants to export a pre-created disk image
that uses a 512-byte block size.
As a result, partition parsing will fail.

So, just like we already do for the loop block device, let the user
configure the block size if they know better.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 drivers/nvme/target/configfs.c    | 30 ++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-bdev.c |  4 +++-
 drivers/nvme/target/io-cmd-file.c | 15 +++++++++------
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e44ef69dffc2..2fd9cc3b1d00 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -797,6 +797,35 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
 }
 CONFIGFS_ATTR(nvmet_ns_, resv_enable);
 
+static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
+{
+	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
+}
+
+static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	u32 shift;
+	int ret;
+
+	ret = kstrtou32(page, 0, &shift);
+	if (ret)
+		return ret;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("the ns:%d is already enabled.\n", ns->nsid);
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+	ns->blksize_shift = shift;
+	mutex_unlock(&ns->subsys->lock);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
+
 static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_device_path,
 	&nvmet_ns_attr_device_nguid,
@@ -806,6 +835,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_buffered_io,
 	&nvmet_ns_attr_revalidate_size,
 	&nvmet_ns_attr_resv_enable,
+	&nvmet_ns_attr_blksize_shift,
 #ifdef CONFIG_PCI_P2PDMA
 	&nvmet_ns_attr_p2pmem,
 #endif
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df..a86010af4670 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -100,7 +100,9 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 	}
 	ns->bdev = file_bdev(ns->bdev_file);
 	ns->size = bdev_nr_bytes(ns->bdev);
-	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
+
+	if (!ns->blksize_shift)
+		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
 
 	ns->pi_type = 0;
 	ns->metadata_size = 0;
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129..5893b64179fb 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -49,12 +49,15 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
 
 	nvmet_file_ns_revalidate(ns);
 
-	/*
-	 * i_blkbits can be greater than the universally accepted upper bound,
-	 * so make sure we export a sane namespace lba_shift.
-	 */
-	ns->blksize_shift = min_t(u8,
-			file_inode(ns->file)->i_blkbits, 12);
+	if (!ns->blksize_shift) {
+		/*
+		 * i_blkbits can be greater than the universally accepted
+		 * upper bound, so make sure we export a sane namespace
+		 * lba_shift.
+		 */
+		ns->blksize_shift = min_t(u8,
+				file_inode(ns->file)->i_blkbits, 12);
+	}
 
 	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
 			mempool_free_slab, nvmet_bvec_cache);
-- 
2.48.1

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Chaitanya Kulkarni 9 months, 3 weeks ago

On 4/18/25 02:08, Richard Weinberger wrote:
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> +{
> +	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
> +}
> +

I believe ns->blksize_shift is u32 to do we need %u ? unless we need to
re-interpret that value in int which I failed to understand.

if not please ignore this comment ...

-ck

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Chaitanya Kulkarni 9 months, 3 weeks ago

On 4/18/25 02:08, Richard Weinberger wrote:
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	u32 shift;
> +	int ret;
> +
> +	ret = kstrtou32(page, 0, &shift);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +	ns->blksize_shift = shift;
> +	mutex_unlock(&ns->subsys->lock);
> +
> +	return count;
> +}

before we overwrite the old value in ns->blksize_shift don't we need
to make sure new value shift is in the acceptable range ? in case value
is not within the acceptable range then we need to return -EINVAL ?

Also, do we need to #define min and max values for the acceptable
range?

-ck

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Damien Le Moal 9 months, 3 weeks ago

On 4/18/25 18:08, Richard Weinberger wrote:
> Currently, the block size is automatically configured, and for
> file-backed namespaces it is likely to be 4K.
> While this is a reasonable default for modern storage, it can
> cause confusion if someone wants to export a pre-created disk image
> that uses a 512-byte block size.
> As a result, partition parsing will fail.
> 
> So, just like we already do for the loop block device, let the user
> configure the block size if they know better.
> 
> Signed-off-by: Richard Weinberger <richard@nod.at>
> ---
>  drivers/nvme/target/configfs.c    | 30 ++++++++++++++++++++++++++++++
>  drivers/nvme/target/io-cmd-bdev.c |  4 +++-
>  drivers/nvme/target/io-cmd-file.c | 15 +++++++++------
>  3 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index e44ef69dffc2..2fd9cc3b1d00 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -797,6 +797,35 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
>  }
>  CONFIGFS_ATTR(nvmet_ns_, resv_enable);
>  
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> +{
> +	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
> +}
> +
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	u32 shift;
> +	int ret;
> +
> +	ret = kstrtou32(page, 0, &shift);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +	ns->blksize_shift = shift;
> +	mutex_unlock(&ns->subsys->lock);
> +
> +	return count;
> +}
> +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> +
>  static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_device_path,
>  	&nvmet_ns_attr_device_nguid,
> @@ -806,6 +835,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_buffered_io,
>  	&nvmet_ns_attr_revalidate_size,
>  	&nvmet_ns_attr_resv_enable,
> +	&nvmet_ns_attr_blksize_shift,
>  #ifdef CONFIG_PCI_P2PDMA
>  	&nvmet_ns_attr_p2pmem,
>  #endif
> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> index 83be0657e6df..a86010af4670 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -100,7 +100,9 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>  	}
>  	ns->bdev = file_bdev(ns->bdev_file);
>  	ns->size = bdev_nr_bytes(ns->bdev);
> -	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> +
> +	if (!ns->blksize_shift)
> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));

If the user set logical block size is smaller than the block dev logical block
size, this is not going to work... No ? Am I missing something ?

>  
>  	ns->pi_type = 0;
>  	ns->metadata_size = 0;
> diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> index 2d068439b129..5893b64179fb 100644
> --- a/drivers/nvme/target/io-cmd-file.c
> +++ b/drivers/nvme/target/io-cmd-file.c
> @@ -49,12 +49,15 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
>  
>  	nvmet_file_ns_revalidate(ns);
>  
> -	/*
> -	 * i_blkbits can be greater than the universally accepted upper bound,
> -	 * so make sure we export a sane namespace lba_shift.
> -	 */
> -	ns->blksize_shift = min_t(u8,
> -			file_inode(ns->file)->i_blkbits, 12);
> +	if (!ns->blksize_shift) {
> +		/*
> +		 * i_blkbits can be greater than the universally accepted
> +		 * upper bound, so make sure we export a sane namespace
> +		 * lba_shift.
> +		 */
> +		ns->blksize_shift = min_t(u8,
> +				file_inode(ns->file)->i_blkbits, 12);

This will work for any block size, regardless of the FS block size, but only if
ns->buffered_io is true. Doesn't this require some more checks with regards to
O_DIRECT (!ns->buffered_io case) ?

> +	}
>  
>  	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
>  			mempool_free_slab, nvmet_bvec_cache);


-- 
Damien Le Moal
Western Digital Research

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Richard Weinberger 9 months, 3 weeks ago

On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
> > +	if (!ns->blksize_shift)
> > +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> 
> If the user set logical block size is smaller than the block dev logical block
> size, this is not going to work... No ? Am I missing something ?

Likely, yes.
TBH, I'm not sure whether it makes actually sense for the bdev case to make
blksize_shift configurable.
The case I see most benefit is the backing file case.

> > +	if (!ns->blksize_shift) {
> > +		/*
> > +		 * i_blkbits can be greater than the universally accepted
> > +		 * upper bound, so make sure we export a sane namespace
> > +		 * lba_shift.
> > +		 */
> > +		ns->blksize_shift = min_t(u8,
> > +				file_inode(ns->file)->i_blkbits, 12);
> 
> This will work for any block size, regardless of the FS block size, but only if
> ns->buffered_io is true. Doesn't this require some more checks with regards to
> O_DIRECT (!ns->buffered_io case) ?

Good catch. I'll add a check.

It's also worth discussing whether we should limit blksize_shift to a specific
range. Right now, any shift is accepted, and it is up to the user to
use a sane value.

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Christoph Hellwig 9 months, 3 weeks ago

On Fri, Apr 18, 2025 at 11:56:30AM +0200, Richard Weinberger wrote:
> TBH, I'm not sure whether it makes actually sense for the bdev case to make
> blksize_shift configurable.

> It's also worth discussing whether we should limit blksize_shift to a specific
> range. Right now, any shift is accepted, and it is up to the user to
> use a sane value.

It should have a hard lower bound of 512 bytes (9) and for direct I/O and
the bdev backend the reported dio alignment.

The upper bound is capped by the nvme_lbaf field being a 8-bit value on
the wire, no real need to require anything lower than that probably.

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Damien Le Moal 9 months, 3 weeks ago

On 4/18/25 18:56, Richard Weinberger wrote:
> On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
>>> +	if (!ns->blksize_shift)
>>> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>>
>> If the user set logical block size is smaller than the block dev logical block
>> size, this is not going to work... No ? Am I missing something ?
> 
> Likely, yes.
> TBH, I'm not sure whether it makes actually sense for the bdev case to make
> blksize_shift configurable.

Probably not... I do understand the value for the file case though.

> The case I see most benefit is the backing file case.
> 
>>> +	if (!ns->blksize_shift) {
>>> +		/*
>>> +		 * i_blkbits can be greater than the universally accepted
>>> +		 * upper bound, so make sure we export a sane namespace
>>> +		 * lba_shift.
>>> +		 */
>>> +		ns->blksize_shift = min_t(u8,
>>> +				file_inode(ns->file)->i_blkbits, 12);
>>
>> This will work for any block size, regardless of the FS block size, but only if
>> ns->buffered_io is true. Doesn't this require some more checks with regards to
>> O_DIRECT (!ns->buffered_io case) ?
> 
> Good catch. I'll add a check.

And by the way, you need to check for STATX_DIOALIGN since some FS (e.g. xfs)
can handle direct IOs that are not aligned to the FS block size. See the recent
changes in drivers/block/loop.c to improve direct IO handling, specifically, the
function loop_query_min_dio_size().

-- 
Damien Le Moal
Western Digital Research

Re: [RFC PATCH] nvmet: Make blksize_shift configurable

Posted by Richard Weinberger 9 months, 3 weeks ago

On Freitag, 18. April 2025 12:23 Damien Le Moal wrote:
> On 4/18/25 18:56, Richard Weinberger wrote:
> > On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
> >>> +	if (!ns->blksize_shift)
> >>> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> >>
> >> If the user set logical block size is smaller than the block dev logical block
> >> size, this is not going to work... No ? Am I missing something ?
> > 
> > Likely, yes.
> > TBH, I'm not sure whether it makes actually sense for the bdev case to make
> > blksize_shift configurable.
> 
> Probably not... I do understand the value for the file case though.

The use case is exposing ready-to-use cloud images like:
https://cloud.debian.org/images/cloud/bookworm/20250416-2084/debian-12-generic-amd64-20250416-2084.raw

via NVme-of TCP.
Yesterday I did so and figured that no GPT partitions got detected because of different block sizes.
Setting the block size in nvmet to 512 made it work.

If there are better ways to achieve the same, I'm open for suggestions.

> 
> > The case I see most benefit is the backing file case.
> > 
> >>> +	if (!ns->blksize_shift) {
> >>> +		/*
> >>> +		 * i_blkbits can be greater than the universally accepted
> >>> +		 * upper bound, so make sure we export a sane namespace
> >>> +		 * lba_shift.
> >>> +		 */
> >>> +		ns->blksize_shift = min_t(u8,
> >>> +				file_inode(ns->file)->i_blkbits, 12);
> >>
> >> This will work for any block size, regardless of the FS block size, but only if
> >> ns->buffered_io is true. Doesn't this require some more checks with regards to
> >> O_DIRECT (!ns->buffered_io case) ?
> > 
> > Good catch. I'll add a check.
> 
> And by the way, you need to check for STATX_DIOALIGN since some FS (e.g. xfs)
> can handle direct IOs that are not aligned to the FS block size. See the recent
> changes in drivers/block/loop.c to improve direct IO handling, specifically, the
> function loop_query_min_dio_size().

Ok!

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y