[v3] support non power of 2 zoned devices

[PATCH v3 00/11] support non power of 2 zoned devices

Posted by Pankaj Raghav 4 years ago

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone cap is not the same as the
po2 zone size. When the zone cap != zone size, then unmapped LBAs are
introduced to cover the space between the zone cap and zone size. po2
requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [0].
Additional kernel stop-gap patches are provided in this series for dm-zoned.
Support for npo2 zonefs and btrfs support is addressed in this series.

There was an effort previously [1] to add support to non po2 devices via
device level emulation but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[2].

- Patchset description:
This patchset aims at adding support to non power of 2 zoned devices in
the block layer, nvme layer, null blk and adds support to btrfs and
zonefs.

This round of patches **will not** support DM layer for non
power of 2 zoned devices. More about this in the future work section.

Patches 1-2 deals with removing the po2 constraint from the
block layer.

Patches 3-4 deals with removing the constraint from nvme zns.

Patches 5-8 adds support to btrfs for non po2 zoned devices.

Patch 9 removes the po2 constraint in ZoneFS

Patch 10 removes the po2 contraint in null blk

Patches 11 adds conditions to not allow non power of 2 devices in
DM.

The patch series is based on linux-next tag: next-20220502

- Performance:
PO2 zone sizes utilizes log and shifts instead of division when
determing alignment, zone number, etc. The same math cannot be used when
using a zoned device with non po2 zone size. Hence, to avoid any performance
regression on zoned devices with po2 zone sizes, the optimized math in the
hot paths has been retained with branching.

The performance was measured using null blk for regression
and the results have been posted in the appropriate commit log. No
performance regression was noticed.

- Testing
With respect to testing we need to tackle two things: one for regression
on po2 zoned device and progression on non po2 zoned devices.

kdevops (https://github.com/mcgrof/kdevops) was extensively used to
automate the testing for blktests and (x)fstests for btrfs changes. The
known failures were excluded during the test based on the baseline
v5.17.0-rc7

-- regression
Emulated zoned device with zone size =128M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Changes were tested with the following profile in QEMU:
[btrfs_simple_zns]
TEST_DIR=<dir>
SCRATCH_MNT=<mnt>
FSTYP=btrfs
MKFS_OPTIONS="-f -d single -m single"
TEST_DEV=<dev>
SCRATCH_DEV_POOL=<dev-pool>

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests-nullblk.sh and zonefs-tests.sh from zonefs-tools were run
with no failures.

nullblk:
t/zbd/run-tests-against-nullb from fio was run with no failures.

DM:
It was verified if dm-zoned successfully mounts without any
error.

-- progression
Emulated zoned device with zone size = 96M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Same profile as po2 zone size was used.

Many tests in xfstests for btrfs included dm-flakey and some tests
required dm-linear. As they are not supported at the moment for non
po2 devices, those **tests were excluded for non po2 devices**.

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.

nullblk:
A new section was added to cover non po2 devices:

section14()
{
conv_pcnt=10
zone_size=3
zone_capacity=3
max_open=${set_max_open}
zbd_test_opts+=("-o ${max_open}")
}
t/zbd/run-tests-against-nullb from fio was run with no failures.

DM:
It was verified that dm-zoned does not mount.

- Open issue:
* btrfs superblock location for zoned devices is expected to be in 0,
512GB(mirror) and 4TB(mirror) in the device. Zoned devices with po2
zone size will naturally align with these superblock location but non
po2 devices will not align with 512GB and 4TB offset.

The current approach for npo2 devices is to place the superblock mirror
zones near 512GB and 4TB that is **aligned to the zone size**. This
is of no issue for normal operation as we keep track where the superblock
mirror are placed but this can cause an issue with recovery tools for
zoned devices as they expect mirror superblock to be in 512GB and 4TB.

Note that ATM, recovery tools such as `btrfs check` does not work for
image dumps for zoned devices even for po2 zone sizes.

- Tools:
Some tools had to be updated to support non po2 devices. Once these
patches are accepted in the kernel, these tool updates will also be
upstreamed.
* btrfs-prog: https://github.com/Panky-codes/btrfs-progs/tree/remove-po2-btrfs
* blkzone: https://github.com/Panky-codes/util-linux/tree/remove-po2
* zonefs-tools: https://github.com/Panky-codes/zonefs-tools/tree/remove-po2

- Future work
To reduce the amount of changes and testing, support for DM was
excluded in this round of patches. The plan is to add support to F2FS
and DM in the forthcoming future.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[1] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[2] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
(Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
(bart)
- Add comments where generic calculation is directly used instead having
special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Luis Chamberlain (1):
dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (10):
block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
zsze
block: allow blk-zoned devices to have non-power-of-2 zone size
nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
nvmet: Allow ZNS target to support non-power_of_2 zone sizes
btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
btrfs: zoned: Make sb_zone_number function non power of 2 compatible
btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned
devices
btrfs: zoned: relax the alignment constraint for zoned devices
zonefs: allow non power of 2 zoned devices
null_blk: allow non power of 2 zoned devices

--
2.25.1

Re: [PATCH v3 00/11] support non power of 2 zoned devices

Posted by David Sterba 4 years ago

On Fri, May 06, 2022 at 10:10:54AM +0200, Pankaj Raghav wrote:
> - Open issue:
> * btrfs superblock location for zoned devices is expected to be in 0,
>   512GB(mirror) and 4TB(mirror) in the device. Zoned devices with po2
>   zone size will naturally align with these superblock location but non
>   po2 devices will not align with 512GB and 4TB offset.
> 
>   The current approach for npo2 devices is to place the superblock mirror
>   zones near   512GB and 4TB that is **aligned to the zone size**.

I don't like that, the offsets have been chosen so the values are fixed
and also future proof in case the zone size increases significantly. The
natural alignment of the pow2 zones makes it fairly trivial.

If I understand correctly what you suggest, it would mean that if zone
is eg. 5G and starts at 510G then the superblock should start at 510G,
right? And with another device that has 7G zone size the nearest
multiple is 511G. And so on.

That makes it all less predictable, depending on the physical device
constraints that are affecting the logical data structures of the
filesystem. We tried to avoid that with pow2, the only thing that
depends on the device is that the range from the super block offsets is
always 2 zones.

I really want to keep the offsets for all zoned devices the same and
adapt the code that's handling the writes. This is possible with the
non-pow2 too, the first write is set to the expected offset, leaving the
beginning of the zone unused.

>   This
>   is of no issue for normal operation as we keep track where the superblock
>   mirror are placed but this can cause an issue with recovery tools for
>   zoned devices as they expect mirror superblock to be in 512GB and 4TB.

Yeah the tools need to be updated, btrfs-progs and suite of blk* in
util-linux.

>   Note that ATM, recovery tools such as `btrfs check` does not work for
>   image dumps for zoned devices even for po2 zone sizes.

I thought this worked, but if you find something that does not please
report that to Johannes or Naohiro.

Re: [PATCH v3 00/11] support non power of 2 zoned devices

Posted by Pankaj Raghav 4 years ago

On 2022-05-06 12:00, David Sterba wrote:
>>   The current approach for npo2 devices is to place the superblock mirror
>>   zones near   512GB and 4TB that is **aligned to the zone size**.
> 
> I don't like that, the offsets have been chosen so the values are fixed
> and also future proof in case the zone size increases significantly. The
> natural alignment of the pow2 zones makes it fairly trivial.
> 
> If I understand correctly what you suggest, it would mean that if zone
> is eg. 5G and starts at 510G then the superblock should start at 510G,
> right? And with another device that has 7G zone size the nearest
> multiple is 511G. And so on.
> 
> That makes it all less predictable, depending on the physical device
> constraints that are affecting the logical data structures of the
> filesystem. We tried to avoid that with pow2, the only thing that
> depends on the device is that the range from the super block offsets is
> always 2 zones.
> 
> I really want to keep the offsets for all zoned devices the same and
> adapt the code that's handling the writes. This is possible with the
> non-pow2 too, the first write is set to the expected offset, leaving the
> beginning of the zone unused.
> 
I agree. Having a known place for superblocks is important for recovery
tools. We were thinking along the lines of what you have suggested. I
will add this support in the next revision.
>>   This
>>   is of no issue for normal operation as we keep track where the superblock
>>   mirror are placed but this can cause an issue with recovery tools for
>>   zoned devices as they expect mirror superblock to be in 512GB and 4TB.
> 
> Yeah the tools need to be updated, btrfs-progs and suite of blk* in
> util-linux.
> 
>>   Note that ATM, recovery tools such as `btrfs check` does not work for
>>   image dumps for zoned devices even for po2 zone sizes.
> 
> I thought this worked, but if you find something that does not please
> report that to Johannes or Naohiro.
Ok. Thanks.