From: Yu Kuai <yukuai3@huawei.com>
This is the formal version following the previous RFC version:
https://lore.kernel.org/all/20250512011927.2809400-1-yukuai1@huaweicloud.com/
#### Background
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It is
important to maintain the consistency of this redundant data.
A bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the redundant copies of that data segment may not be consistent. Data
synchronization can then be performed based on the bitmap after a power
failure or after re-adding a disk. Without a bitmap, a full disk
synchronization is required.
#### Key Features
- The IO fastpath is lockless: if the user issues lots of write IO to the same
bitmap bit in a short time, only the first write has additional overhead to
update the bitmap bit; the following writes incur no additional overhead;
- Only written data is resynced or recovered. This means that when creating a
new array or replacing a disk with a new one, there is no need to do a full
disk resync/recovery;
#### Key Concept
##### State Machine
Each bit is one byte and holds one of 6 different states, see llbitmap_state.
There are 8 different actions, see llbitmap_action, that can change the state:
llbitmap state machine: transitions between states
| | Startwrite | Startsync | Endsync | Abortsync|
| --------- | ---------- | --------- | ------- | ------- |
| Unwritten | Dirty | x | x | x |
| Clean | Dirty | x | x | x |
| Dirty | x | x | x | x |
| NeedSync | x | Syncing | x | x |
| Syncing | x | Syncing | Dirty | NeedSync |
| | Reload | Daemon | Discard | Stale |
| --------- | -------- | ------ | --------- | --------- |
| Unwritten | x | x | x | x |
| Clean | x | x | Unwritten | NeedSync |
| Dirty | NeedSync | Clean | Unwritten | NeedSync |
| NeedSync | x | x | Unwritten | x |
| Syncing | NeedSync | x | Unwritten | NeedSync |
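To make the tables above concrete, here is a minimal standalone C sketch of
such a transition table (all names below are hypothetical illustrations that
simply mirror the two tables; they are not the actual llbitmap_state /
llbitmap_action definitions in md-llbitmap.c):

/*
 * Illustrative sketch only: hypothetical names, not the in-kernel
 * llbitmap_state/llbitmap_action definitions.
 */
#include <stdio.h>

enum bit_state { Unwritten, Clean, Dirty, NeedSync, Syncing, NR_STATE,
		 None = NR_STATE };		/* None stands for 'x' above */
enum bit_action { StartWrite, StartSync, EndSync, AbortSync,
		  Reload, Daemon, Discard, Stale, NR_ACTION };

static const enum bit_state state_machine[NR_STATE][NR_ACTION] = {
	/*             StartWrite StartSync EndSync AbortSync Reload    Daemon Discard    Stale    */
	[Unwritten] = { Dirty,    None,     None,   None,     None,     None,  None,      None     },
	[Clean]     = { Dirty,    None,     None,   None,     None,     None,  Unwritten, NeedSync },
	[Dirty]     = { None,     None,     None,   None,     NeedSync, Clean, Unwritten, NeedSync },
	[NeedSync]  = { None,     Syncing,  None,   None,     None,     None,  Unwritten, None     },
	[Syncing]   = { None,     Syncing,  Dirty,  NeedSync, NeedSync, None,  Unwritten, NeedSync },
};

int main(void)
{
	/* Walk the common path: new write, then the daemon clears the bit. */
	enum bit_state s = state_machine[Unwritten][StartWrite];	/* Dirty */

	s = state_machine[s][Daemon];					/* Clean */
	printf("final state: %s\n", s == Clean ? "Clean" : "other");
	return 0;
}

Indexing the table with the current state and the action yields the next
state; 'None' marks the 'x' entries where no transition happens.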
Typical scenarios:
1) Create new array
All bits are set to Unwritten by default; if --assume-clean is set,
all bits are set to Clean instead.
2) write data; raid1/raid10 have a full copy of the data, while raid456
doesn't and relies on xor data
2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty
2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync
Because the initial recovery for raid456 is skipped, the xor data is not built
yet; the bit must be set to NeedSync first, and after the lazy initial recovery
is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);
2.3) overwrite existing data
Clean --StartWrite--> Dirty
3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean
For a degraded array, the Dirty bit will never be cleared; this prevents a
full disk recovery when re-adding a removed disk.
4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
5) resync and recover
5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
5.2) resync after power failure
Dirty --Reload--> NeedSync
5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data; llbitmap
implements this with a new helper, see llbitmap_skip_sync_blocks:

skip recovery for bits other than Dirty or Clean;
5.4) lazy initial recover for raid5:
By default, the old bitmap framework will only allow a new recovery when there
are spares (new disks); a new recovery flag MD_RECOVERY_LAZY_RECOVER is added
to perform raid456 lazy recovery for set bits (from 2.2).
##### Bitmap IO
##### Chunksize
The default bitmap size is 128k, including a 1k bitmap super block, and the
default size of the data segment represented by each bit (the chunksize) is
64k. The chunksize is repeatedly doubled while the total number of bits is
not less than 127k (see llbitmap_init).
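As a rough illustration of that sizing rule, a standalone C sketch follows
(the constants and the helper name are assumptions for illustration only,
not the actual llbitmap_init code):

#include <stdio.h>

#define BITMAP_MAX_BITS		(127 * 1024)	/* 128k bitmap minus 1k super block, one byte per bit */
#define DEFAULT_CHUNKSIZE	(64 * 1024ULL)	/* data covered by one bit, initially 64k */

/* Double the chunksize until the number of bits drops below 127k. */
static unsigned long long pick_chunksize(unsigned long long array_bytes)
{
	unsigned long long chunksize = DEFAULT_CHUNKSIZE;

	while ((array_bytes + chunksize - 1) / chunksize >= BITMAP_MAX_BITS)
		chunksize *= 2;

	return chunksize;
}

int main(void)
{
	/* Example: the 20GB test array from the performance numbers below. */
	unsigned long long array = 20ULL << 30;
	unsigned long long chunksize = pick_chunksize(array);

	printf("chunksize %lluk, %llu bits used\n",
	       chunksize >> 10, (array + chunksize - 1) / chunksize);
	return 0;
}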
##### READ
When the bitmap is created, all pages are allocated and read for llbitmap;
there are no further reads afterwards.
##### WRITE
Bitmap WRITE IO is divided into blocks of the array's logical_block_size, and
the dirty state of each block is tracked independently. For example, with 4k
pages each page contains 8 blocks, and each block is 512 bytes and contains
512 bits:
| page0 | page1 | ... | page 31 |
    |
    +--> | block0 | block1 | ... | block7 |
              |
              +--> | bit0 | bit1 | ... | bit511 |
In the IO path, if one bit is changed to Dirty or NeedSync, the corresponding
subpage is marked dirty, and such a block must be written out before the data
IO is issued. This behaviour affects IO performance; to reduce the impact, if
multiple bits are changed in the same block in a short time, all bits in this
block are changed to Dirty/NeedSync, so that there is no further overhead
until the daemon clears the dirty bits.
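The following standalone C sketch roughly mirrors that behaviour (all
structure and function names are hypothetical; the real code tracks this per
page through llbitmap_page_ctl):

#include <stdatomic.h>
#include <string.h>

#define BLOCK_SIZE		512			/* logical block size of the array */
#define BITMAP_PAGE_SIZE	4096
#define BLOCKS_PER_PAGE		(BITMAP_PAGE_SIZE / BLOCK_SIZE)

/* Hypothetical per-page state: one byte per bit, one dirty flag per block. */
struct bitmap_page {
	unsigned char	bits[BITMAP_PAGE_SIZE];
	atomic_uchar	block_dirty[BLOCKS_PER_PAGE];
};

/*
 * Returns 1 if the block must be written out before the data IO is issued
 * (the first change to this block), 0 otherwise.  A later change while the
 * block is still dirty "infects" every bit in the block, so subsequent
 * writes find their bit already set and pay no bitmap overhead until the
 * daemon clears the dirty bits.
 */
int mark_bit(struct bitmap_page *page, unsigned int bit, unsigned char new_state)
{
	unsigned int block = bit / BLOCK_SIZE;

	if (!atomic_exchange(&page->block_dirty[block], 1)) {
		page->bits[bit] = new_state;
		return 1;
	}

	memset(&page->bits[block * BLOCK_SIZE], new_state, BLOCK_SIZE);
	return 0;
}

In this sketch the first change to a block reports that the block must be
flushed, while a change to an already-dirty block infects the whole block.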
##### Dirty Bits synchronization
The IO fast path sets bits to Dirty, and those dirty bits are cleared by the
daemon after the IO is done. llbitmap_page_ctl is used to synchronize between
the IO path and the daemon;

IO path:
1) try to grab a reference; if that succeeds, set the expire time to 5s later
and return;
2) if grabbing a reference fails, wait for the daemon to finish clearing dirty
bits;
Daemon (woken up every daemon_sleep seconds), for each page (see the sketch
after this list):
1) check if the page has expired; if not, skip this page; for an expired page:
2) suspend the page and wait for inflight write IO to be done;
3) change the dirty page to clean;
4) resume the page;
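A simplified standalone C sketch of this handshake (names are hypothetical,
and a plain busy-wait stands in for the real wait/wake mechanism around
llbitmap_page_ctl):

#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define PAGE_EXPIRE_SECONDS	5

/* Hypothetical per-page synchronization state. */
struct page_sync {
	atomic_int	writers;	/* inflight write IO holding the page */
	atomic_bool	suspended;	/* daemon is clearing this page */
	atomic_long	expire;		/* earliest time the daemon may act */
};

/* IO path: grab a reference unless the daemon has suspended the page. */
bool io_path_enter(struct page_sync *p)
{
	atomic_fetch_add(&p->writers, 1);
	if (atomic_load(&p->suspended)) {
		atomic_fetch_sub(&p->writers, 1);
		return false;		/* caller waits for the daemon instead */
	}
	atomic_store(&p->expire, (long)time(NULL) + PAGE_EXPIRE_SECONDS);
	return true;
}

void io_path_exit(struct page_sync *p)
{
	atomic_fetch_sub(&p->writers, 1);
}

/* Daemon, woken every daemon_sleep seconds, runs this for each page. */
void daemon_work(struct page_sync *p)
{
	if ((long)time(NULL) < atomic_load(&p->expire))
		return;					/* 1) not expired yet */

	atomic_store(&p->suspended, true);		/* 2) suspend the page */
	while (atomic_load(&p->writers) > 0)
		;					/*    wait for inflight writes */

	/* 3) here the real code rewrites Dirty bits in this page to Clean */

	atomic_store(&p->suspended, false);		/* 4) resume the page */
}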
Performance Test:
A simple fio randwrite test against arrays built from 20GB ramdisks in my VM:
| | none | bitmap | llbitmap |
| -------------------- | --------- | --------- | --------- |
| raid1 | 13.7MiB/s | 9696KiB/s | 19.5MiB/s |
| raid1(assume clean) | 19.5MiB/s | 11.9MiB/s | 19.5MiB/s |
| raid10 | 21.9MiB/s | 11.6MiB/s | 27.8MiB/s |
| raid10(assume clean) | 27.8MiB/s | 15.4MiB/s | 27.8MiB/s |
| raid5 | 14.0MiB/s | 11.6MiB/s | 12.9MiB/s |
| raid5(assume clean) | 17.8MiB/s | 13.4MiB/s | 13.9MiB/s |
For raid1/raid10, llbitmap can be better than no bitmap while a background
initial resync is running, and it is the same as no bitmap without one.

Note that the llbitmap performance improvement for raid5 is not obvious; this
is because raid5 has many other performance bottlenecks. Still, perf results
show that the bitmap overhead is much lower.
The following branch is available for review or test:
https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/md-llbitmap
Yu Kuai (23):
md: add a new parameter 'offset' to md_super_write()
md: factor out a helper raid_is_456()
md/md-bitmap: cleanup bitmap_ops->startwrite()
md/md-bitmap: support discard for bitmap ops
md/md-bitmap: remove parameter slot from bitmap_create()
md/md-bitmap: add a new sysfs api bitmap_type
md/md-bitmap: delay registration of bitmap_ops until creating bitmap
md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
md/md-bitmap: add a new method blocks_synced() in bitmap_operations
md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
md/md-bitmap: make method bitmap_ops->daemon_work optional
md/md-bitmap: add macros for lockless bitmap
md/md-bitmap: fix dm-raid max_write_behind setting
md/dm-raid: remove max_write_behind setting limit
md/md-llbitmap: implement llbitmap IO
md/md-llbitmap: implement bit state machine
md/md-llbitmap: implement APIs for page level dirty bits
synchronization
md/md-llbitmap: implement APIs to manage bitmap lifetime
md/md-llbitmap: implement APIs to dirty bits and clear bits
md/md-llbitmap: implement APIs for sync_thread
md/md-llbitmap: implement all bitmap operations
md/md-llbitmap: implement sysfs APIs
md/md-llbitmap: add Kconfig
Documentation/admin-guide/md.rst | 80 +-
drivers/md/Kconfig | 11 +
drivers/md/Makefile | 2 +-
drivers/md/dm-raid.c | 6 +-
drivers/md/md-bitmap.c | 50 +-
drivers/md/md-bitmap.h | 55 +-
drivers/md/md-llbitmap.c | 1556 ++++++++++++++++++++++++++++++
drivers/md/md.c | 247 +++--
drivers/md/md.h | 20 +-
drivers/md/raid5.c | 6 +
10 files changed, 1901 insertions(+), 132 deletions(-)
create mode 100644 drivers/md/md-llbitmap.c
--
2.39.2
Hi,

On 2025/05/24 14:12, Yu Kuai wrote:
> following branch for review or test:
> https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/md-llbitmap

The correct branch is:

https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/llbitmap

Thanks,
Kuai
Hi,

On 2025/05/24 14:12, Yu Kuai wrote:
> Yu Kuai (23):
>   md: add a new parameter 'offset' to md_super_write()
>   md: factor out a helper raid_is_456()
>   md/md-bitmap: cleanup bitmap_ops->startwrite()
>   md/md-bitmap: support discard for bitmap ops
>   md/md-bitmap: remove parameter slot from bitmap_create()
>   md/md-bitmap: add a new sysfs api bitmap_type
>   md/md-bitmap: delay registration of bitmap_ops until creating bitmap
>   md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
>   md/md-bitmap: add a new method blocks_synced() in bitmap_operations
>   md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
>   md/md-bitmap: make method bitmap_ops->daemon_work optional
>   md/md-bitmap: add macros for lockless bitmap
>   md/md-bitmap: fix dm-raid max_write_behind setting
>   md/dm-raid: remove max_write_behind setting limit
>   md/md-llbitmap: implement llbitmap IO
>   md/md-llbitmap: implement bit state machine
>   md/md-llbitmap: implement APIs for page level dirty bits
>     synchronization
>   md/md-llbitmap: implement APIs to mange bitmap lifetime
>   md/md-llbitmap: implement APIs to dirty bits and clear bits
>   md/md-llbitmap: implement APIs for sync_thread
>   md/md-llbitmap: implement all bitmap operations
>   md/md-llbitmap: implement sysfs APIs
>   md/md-llbitmap: add Kconfig

Patches 3, 13 and 14 are applied to md-6.16; they are not related to the new
bitmap:

md/md-bitmap: cleanup bitmap_ops->startwrite()
md/md-bitmap: fix dm-raid max_write_behind setting
md/dm-raid: remove max_write_behind setting limit

Thanks,
Kuai
On 2025/5/24 2:12 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> This is the formal version after previous RFC version:
>
> https://lore.kernel.org/all/20250512011927.2809400-1-yukuai1@huaweicloud.com/
>
> #### Background
>
> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data vary depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
>
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk
> synchronization is required.
Hi Kuai
>
> #### Key Features
>
> - IO fastpath is lockless, if user issues lots of write IO to the same
> bitmap bit in a short time, only the first write have additional overhead
> to update bitmap bit, no additional overhead for the following writes;
After reading other patches, I want to check if I understand right.
The first write sets the bitmap bit. The second write which hits the
same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
to set all other bits. Then the third write doesn't need to set bitmap
bits. If I'm right, the comments above should say only the first two
writes have additional overhead?
> - support only resync or recover written data, means in the case creating
> new array or replacing with a new disk, there is no need to do a full disk
> resync/recovery;
>
> #### Key Concept
>
> ##### State Machine
>
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> there are total 8 differenct actions, see llbitmap_action, can change state:
>
> llbitmap state machine: transitions between states
>
> | | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | ------- |
> | Unwritten | Dirty | x | x | x |
> | Clean | Dirty | x | x | x |
> | Dirty | x | x | x | x |
> | NeedSync | x | Syncing | x | x |
> | Syncing | x | Syncing | Dirty | NeedSync |
>
> | | Reload | Daemon | Discard | Stale |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x | x | x | x |
> | Clean | x | x | Unwritten | NeedSync |
> | Dirty | NeedSync | Clean | Unwritten | NeedSync |
> | NeedSync | x | x | Unwritten | x |
> | Syncing | NeedSync | x | Unwritten | NeedSync |
For Reload action, if the bitmap bit is NeedSync, the changed status
will be x. It can't trigger resync/recovery.
For example:
cat /sys/block/md127/md/llbitmap/bits
unwritten 3480
clean 2
dirty 0
need sync 510
It doesn't do a resync after assembling the array. Does it need to modify
the changed status from x to NeedSync?
Best Regards
Xiao
Hi,

On 2025/06/30 9:59, Xiao Ni wrote:
>
> After reading other patches, I want to check if I understand right.
>
> The first write sets the bitmap bit. The second write which hits the
> same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
> to set all other bits. Then the third write doesn't need to set bitmap
> bits. If I'm right, the comments above should say only the first two
> writes have additional overhead?

Yes, for the same bit, it's twice; for different bits in the same block,
it's three times, because the second write infects all bits in the block.

> For Reload action, if the bitmap bit is
> NeedSync, the changed status will be x. It can't trigger resync/recovery.

This is not expected, see llbitmap_state_machine(); if the old or new state
is need_sync, it will trigger a resync:

c = llbitmap_read(llbitmap, start);
if (c == BitNeedSync)
        need_resync = true;
-> for the RELOAD case, need_resync is still set.

state = state_machine[c][action];
if (state == BitNone)
        continue
if (state == BitNeedSync)
        need_resync = true;

>
> For example:
>
> cat /sys/block/md127/md/llbitmap/bits
> unwritten 3480
> clean 2
> dirty 0
> need sync 510
>
> It doesn't do resync after aseembling the array. Does it need to modify
> the changed status from x to NeedSync?

Can you explain in detail how to reproduce this? Assembling in my VM is
fine.

Thanks,
Kuai
On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2025/06/30 9:59, Xiao Ni wrote:
> >
> > After reading other patches, I want to check if I understand right.
> >
> > The first write sets the bitmap bit. The second write which hits the
> > same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
> > to set all other bits. Then the third write doesn't need to set bitmap
> > bits. If I'm right, the comments above should say only the first two
> > writes have additional overhead?
>
> Yes, for the same bit, it's twice; for different bits in the same block,
> it's three times, because the second write infects all bits in the block.

For different bits in the same block, test_and_set_bit(bit,
pctl->dirty) should be true too, right? So it infects other bits when
the second write hits the same block too.

[946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
[946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
[946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
[946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
[946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits

As the debug logs show, for different bits in the same block, the second
write (offset 2025) infects other bits.

> > For Reload action, if the bitmap bit is
> > NeedSync, the changed status will be x. It can't trigger resync/recovery.
>
> This is not expected, see llbitmap_state_machine(); if the old or new state
> is need_sync, it will trigger a resync:
>
> c = llbitmap_read(llbitmap, start);
> if (c == BitNeedSync)
>         need_resync = true;
> -> for the RELOAD case, need_resync is still set.
>
> state = state_machine[c][action];
> if (state == BitNone)
>         continue

If the bitmap bit is BitNeedSync,
state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so
if (state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
can't start a sync after assembling the array.

> if (state == BitNeedSync)
>         need_resync = true;
>
> >
> > For example:
> >
> > cat /sys/block/md127/md/llbitmap/bits
> > unwritten 3480
> > clean 2
> > dirty 0
> > need sync 510
> >
> > It doesn't do resync after aseembling the array. Does it need to modify
> > the changed status from x to NeedSync?
>
> Can you explain in detail how to reproduce this? Assembling in my VM is
> fine.

I added many debug logs, so the sync request runs slowly. The test I do:

mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
mdadm --stop /dev/md0 (the sync thread finishes the region that two
bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
sync))
mdadm -As

Regards
Xiao

> Thanks,
> Kuai
Hi,

On 2025/06/30 11:25, Xiao Ni wrote:
> On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> Yes, for the same bit, it's twice; for different bits in the same block,
>> it's three times, because the second write infects all bits in the block.
>
> For different bits in the same block, test_and_set_bit(bit,
> pctl->dirty) should be true too, right? So it infects other bits when
> the second write hits the same block too.

The dirty flag will be cleared after bitmap_unplug.

> [946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
> [946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
> [946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
> [946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
> [946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits
>
> As the debug logs show, for different bits in the same block, the second
> write (offset 2025) infects other bits.
>
>> This is not expected, see llbitmap_state_machine(); if the old or new state
>> is need_sync, it will trigger a resync:
>>
>> c = llbitmap_read(llbitmap, start);
>> if (c == BitNeedSync)
>>         need_resync = true;
>> -> for the RELOAD case, need_resync is still set.
>>
>> state = state_machine[c][action];
>> if (state == BitNone)
>>         continue
>
> If the bitmap bit is BitNeedSync,
> state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so
> if (state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
> can't start a sync after assembling the array.

You missed what I said above: llbitmap_read() will trigger a resync as
well.

>> Can you explain in detail how to reproduce this? Assembling in my VM is
>> fine.
>
> I added many debug logs, so the sync request runs slowly. The test I do:
>
> mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
> dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
> mdadm --stop /dev/md0 (the sync thread finishes the region that two
> bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
> sync))
> mdadm -As

I don't quite understand; in my case, mdadm -As works fine.

Thanks,
Kuai
On Mon, Jun 30, 2025 at 11:46 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2025/06/30 11:25, Xiao Ni wrote:
> > On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2025/06/30 9:59, Xiao Ni wrote:
> >>>
> >>> After reading other patches, I want to check if I understand right.
> >>>
> >>> The first write sets the bitmap bit. The second write which hits the
> >>> same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
> >>> to set all other bits. Then the third write doesn't need to set bitmap
> >>> bits. If I'm right, the comments above should say only the first two
> >>> writes have additional overhead?
> >>
> >> Yes, for the same bit, it's twice; For different bit in the same block,
> >> it's third, by infect all bits in the block in the second.
> >
> > For different bits in the same block, test_and_set_bit(bit,
> > pctl->dirty) should be true too, right? So it infects other bits when
> > second write hits the same block too.
>
> The dirty will be cleared after bitmap_unplug.
I understand you now. The for loop in llbitmap_set_page_dirty is used
for new writes after unplug.
> >
> > [946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
> > [946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
> > [946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
> > [946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
> > [946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits
> >
> > As the debug logs show, different bits in the same block, the second
> > write (offset 2025) infects other bits.
> >
> >>
> >> For Reload action, if the bitmap bit is
> >>> NeedSync, the changed status will be x. It can't trigger resync/recovery.
> >>
> >> This is not expected, see llbitmap_state_machine(), if old or new state
> >> is need_sync, it will trigger a resync.
> >>
> >> c = llbitmap_read(llbitmap, start);
> >> if (c == BitNeedSync)
> >> need_resync = true;
> >> -> for RELOAD case, need_resync is still set.
> >>
> >> state = state_machine[c][action];
> >> if (state == BitNone)
> >> continue
> >
> > If bitmap bit is BitNeedSync,
> > state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so if
> > (state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
> > can't start sync after assembling the array.
>
> You missed what I said above that llbitmap_read() will trigger resync as
> well.
> >
> >> if (state == BitNeedSync)
> >> need_resync = true;
> >>
> >>>
> >>> For example:
> >>>
> >>> cat /sys/block/md127/md/llbitmap/bits
> >>> unwritten 3480
> >>> clean 2
> >>> dirty 0
> >>> need sync 510
> >>>
> >>> It doesn't do resync after aseembling the array. Does it need to modify
> >>> the changed status from x to NeedSync?
> >>
> >> Can you explain in detail how to reporduce this? Aseembling in my VM is
> >> fine.
> >
> > I added many debug logs, so the sync request runs slowly. The test I do:
> > mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
> > dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
> > mdadm --stop /dev/md0 (the sync thread finishes the region that two
> > bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
> > sync))
> > mdadm -As
>
> I don't quite understand, in my case, mdadm -As works fine.
Sorry for this, I forgot that I had removed this code in function llbitmap_state_machine:
//if (c == BitNeedSync)
// need_resync = true;
The reason I did this: I found that if the status table is changed like this, it
doesn't need to check the original status anymore:
- [BitmapActionReload] = BitNone,
+ [BitmapActionReload] = BitNeedSync,//?
Regards
Xiao
Hi,

On 2025/06/30 13:38, Xiao Ni wrote:
>> I don't quite understand; in my case, mdadm -As works fine.
>
> Sorry for this, I forgot that I had removed this code in function llbitmap_state_machine:
>
> //if (c == BitNeedSync)
> //      need_resync = true;

Ok.

> The reason I did this: I found that if the status table is changed like this, it
> doesn't need to check the original status anymore:
>
> - [BitmapActionReload] = BitNone,
> + [BitmapActionReload] = BitNeedSync,//?

However, we don't want to dirty the bitmap page in this case, as nothing
changed in the bitmap. And because of this, we have to check the old value
anyway...

Thanks,
Kuai