From: Yu Kuai <yukuai3@huawei.com>
Each bit is one byte and can be in one of 6 different states, see
llbitmap_state. There are 8 different actions, see llbitmap_action, that
can change the state:
llbitmap state machine: transitions between states
| | Startwrite | Startsync | Endsync | Abortsync | Reload | Daemon | Discard | Stale |
| --------- | ---------- | --------- | ------- | --------- | -------- | ------ | --------- | --------- |
| Unwritten | Dirty | x | x | x | x | x | x | x |
| Clean | Dirty | x | x | x | x | x | Unwritten | NeedSync |
| Dirty | x | x | x | x | NeedSync | Clean | Unwritten | NeedSync |
| NeedSync | x | Syncing | x | x | x | x | Unwritten | x |
| Syncing | x | Syncing | Dirty | NeedSync | NeedSync | x | Unwritten | NeedSync |
Typical scenarios:
1) Create new array
All bits are set to Unwritten by default; if --assume-clean is set,
all bits are set to Clean instead.
2) write data: raid1/raid10 have a full copy of the data, while raid456
doesn't and relies on xor data
2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty
2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync
Because the initial recovery for raid456 is skipped, the xor data is not
built yet; the bit must be set to NeedSync first, and after the lazy initial
recovery is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);
2.3) overwrite existing data
Clean --StartWrite--> Dirty
3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean
For a degraded array, the Dirty bit will never be cleared, preventing
full-disk recovery when re-adding a removed disk.
4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
5) resync and recover
5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
5.2) resync after power failure
Dirty --Reload--> NeedSync
5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data; llbitmap
implements this via a new helper, llbitmap_skip_sync_blocks, which
skips recovery for bits other than Dirty or Clean;
5.4) lazy initial recover for raid5:
By default, the old bitmap framework will only allow new recover when there
are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
to perform raid456 lazy recover for set bits(from 2.2).
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-llbitmap.c | 83 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1a01b6777527..f782f092ab5d 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -568,4 +568,87 @@ static int llbitmap_cache_pages(struct llbitmap *llbitmap)
return 0;
}
+static void llbitmap_init_state(struct llbitmap *llbitmap)
+{
+ enum llbitmap_state state = BitUnwritten;
+ unsigned long i;
+
+ if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
+ state = BitClean;
+
+ for (i = 0; i < llbitmap->chunks; i++)
+ llbitmap_write(llbitmap, state, i);
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+ unsigned long start,
+ unsigned long end,
+ enum llbitmap_action action)
+{
+ struct mddev *mddev = llbitmap->mddev;
+ enum llbitmap_state state = BitNone;
+ bool need_resync = false;
+ bool need_recovery = false;
+
+ if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+ return BitNone;
+
+ if (action == BitmapActionInit) {
+ llbitmap_init_state(llbitmap);
+ return BitNone;
+ }
+
+ while (start <= end) {
+ enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+ if (c < 0 || c >= nr_llbitmap_state) {
+ pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+ __func__, start, c, action);
+ state = BitNeedSync;
+ goto write_bitmap;
+ }
+
+ if (c == BitNeedSync)
+ need_resync = true;
+
+ state = state_machine[c][action];
+ if (state == BitNone) {
+ start++;
+ continue;
+ }
+
+write_bitmap:
+ /* Delay raid456 initial recovery to first write. */
+ if (c == BitUnwritten && state == BitDirty &&
+ action == BitmapActionStartwrite && raid_is_456(mddev)) {
+ state = BitNeedSync;
+ need_recovery = true;
+ }
+
+ llbitmap_write(llbitmap, state, start);
+
+ if (state == BitNeedSync)
+ need_resync = true;
+ else if (state == BitDirty &&
+ !timer_pending(&llbitmap->pending_timer))
+ mod_timer(&llbitmap->pending_timer,
+ jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+ start++;
+ }
+
+ if (need_recovery) {
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
+ md_wakeup_thread(mddev->thread);
+ } else if (need_resync) {
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+ md_wakeup_thread(mddev->thread);
+ }
+
+ return state;
+}
+
#endif /* CONFIG_MD_LLBITMAP */
--
2.39.2
On 2025/5/24 2:13 PM, Yu Kuai wrote:
> [...]
> + state = state_machine[c][action];
> + if (state == BitNone) {
> + start++;
> + continue;
> + }
For reload action, it runs continue here.
And doesn't it need a lock when reading the state?
> +
> +write_bitmap:
> + /* Delay raid456 initial recovery to first write. */
> + if (c == BitUnwritten && state == BitDirty &&
> + action == BitmapActionStartwrite && raid_is_456(mddev)) {
> + state = BitNeedSync;
> + need_recovery = true;
> + }
> +
> + llbitmap_write(llbitmap, state, start);
Same question here, doesn't it need a lock when writing bitmap bits?
Regards
Xiao
Hi,

On 2025/06/30 10:14, Xiao Ni wrote:
> For reload action, it runs continue here.

Nothing can run concurrently with reload.

>
> And doesn't it need a lock when reading the state?

Notice that from the IO path, all concurrent contexts are doing the same
thing, so it doesn't matter whether the old or the new state is read. If
the old state is read, the new state will be written to memory again; if
the new state is read, nothing more is done.

Thanks,
Kuai
On Mon, Jun 30, 2025 at 10:25 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Notice that from the IO path, all concurrent contexts are doing the same
> thing, so it doesn't matter whether the old or the new state is read. If
> the old state is read, the new state will be written to memory again; if
> the new state is read, nothing more is done.

Hi Kuai

This is the last place that I don't understand well. Is the reason that
it only changes one byte at a time, and the system can guarantee
atomicity when updating one byte?

If so, it only needs to be concerned with the old and new data you
mentioned above. For example: a raid1 is created without --assume-clean,
so all bits are BitUnwritten. A write bio comes, and the bit changes to
Dirty. Then a discard is submitted in another CPU context and reads the
old state, Unwritten. From the state change table, the discard doesn't
do anything. In fact, the discard should update Dirty to Unwritten. Can
such a case happen?

Regards
Xiao
Hi,

On 2025/06/30 16:25, Xiao Ni wrote:
> [...]
> A write bio comes, and the bit changes to Dirty. Then a discard is
> submitted in another CPU context and reads the old state, Unwritten.
> From the state change table, the discard doesn't do anything. In fact,
> the discard should update Dirty to Unwritten. Can such a case happen?

This can happen for a raw disk; however, with a filesystem, discard and
write can never race. And for a raw disk, if the user really issues write
and discard concurrently, the result is genuinely uncertain, and it's
fine for the bit to end up either Dirty or Unwritten.

Thanks,
Kuai
On Mon, Jun 30, 2025 at 7:05 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> This can happen for a raw disk; however, with a filesystem, discard and
> write can never race.

Hi Kuai

If there is a filesystem and the write IO has returned, must the discard
see the memory changes without any memory barrier APIs?

Regards
Xiao
Hi,

On 2025/07/01 9:55, Xiao Ni wrote:
> If there is a filesystem and the write IO has returned, must the discard
> see the memory changes without any memory barrier APIs?

It's the filesystem itself that should manage free blocks, and guarantee
that discard is only issued to free blocks that are not in use at all.

Thanks,
Kuai
On Tue, Jul 1, 2025 at 10:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> It's the filesystem itself that should manage free blocks, and guarantee
> that discard is only issued to free blocks that are not in use at all.

Hi Kuai

Thanks for all the explanations and your patience.

Regards
Xiao
On 2025/06/30 19:05, Yu Kuai wrote:
> Is the reason that it only changes one byte at a time, and the system
> can guarantee atomicity when updating one byte?

I think it's not atomic; I don't use the atomic API here. All normal
writes are always changing the byte to the same value, so if the old
value is read by concurrent writes, this byte will just be written
multiple times, and I don't see any problems with that. It's a pure
memory operation.

Thanks,
Kuai