RFC

This is a different approach compared to [1]. Instead of
using blk plug API to batch writeback bios, we just keep
submitting them and track the availability of done/idle requests
(we still use a pool of requests, to put a constraint on
memory usage). The intuition is that blk plug API is good
for sequential IO patterns, but zram writeback is more
likely to use random IO patterns.

I only did minimal testing so far (in a VM). More testing
(on real H/W) is needed, any help is highly appreciated.

[1] https://lore.kernel.org/linux-kernel/20251118073000.1928107-1-senozhatsky@chromium.org

v3 -> v4:
- do not use blk plug API

Sergey Senozhatsky (6):
  zram: introduce writeback bio batching
  zram: add writeback batch size device attr
  zram: take write lock in wb limit store handlers
  zram: drop wb_limit_lock
  zram: rework bdev block allocation
  zram: read slot block idx under slot lock

 drivers/block/zram/zram_drv.c | 470 ++++++++++++++++++++++++++--------
 drivers/block/zram/zram_drv.h |   2 +-
 2 files changed, 364 insertions(+), 108 deletions(-)

-- 
2.52.0.rc1.455.g30608eb744-goog
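[ For reference, a rough sketch of the batching scheme described above.
  The field names (num_inflight, done_reqs, done_wait) match the hunks
  quoted later in this thread, but the struct names and exact layout here
  are guesses, not the actual patch. ]

---
struct zram_wb_req {
	struct page		*page;		/* decompressed data to write back */
	struct list_head	entry;		/* on ->idle_reqs or ->done_reqs */
	struct bio		bio;		/* pre-allocated writeback bio */
};

struct zram_wb_ctl {
	struct list_head	idle_reqs;	/* free requests, ready to submit */
	struct list_head	done_reqs;	/* completed, awaiting post-processing */
	spinlock_t		done_lock;
	atomic_t		num_inflight;	/* submitted, not yet completed */
	wait_queue_head_t	done_wait;
};

/*
 * bio end_io callback: park the finished request on ->done_reqs and wake
 * the writeback loop, which recycles it (and drops ->num_inflight) in
 * zram_complete_done_reqs().
 */
static void zram_wb_end_io(struct bio *bio)
{
	struct zram_wb_req *req = container_of(bio, struct zram_wb_req, bio);
	struct zram_wb_ctl *wb_ctl = bio->bi_private;
	unsigned long flags;

	spin_lock_irqsave(&wb_ctl->done_lock, flags);
	list_add_tail(&req->entry, &wb_ctl->done_reqs);
	spin_unlock_irqrestore(&wb_ctl->done_lock, flags);

	wake_up(&wb_ctl->done_wait);
}
---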
On Fri, 21 Nov 2025 00:21:20 +0900, Sergey Senozhatsky wrote:
> This is a different approach compared to [1]. Instead of
> using blk plug API to batch writeback bios, we just keep
> submitting them and track the availability of done/idle requests
> (we still use a pool of requests, to put a constraint on
> memory usage). The intuition is that blk plug API is good
> for sequential IO patterns, but zram writeback is more
> likely to use random IO patterns.
>
> I only did minimal testing so far (in a VM). More testing
> (on real H/W) is needed, any help is highly appreciated.

I conducted a test on an NVMe host. When all requests were random,
this fix was indeed a bit faster than the previous one.

before:
real    0m0.261s
user    0m0.000s
sys     0m0.243s

real    0m0.260s
user    0m0.000s
sys     0m0.244s

real    0m0.259s
user    0m0.000s
sys     0m0.243s

after:
real    0m0.322s
user    0m0.000s
sys     0m0.214s

real    0m0.326s
user    0m0.000s
sys     0m0.206s

real    0m0.325s
user    0m0.000s
sys     0m0.215s

This result is something to be happy about. However, I'm also quite
curious about the test results on devices like UFS, which have
relatively little internal memory.
On (25/11/21 15:14), Yuwen Chen wrote:
> On Fri, 21 Nov 2025 00:21:20 +0900, Sergey Senozhatsky wrote:
> > This is a different approach compared to [1]. Instead of
> > using blk plug API to batch writeback bios, we just keep
> > submitting them and track the availability of done/idle requests
> > (we still use a pool of requests, to put a constraint on
> > memory usage). The intuition is that blk plug API is good
> > for sequential IO patterns, but zram writeback is more
> > likely to use random IO patterns.
> >
> > I only did minimal testing so far (in a VM). More testing
> > (on real H/W) is needed, any help is highly appreciated.
>
> I conducted a test on an NVMe host. When all requests were random,
> this fix was indeed a bit faster than the previous one.

Is "before" the blk-plug based approach and "after" this new approach?

> before:
> real    0m0.261s
> user    0m0.000s
> sys     0m0.243s
>
> real    0m0.260s
> user    0m0.000s
> sys     0m0.244s
>
> real    0m0.259s
> user    0m0.000s
> sys     0m0.243s
>
> after:
> real    0m0.322s
> user    0m0.000s
> sys     0m0.214s
>
> real    0m0.326s
> user    0m0.000s
> sys     0m0.206s
>
> real    0m0.325s
> user    0m0.000s
> sys     0m0.215s

Hmm, that's less than was anticipated.
On Fri, 21 Nov 2025 16:32:27 +0900, Sergey Senozhatsky wrote:
> Is "before" the blk-plug based approach and "after" this new approach?

Sorry, I got the before and after mixed up.

In addition, I also have some related questions:

1. Will page fault exceptions be delayed during the writeback processing?

2. Since the loop device uses a work queue to handle requests, when
the system load is relatively high, will it have a relatively large
impact on the latency of page fault exceptions? Is there any way to solve
this problem?
On (25/11/21 15:44), Yuwen Chen wrote:
> On Fri, 21 Nov 2025 16:32:27 +0900, Sergey Senozhatsky wrote:
> > Is "before" the blk-plug based approach and "after" this new approach?
>
> Sorry, I got the before and after mixed up.

No problem. I wonder if the effect is more visible on larger data sets.
0.3 second sounds like a very short write. In my VM tests I couldn't get
more than 2 inflight requests at a time, I guess because decompression
was much slower than IO. I wonder how many inflight requests you had in
your tests.

> In addition, I also have some related questions:
>
> 1. Will page fault exceptions be delayed during the writeback processing?

I don't think our reads are blocked by writes.

> 2. Since the loop device uses a work queue to handle requests, when
> the system load is relatively high, will it have a relatively large
> impact on the latency of page fault exceptions? Is there any way to solve
> this problem?

I think page-fault latency of a written-back page is expected to be
higher, that's a trade-off that we agree on. Off the top of my head,
I don't think we can do anything about it.

Is loop device always used for writeback targets?
On Fri, 21 Nov 2025 16:58:41 +0900, Sergey Senozhatsky wrote:
> No problem. I wonder if the effect is more visible on larger data sets.
> 0.3 second sounds like a very short write. In my VM tests I couldn't get
> more than 2 inflight requests at a time, I guess because decompression
> was much slower than IO. I wonder how many inflight requests you had in
> your tests.
I used the following code for testing here, and the result was 32.
code:
@@ -983,6 +983,7 @@ static int zram_writeback_slots(struct zram *zram,
struct zram_pp_slot *pps;
int ret = 0, err = 0;
u32 index = 0;
+ int inflight = 0;
while ((pps = select_pp_slot(ctl))) {
spin_lock(&zram->wb_limit_lock);
@@ -993,6 +994,9 @@ static int zram_writeback_slots(struct zram *zram,
}
spin_unlock(&zram->wb_limit_lock);
+ if (inflight < atomic_read(&wb_ctl->num_inflight))
+ inflight = atomic_read(&wb_ctl->num_inflight);
+
while (!req) {
req = zram_select_idle_req(wb_ctl);
if (req)
@@ -1074,6 +1078,7 @@ next:
ret = err;
}
+ pr_err("%s: inflight max: %d\n", __func__, inflight);
return ret;
}
log:
[3741949.842927] zram: zram_writeback_slots: inflight max: 32
Changing ZRAM_WB_REQ_CNT to 64 didn't shorten the overall time.
> I think page-fault latency of a written-back page is expected to be
> higher, that's a trade-off that we agree on. Off the top of my head,
> I don't think we can do anything about it.
>
> Is loop device always used for writeback targets?
On the Android platform, currently only the loop device is supported as
the backend for writeback, possibly for security reasons. I noticed that
EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
latency. I think ZRAM might also be able to do this.
On (25/11/21 16:23), Yuwen Chen wrote:
> I used the following code for testing here, and the result was 32.
>
> code:
> @@ -983,6 +983,7 @@ static int zram_writeback_slots(struct zram *zram,
> struct zram_pp_slot *pps;
> int ret = 0, err = 0;
> u32 index = 0;
> + int inflight = 0;
>
> while ((pps = select_pp_slot(ctl))) {
> spin_lock(&zram->wb_limit_lock);
> @@ -993,6 +994,9 @@ static int zram_writeback_slots(struct zram *zram,
> }
> spin_unlock(&zram->wb_limit_lock);
>
> + if (inflight < atomic_read(&wb_ctl->num_inflight))
> + inflight = atomic_read(&wb_ctl->num_inflight);
> +
> while (!req) {
> req = zram_select_idle_req(wb_ctl);
> if (req)
> @@ -1074,6 +1078,7 @@ next:
> ret = err;
> }
>
> + pr_err("%s: inflight max: %d\n", __func__, inflight);
> return ret;
> }
I think this will always give you 32 (or your current batch size limit),
just because of the way it works - we first deplete all ->idle (reaching
max ->inflight) and only then complete finished requests (dropping
->inflight).
I had a version of the patch that had a different main loop. It would
always first complete finished requests. I think this one will give an
accurate ->inflight number.
---
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index ab0785878069..398609e9d061 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -999,13 +999,6 @@ static int zram_writeback_slots(struct zram *zram,
}
while (!req) {
- req = zram_select_idle_req(wb_ctl);
- if (req)
- break;
-
- wait_event(wb_ctl->done_wait,
- !list_empty(&wb_ctl->done_reqs));
-
err = zram_complete_done_reqs(zram, wb_ctl);
/*
* BIO errors are not fatal, we continue and simply
@@ -1017,6 +1010,13 @@ static int zram_writeback_slots(struct zram *zram,
*/
if (err)
ret = err;
+
+ req = zram_select_idle_req(wb_ctl);
+ if (req)
+ break;
+
+ wait_event(wb_ctl->done_wait,
+ !list_empty(&wb_ctl->done_reqs));
}
if (blk_idx == INVALID_BDEV_BLOCK) {
---
> > I think page-fault latency of a written-back page is expected to be
> > higher, that's a trade-off that we agree on. Off the top of my head,
> > I don't think we can do anything about it.
> >
> > Is loop device always used for writeback targets?
>
> On the Android platform, currently only the loop device is supported as
> the backend for writeback, possibly for security reasons. I noticed that
> EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
> latency. I think ZRAM might also be able to do this.
I see. Do you use S/W or H/W compression?
On 2025/11/21 17:12, Sergey Senozhatsky wrote:
> On (25/11/21 16:23), Yuwen Chen wrote:

..

>>> I think page-fault latency of a written-back page is expected to be
>>> higher, that's a trade-off that we agree on. Off the top of my head,
>>> I don't think we can do anything about it.
>>>
>>> Is loop device always used for writeback targets?
>>
>> On the Android platform, currently only the loop device is supported as
>> the backend for writeback, possibly for security reasons. I noticed that
>> EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
>> latency. I think ZRAM might also be able to do this.
>
> I see. Do you use S/W or H/W compression?

No, I'm pretty sure it's impossible for zram to access
file I/Os without another thread context (e.g. workqueue),
especially for write I/Os, which is unlike erofs:

EROFS can do this because EROFS is a specific filesystem, you
could see it's a separate fs, and it can only read (no
write context) backing files in erofs and/or other fses,
which is much like vfs/overlayfs read_iter() directly
going into the backing fses without nested contexts.
(Even if loop is used, it will create its own thread
contexts with workqueues, which is safe.)

On the other hand, zram/loop can act as a virtual block
device which is rather different, which means you could
format an ext4 filesystem and backing another ext4/btrfs,
like this:

zram(ext4) -> backing ext4/btrfs

It's unsafe (in addition to GFP_NOIO allocation
restriction) since zram cannot manage those ext4/btrfs
existing contexts:

- Take one detailed example, if the upper zram ext4
assigns current->journal_info = xxx, and submit_bio() to
zram, which will confuse the backing ext4 since it should
assume current->journal_info == NULL, so the virtual block
devices need another thread context to isolate those two
different uncontrolled contexts.

So I don't think it's feasible for block drivers to act
like this, especially mixing with writing to backing fses
operations.

Thanks,
Gao Xiang
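[ A minimal sketch of the context-isolation point above; wb_file_work and
  wb_file_work_fn are hypothetical names, not zram or loop code. Deferring
  the backing-file write to a workqueue makes it run in the kworker's own
  task context, where current->journal_info is NULL, instead of inheriting
  whatever the upper filesystem stored in the submitting task. ]

---
struct wb_file_work {
	struct work_struct	work;
	struct file		*backing_file;	/* hypothetical backing file */
	struct page		*page;
	loff_t			pos;
};

static void wb_file_work_fn(struct work_struct *work)
{
	struct wb_file_work *fw = container_of(work, struct wb_file_work, work);
	loff_t pos = fw->pos;

	/*
	 * Runs on a kworker: current->journal_info is NULL here, so the
	 * backing filesystem's journalling code sees a clean context.
	 */
	kernel_write(fw->backing_file, page_address(fw->page), PAGE_SIZE, &pos);
	kfree(fw);
}
---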
On (25/11/21 20:21), Gao Xiang wrote:
> > > > I think page-fault latency of a written-back page is expected to be
> > > > higher, that's a trade-off that we agree on. Off the top of my head,
> > > > I don't think we can do anything about it.
> > > >
> > > > Is loop device always used for writeback targets?
> > >
> > > On the Android platform, currently only the loop device is supported as
> > > the backend for writeback, possibly for security reasons. I noticed that
> > > EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
> > > latency. I think ZRAM might also be able to do this.
> >
> > I see. Do you use S/W or H/W compression?
>
> No, I'm pretty sure it's impossible for zram to access
> file I/Os without another thread context (e.g. workqueue),
> especially for write I/Os, which is unlike erofs:
>
> EROFS can do this because EROFS is a specific filesystem, you
> could see it's a separate fs, and it can only read (no
> write context) backing files in erofs and/or other fses,
> which is much like vfs/overlayfs read_iter() directly
> going into the backing fses without nested contexts.
> (Even if loop is used, it will create its own thread
> contexts with workqueues, which is safe.)
>
> On the other hand, zram/loop can act as a virtual block
> device which is rather different, which means you could
> format an ext4 filesystem and backing another ext4/btrfs,
> like this:
>
> zram(ext4) -> backing ext4/btrfs
>
> It's unsafe (in addition to GFP_NOIO allocation
> restriction) since zram cannot manage those ext4/btrfs
> existing contexts:
>
> - Take one detailed example, if the upper zram ext4
> assigns current->journal_info = xxx, and submit_bio() to
> zram, which will confuse the backing ext4 since it should
> assume current->journal_info == NULL, so the virtual block
> devices need another thread context to isolate those two
> different uncontrolled contexts.
>
> So I don't think it's feasible for block drivers to act
> like this, especially mixing with writing to backing fses
> operations.

Sorry, I don't completely understand your point, but backing
device is never expected to have any fs on it. So from your
email:

> zram(ext4) -> backing ext4/btrfs

This is not a valid configuration, as far as I'm concerned.
Unless I'm missing your point.
On 2025/11/22 18:07, Sergey Senozhatsky wrote:
> On (25/11/21 20:21), Gao Xiang wrote:
>>>>> I think page-fault latency of a written-back page is expected to be
>>>>> higher, that's a trade-off that we agree on. Off the top of my head,
>>>>> I don't think we can do anything about it.
>>>>>
>>>>> Is loop device always used for writeback targets?
>>>>
>>>> On the Android platform, currently only the loop device is supported as
>>>> the backend for writeback, possibly for security reasons. I noticed that
>>>> EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
>>>> latency. I think ZRAM might also be able to do this.
>>>
>>> I see. Do you use S/W or H/W compression?
>>
>> No, I'm pretty sure it's impossible for zram to access
>> file I/Os without another thread context (e.g. workqueue),
>> especially for write I/Os, which is unlike erofs:
>>
>> EROFS can do this because EROFS is a specific filesystem, you
>> could see it's a separate fs, and it can only read (no
>> write context) backing files in erofs and/or other fses,
>> which is much like vfs/overlayfs read_iter() directly
>> going into the backing fses without nested contexts.
>> (Even if loop is used, it will create its own thread
>> contexts with workqueues, which is safe.)
>>
>> On the other hand, zram/loop can act as a virtual block
>> device which is rather different, which means you could
>> format an ext4 filesystem and backing another ext4/btrfs,
>> like this:
>>
>> zram(ext4) -> backing ext4/btrfs
>>
>> It's unsafe (in addition to GFP_NOIO allocation
>> restriction) since zram cannot manage those ext4/btrfs
>> existing contexts:
>>
>> - Take one detailed example, if the upper zram ext4
>> assigns current->journal_info = xxx, and submit_bio() to
>> zram, which will confuse the backing ext4 since it should
>> assume current->journal_info == NULL, so the virtual block
>> devices need another thread context to isolate those two
>> different uncontrolled contexts.
>>
>> So I don't think it's feasible for block drivers to act
>> like this, especially mixing with writing to backing fses
>> operations.
>
> Sorry, I don't completely understand your point, but backing
> device is never expected to have any fs on it. So from your
> email:

zram(ext4) means the zram device itself is formatted as ext4.

>
>> zram(ext4) -> backing ext4/btrfs
>
> This is not a valid configuration, as far as I'm concerned.
> Unless I'm missing your point.

Why is it not valid? zram can be used as a regular virtual
block device, formatted with any fs, and then mounted.

Thanks,
Gao Xiang
On (25/11/22 20:24), Gao Xiang wrote:
> >
> > > zram(ext4) -> backing ext4/btrfs
> >
> > This is not a valid configuration, as far as I'm concerned.
> > Unless I'm missing your point.
>
> Why is it not valid? zram can be used as a regular virtual
> block device, formatted with any fs, and then mounted.

If you want to move data between two filesystems, then just
mount both devices and cp/mv data between them. zram is not
going to do that for you, zram writeback is for a different
purpose.
On 2025/11/23 08:22, Sergey Senozhatsky wrote:
> On (25/11/22 20:24), Gao Xiang wrote:
>>>
>>>> zram(ext4) -> backing ext4/btrfs
>>>
>>> This is not a valid configuration, as far as I'm concerned.
>>> Unless I'm missing your point.
>>
>> Why is it not valid? zram can be used as a regular virtual
>> block device, formatted with any fs, and then mounted.
>
> If you want to move data between two filesystems, then just
> mount both devices and cp/mv data between them. zram is not
> going to do that for you, zram writeback is for a different
> purpose.

No, I know what zram writeback is, and I was definitely not
suggesting using a zram writeback device to mount something
(if you are interested, just check out my first reply, it's
already clear. It also explains why loop devices have needed
a workqueue or a kthread since pre-v2.6 in the first place,
for the same reason).

I want to stop here because it's none of my business.

Thanks,
Gao Xiang
On (25/11/22 20:24), Gao Xiang wrote:
> zram(ext4) means the zram device itself is formatted as ext4.
>
> >
> > > zram(ext4) -> backing ext4/btrfs
> >
> > This is not a valid configuration, as far as I'm concerned.
> > Unless I'm missing your point.
>
> Why is it not valid? zram can be used as a regular virtual
> block device, formatted with any fs, and then mounted.

I thought you were talking about the backing device being
ext4/btrfs. Sorry, I don't have enough context/knowledge
to understand what you're getting at. zram has been doing
writeback for ages, I really don't know what you mean by
"to act like this".
On 2025/11/22 21:43, Sergey Senozhatsky wrote:
> On (25/11/22 20:24), Gao Xiang wrote:
>> zram(ext4) means the zram device itself is formatted as ext4.
>>
>>>
>>>> zram(ext4) -> backing ext4/btrfs
>>>
>>> This is not a valid configuration, as far as I'm concerned.
>>> Unless I'm missing your point.
>>
>> Why is it not valid? zram can be used as a regular virtual
>> block device, formatted with any fs, and then mounted.
>
> I thought you were talking about the backing device being
> ext4/btrfs. Sorry, I don't have enough context/knowledge
> to understand what you're getting at. zram has been doing
> writeback for ages, I really don't know what you mean by
> "to act like this".

I mean, if zram is formatted as ext4 and then mounted, and
there is a backing file which is also on another ext4, you'd
need a workqueue to do writeback I/Os (or a loop device to
transit them); was that the original question raised by
Yuwen? If it's backed by a physical device rather than a
file in a filesystem, such a potential problem doesn't exist.

Thanks,
Gao Xiang
On (25/11/22 22:09), Gao Xiang wrote:
> > I thought you were talking about the backing device being
> > ext4/btrfs. Sorry, I don't have enough context/knowledge
> > to understand what you're getting at. zram has been doing
> > writeback for ages, I really don't know what you mean by
> > "to act like this".
>
> I mean, if zram is formatted as ext4 and then mounted, and
> there is a backing file which is also on another ext4, you'd
> need a workqueue to do writeback I/Os (or a loop device to
> transit them); was that the original question raised by
> Yuwen?

We take pages of data from zram0 and write them straight to
the backing device. Those writes don't go through vfs/fs so
fs on the backing device will simply be corrupted, as far as
I can tell. This is not the intended use case for zram
writeback.
On 2025/11/23 08:08, Sergey Senozhatsky wrote:
> On (25/11/22 22:09), Gao Xiang wrote:
>>> I thought you were talking about the backing device being
>>> ext4/btrfs. Sorry, I don't have enough context/knowledge
>>> to understand what you're getting at. zram has been doing
>>> writeback for ages, I really don't know what you mean by
>>> "to act like this".
>>
>> I mean, if zram is formatted as ext4 and then mounted, and
>> there is a backing file which is also on another ext4, you'd
>> need a workqueue to do writeback I/Os (or a loop device to
>> transit them); was that the original question raised by
>> Yuwen?
>
> We take pages of data from zram0 and write them straight to
> the backing device. Those writes don't go through vfs/fs so
> fs on the backing device will simply be corrupted, as far as
> I can tell. This is not the intended use case for zram
> writeback.

I'm pretty sure you don't understand what I meant. I won't
reply to this anymore, good luck.

Thanks,
Gao Xiang
On 2025/11/21 20:21, Gao Xiang wrote:
>
>
> On 2025/11/21 17:12, Sergey Senozhatsky wrote:
>> On (25/11/21 16:23), Yuwen Chen wrote:
>
> ..
>
>
>>>> I think page-fault latency of a written-back page is expected to be
>>>> higher, that's a trade-off that we agree on. Off the top of my head,
>>>> I don't think we can do anything about it.
>>>>
>>>> Is loop device always used for writeback targets?
>>>
>>> On the Android platform, currently only the loop device is supported as
>>> the backend for writeback, possibly for security reasons. I noticed that
>>> EROFS has implemented a CONFIG_EROFS_FS_BACKED_BY_FILE to reduce this
>>> latency. I think ZRAM might also be able to do this.
>>
>> I see. Do you use S/W or H/W compression?
>
> No, I'm pretty sure it's impossible for zram to access
> file I/Os without another thread context (e.g. workqueue),
> especially for write I/Os, which is unlike erofs:
>
> EROFS can do this because EROFS is a specific filesystem, you
> could see it's a separate fs, and it can only read (no
> write context) backing files in erofs and/or other fses,
> which is much like vfs/overlayfs read_iter() directly
> going into the backing fses without nested contexts.
> (Even if loop is used, it will create its own thread
> contexts with workqueues, which is safe.)
>
> On the other hand, zram/loop can act as a virtual block
> device which is rather different, which means you could
> format an ext4 filesystem and backing another ext4/btrfs,
> like this:
>
> zram(ext4) -> backing ext4/btrfs
>
> It's unsafe (in addition to GFP_NOIO allocation
> restriction) since zram cannot manage those ext4/btrfs
> existing contexts:
>
> - Take one detailed example, if the upper zram ext4
> assigns current->journal_info = xxx, and submit_bio() to
> zram, which will confuse the backing ext4 since it should
> assume current->journal_info == NULL, so the virtual block
> devices need another thread context to isolate those two
> different uncontrolled contexts.
>
> So I don't think it's feasible for block drivers to act
> like this, especially mixing with writing to backing fses
> operations.
In other words, a fs can claim it does file-backed mounts
without a new context only if:

- Its own implementation can be safely applied to any
other kernel filesystem (e.g. it shouldn't change
current->journal_info or do context save/restore before
handing over); and its own implementation can safely
mount itself with file-backed mounts.

So it's up to filesystem-specific internals to make sure
it can work like this (for example, for ext4 on erofs,
ext4 still uses loop to mount). The virtual block device
layer knows nothing about what the upper filesystem did
before execution passes through it, so it's unsafe to
work like this.
Thanks,
Gao Xiang