From: Sergey Senozhatsky
To: Andrew Morton, Richard Chang, Minchan Kim
Cc: Brian Geffon, David Stevens, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, Sergey Senozhatsky, Minchan Kim
Subject: [PATCHv2 1/7] zram: introduce compressed data writeback
Date: Mon, 1 Dec 2025 18:47:48 +0900
Message-ID: <20251201094754.4149975-2-senozhatsky@chromium.org>
X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog
In-Reply-To: <20251201094754.4149975-1-senozhatsky@chromium.org>
References: <20251201094754.4149975-1-senozhatsky@chromium.org>

From: Richard Chang

zram stores all written back slots raw, which implies that during
writeback zram first has to decompress slots (except for ZRAM_HUGE
slots, which are raw already).  The problem with this approach is
that not every written back page gets read back (either via read()
or via page-fault), which means that zram basically wastes CPU
cycles and battery decompressing such slots.  This changes with
the introduction of decompression on demand, in other words
decompression on read()/page-fault.

One caveat of decompression on demand is that async read is completed
in IRQ context, while zram decompression is sleepable.  To work around
this, read-back decompression is offloaded to a preemptible context -
system high-prio work-queue.

At this point compressed writeback is still disabled; a follow-up
patch will introduce a new device attribute which will make it
possible to toggle compressed writeback per-device.

[senozhatsky: rewrote original implementation]
Signed-off-by: Richard Chang
Co-developed-by: Sergey Senozhatsky
Suggested-by: Minchan Kim
Suggested-by: Brian Geffon
---
 drivers/block/zram/zram_drv.c | 279 +++++++++++++++++++++++++++-------
 drivers/block/zram/zram_drv.h |   1 +
 2 files changed, 227 insertions(+), 53 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5759823d6314..6263d300312e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -57,9 +57,6 @@ static size_t huge_class_size;
 static const struct block_device_operations zram_devops;
 
 static void zram_free_page(struct zram *zram, size_t index);
-static int zram_read_from_zspool(struct zram *zram, struct page *page,
-				 u32 index);
-
 #define slot_dep_map(zram, index) (&(zram)->table[(index)].dep_map)
 
 static void zram_slot_lock_init(struct zram *zram, u32 index)
@@ -502,6 +499,10 @@ static ssize_t idle_store(struct device *dev,
 #ifdef CONFIG_ZRAM_WRITEBACK
 #define INVALID_BDEV_BLOCK (~0UL)
 
+static int read_from_zspool_raw(struct zram *zram, struct page *page,
+				u32 index);
+static int read_from_zspool(struct zram *zram, struct page *page, u32 index);
+
 struct zram_wb_ctl {
 	/* idle list is accessed only by the writeback task, no concurency */
 	struct list_head idle_reqs;
@@ -522,6 +523,22 @@ struct zram_wb_req {
 	struct list_head entry;
 };
 
+struct zram_rb_req {
+	struct work_struct work;
+	struct zram *zram;
+	struct page *page;
+	/* The read bio for backing device */
+	struct bio *bio;
+	unsigned long blk_idx;
+	union {
+		/* The original bio to complete (async read) */
+		struct bio *parent;
+		/* error status (sync read) */
+		int error;
+	};
+	u32 index;
+};
+
 static ssize_t writeback_limit_enable_store(struct device *dev,
 					    struct device_attribute *attr,
 					    const char *buf, size_t len)
@@ -780,18 +797,6 @@ static void zram_release_bdev_block(struct zram *zram, unsigned long blk_idx)
 	atomic64_dec(&zram->stats.bd_count);
 }
 
-static void read_from_bdev_async(struct zram *zram, struct page *page,
-				 unsigned long entry, struct bio *parent)
-{
-	struct bio *bio;
-
-	bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
-	bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
-	__bio_add_page(bio, page, PAGE_SIZE, 0);
-	bio_chain(bio, parent);
-	submit_bio(bio);
-}
-
 static void release_wb_req(struct zram_wb_req *req)
 {
 	__free_page(req->page);
@@ -886,8 +891,9 @@ static void zram_account_writeback_submit(struct zram *zram)
 
 static int zram_writeback_complete(struct zram *zram, struct zram_wb_req *req)
 {
-	u32 index = req->pps->index;
-	int err;
+	u32 size, index = req->pps->index;
+	int err, prio;
+	bool huge;
 
 	err = blk_status_to_errno(req->bio.bi_status);
 	if (err) {
@@ -914,9 +920,27 @@ static int zram_writeback_complete(struct zram *zram, struct zram_wb_req *req)
 		goto out;
 	}
 
+	if (zram->wb_compressed) {
+		/*
+		 * ZRAM_WB slots get freed, we need to preserve data required
+		 * for read decompression.
+		 */
+		size = zram_get_obj_size(zram, index);
+		prio = zram_get_priority(zram, index);
+		huge = zram_test_flag(zram, index, ZRAM_HUGE);
+	}
+
 	zram_free_page(zram, index);
 	zram_set_flag(zram, index, ZRAM_WB);
 	zram_set_handle(zram, index, req->blk_idx);
+
+	if (zram->wb_compressed) {
+		if (huge)
+			zram_set_flag(zram, index, ZRAM_HUGE);
+		zram_set_obj_size(zram, index, size);
+		zram_set_priority(zram, index, prio);
+	}
+
 	atomic64_inc(&zram->stats.pages_stored);
 
 out:
@@ -1050,7 +1074,11 @@ static int zram_writeback_slots(struct zram *zram,
 		 */
 		if (!zram_test_flag(zram, index, ZRAM_PP_SLOT))
 			goto next;
-		if (zram_read_from_zspool(zram, req->page, index))
+		if (zram->wb_compressed)
+			err = read_from_zspool_raw(zram, req->page, index);
+		else
+			err = read_from_zspool(zram, req->page, index);
+		if (err)
 			goto next;
 		zram_slot_unlock(zram, index);
 
@@ -1313,24 +1341,140 @@ static ssize_t writeback_store(struct device *dev,
 	return ret;
 }
 
-struct zram_work {
-	struct work_struct work;
-	struct zram *zram;
-	unsigned long entry;
-	struct page *page;
-	int error;
-};
+static int decompress_bdev_page(struct zram *zram, struct page *page, u32 index)
+{
+	struct zcomp_strm *zstrm;
+	unsigned int size;
+	int ret, prio;
+	void *src;
+
+	zram_slot_lock(zram, index);
+	/* Since slot was unlocked we need to make sure it's still ZRAM_WB */
+	if (!zram_test_flag(zram, index, ZRAM_WB)) {
+		zram_slot_unlock(zram, index);
+		/* We read some stale data, zero it out */
+		memset_page(page, 0, 0, PAGE_SIZE);
+		return -EIO;
+	}
+
+	if (zram_test_flag(zram, index, ZRAM_HUGE)) {
+		zram_slot_unlock(zram, index);
+		return 0;
+	}
+
+	size = zram_get_obj_size(zram, index);
+	prio = zram_get_priority(zram, index);
 
-static void zram_sync_read(struct work_struct *work)
+	zstrm = zcomp_stream_get(zram->comps[prio]);
+	src = kmap_local_page(page);
+	ret = zcomp_decompress(zram->comps[prio], zstrm, src, size,
+			       zstrm->local_copy);
+	if (!ret)
+		copy_page(src, zstrm->local_copy);
+	kunmap_local(src);
+	zcomp_stream_put(zstrm);
+	zram_slot_unlock(zram, index);
+
+	return ret;
+}
+
+static void zram_deferred_decompress(struct work_struct *w)
 {
-	struct zram_work *zw = container_of(work, struct zram_work, work);
+	struct zram_rb_req *req = container_of(w, struct zram_rb_req, work);
+	struct page *page = bio_first_page_all(req->bio);
+	struct zram *zram = req->zram;
+	u32 index = req->index;
+	int ret;
+
+	ret = decompress_bdev_page(zram, page, index);
+	if (ret)
+		req->parent->bi_status = BLK_STS_IOERR;
+
+	/* Decrement parent's ->remaining */
+	bio_endio(req->parent);
+	bio_put(req->bio);
+	kfree(req);
+}
+
+static void zram_async_read_endio(struct bio *bio)
+{
+	struct zram_rb_req *req = bio->bi_private;
+	struct zram *zram = req->zram;
+
+	if (bio->bi_status) {
+		req->parent->bi_status = bio->bi_status;
+		bio_endio(req->parent);
+		bio_put(bio);
+		kfree(req);
+		return;
+	}
+
+	/*
+	 * NOTE: zram_async_read_endio() is not exactly the right place for this.
+	 * Ideally, we need to do it after ZRAM_WB check, but this requires
+	 * us to use wq path even on systems that don't enable compressed
+	 * writeback, because we cannot take slot-lock in the current context.
+	 *
+	 * Keep the existing behavior for now.
+	 */
+	if (zram->wb_compressed == false) {
+		/* No decompression needed, complete the parent IO */
+		bio_endio(req->parent);
+		bio_put(bio);
+		kfree(req);
+		return;
+	}
+
+	/*
+	 * zram decompression is sleepable, so we need to defer it to
+	 * a preemptible context.
+	 */
+	INIT_WORK(&req->work, zram_deferred_decompress);
+	queue_work(system_highpri_wq, &req->work);
+}
+
+static void read_from_bdev_async(struct zram *zram, struct page *page,
+				 u32 index, unsigned long blk_idx,
+				 struct bio *parent)
+{
+	struct zram_rb_req *req;
+	struct bio *bio;
+
+	req = kmalloc(sizeof(*req), GFP_NOIO);
+	if (!req)
+		return;
+
+	bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
+	if (!bio) {
+		kfree(req);
+		return;
+	}
+
+	req->zram = zram;
+	req->index = index;
+	req->blk_idx = blk_idx;
+	req->bio = bio;
+	req->parent = parent;
+
+	bio->bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+	bio->bi_private = req;
+	bio->bi_end_io = zram_async_read_endio;
+
+	__bio_add_page(bio, page, PAGE_SIZE, 0);
+	bio_inc_remaining(parent);
+	submit_bio(bio);
+}
+
+static void zram_sync_read(struct work_struct *w)
+{
+	struct zram_rb_req *req = container_of(w, struct zram_rb_req, work);
 	struct bio_vec bv;
 	struct bio bio;
 
-	bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
-	bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
-	__bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
-	zw->error = submit_bio_wait(&bio);
+	bio_init(&bio, req->zram->bdev, &bv, 1, REQ_OP_READ);
+	bio.bi_iter.bi_sector = req->blk_idx * (PAGE_SIZE >> 9);
+	__bio_add_page(&bio, req->page, PAGE_SIZE, 0);
+	req->error = submit_bio_wait(&bio);
 }
 
 /*
@@ -1338,39 +1482,42 @@ static void zram_sync_read(struct work_struct *work)
  * chained IO with parent IO in same context, it's a deadlock. To avoid that,
  * use a worker thread context.
  */
-static int read_from_bdev_sync(struct zram *zram, struct page *page,
-				unsigned long entry)
+static int read_from_bdev_sync(struct zram *zram, struct page *page, u32 index,
+			       unsigned long blk_idx)
 {
-	struct zram_work work;
+	struct zram_rb_req req;
 
-	work.page = page;
-	work.zram = zram;
-	work.entry = entry;
+	req.page = page;
+	req.zram = zram;
+	req.blk_idx = blk_idx;
 
-	INIT_WORK_ONSTACK(&work.work, zram_sync_read);
-	queue_work(system_dfl_wq, &work.work);
-	flush_work(&work.work);
-	destroy_work_on_stack(&work.work);
+	INIT_WORK_ONSTACK(&req.work, zram_sync_read);
+	queue_work(system_dfl_wq, &req.work);
+	flush_work(&req.work);
+	destroy_work_on_stack(&req.work);
 
-	return work.error;
+	if (req.error || zram->wb_compressed == false)
+		return req.error;
+
+	return decompress_bdev_page(zram, page, index);
 }
 
-static int read_from_bdev(struct zram *zram, struct page *page,
-			  unsigned long entry, struct bio *parent)
+static int read_from_bdev(struct zram *zram, struct page *page, u32 index,
+			  unsigned long blk_idx, struct bio *parent)
 {
 	atomic64_inc(&zram->stats.bd_reads);
 	if (!parent) {
 		if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
 			return -EIO;
-		return read_from_bdev_sync(zram, page, entry);
+		return read_from_bdev_sync(zram, page, index, blk_idx);
 	}
-	read_from_bdev_async(zram, page, entry, parent);
+	read_from_bdev_async(zram, page, index, blk_idx, parent);
 	return 0;
 }
 #else
 static inline void reset_bdev(struct zram *zram) {};
-static int read_from_bdev(struct zram *zram, struct page *page,
-			  unsigned long entry, struct bio *parent)
+static int read_from_bdev(struct zram *zram, struct page *page, u32 index,
+			  unsigned long blk_idx, struct bio *parent)
 {
 	return -EIO;
 }
@@ -1977,12 +2124,37 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
 	return ret;
 }
 
+#if defined CONFIG_ZRAM_WRITEBACK
+static int read_from_zspool_raw(struct zram *zram, struct page *page, u32 index)
+{
+	struct zcomp_strm *zstrm;
+	unsigned long handle;
+	unsigned int size;
+	void *src;
+
+	handle = zram_get_handle(zram, index);
+	size = zram_get_obj_size(zram, index);
+
+	/*
+	 * We need to get stream just for ->local_copy buffer, in
+	 * case if object spans two physical pages. No decompression
+	 * takes place here, as we read raw compressed data.
+	 */
+	zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
+	src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
+	memcpy_to_page(page, 0, src, size);
+	zs_obj_read_end(zram->mem_pool, handle, src);
+	zcomp_stream_put(zstrm);
+
+	return 0;
+}
+#endif
+
 /*
  * Reads (decompresses if needed) a page from zspool (zsmalloc).
  * Corresponding ZRAM slot should be locked.
  */
-static int zram_read_from_zspool(struct zram *zram, struct page *page,
-				 u32 index)
+static int read_from_zspool(struct zram *zram, struct page *page, u32 index)
 {
 	if (zram_test_flag(zram, index, ZRAM_SAME) ||
 	    !zram_get_handle(zram, index))
@@ -2002,7 +2174,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
 	zram_slot_lock(zram, index);
 	if (!zram_test_flag(zram, index, ZRAM_WB)) {
 		/* Slot should be locked through out the function call */
-		ret = zram_read_from_zspool(zram, page, index);
+		ret = read_from_zspool(zram, page, index);
 		zram_slot_unlock(zram, index);
 	} else {
 		unsigned long blk_idx = zram_get_handle(zram, index);
@@ -2012,7 +2184,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
 		 * device.
 		 */
 		zram_slot_unlock(zram, index);
-		ret = read_from_bdev(zram, page, blk_idx, parent);
+		ret = read_from_bdev(zram, page, index, blk_idx, parent);
 	}
 
 	/* Should NEVER happen. Return bio error if it does. */
@@ -2273,7 +2445,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
 	if (comp_len_old < threshold)
 		return 0;
 
-	ret = zram_read_from_zspool(zram, page, index);
+	ret = read_from_zspool(zram, page, index);
 	if (ret)
 		return ret;
 
@@ -2960,6 +3132,7 @@ static int zram_add(void)
 	init_rwsem(&zram->init_lock);
 #ifdef CONFIG_ZRAM_WRITEBACK
 	zram->wb_batch_size = 32;
+	zram->wb_compressed = false;
 #endif
 
 	/* gendisk structure */
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index c6d94501376c..72fdf66c78ab 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -128,6 +128,7 @@ struct zram {
 #ifdef CONFIG_ZRAM_WRITEBACK
 	struct file *backing_dev;
 	bool wb_limit_enable;
+	bool wb_compressed;
 	u32 wb_batch_size;
 	u64 bd_wb_limit;
 	struct block_device *bdev;
-- 
2.52.0.487.g5c8c507ade-goog
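
The read-back flow from the commit message (the completion handler must not
do sleepable work, so it only queues the request, and a worker later
re-checks the slot and decompresses in place) can be sketched in plain
userspace C. This is a minimal illustration under stated assumptions, not
the driver code: every toy_* name, the 16-byte "page" and the run-length
format are hypothetical, and a simple array stands in for
system_highpri_wq.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* hypothetical page size for this sketch */
#define TOY_MAX_WORK 8

/* Hypothetical slot metadata that has to survive writeback. */
struct toy_slot {
	unsigned int comp_size;	/* compressed length, kept for later decompression */
	int written_back;	/* plays the role of the ZRAM_WB flag */
};

struct toy_read_req {
	struct toy_slot *slot;
	unsigned char page[TOY_PAGE_SIZE];	/* raw data from the backing device */
	int done;
	int error;
};

/* Toy run-length decoder: input is a sequence of (count, byte) pairs. */
static int toy_decompress(const unsigned char *src, unsigned int len,
			  unsigned char *dst, unsigned int dst_len)
{
	unsigned int out = 0;
	unsigned int i;

	for (i = 0; i + 1 < len; i += 2) {
		if (out + src[i] > dst_len)
			return -1;
		memset(dst + out, src[i + 1], src[i]);
		out += src[i];
	}
	return out == dst_len ? 0 : -1;
}

/* Stand-in for a work queue: a fixed array of pending requests. */
static struct toy_read_req *toy_workq[TOY_MAX_WORK];
static int toy_workq_len;

/* Models the read completion handler: it only queues work. */
static void toy_read_endio(struct toy_read_req *req)
{
	assert(toy_workq_len < TOY_MAX_WORK);
	toy_workq[toy_workq_len++] = req;
}

/* Models the worker: re-check the slot, then decompress in place. */
static void toy_run_deferred_work(void)
{
	unsigned char tmp[TOY_PAGE_SIZE];
	int i;

	for (i = 0; i < toy_workq_len; i++) {
		struct toy_read_req *req = toy_workq[i];

		if (!req->slot->written_back) {
			/* Slot changed in the meantime: data is stale, zero it. */
			memset(req->page, 0, TOY_PAGE_SIZE);
			req->error = -1;
		} else if (toy_decompress(req->page, req->slot->comp_size,
					  tmp, TOY_PAGE_SIZE)) {
			req->error = -1;
		} else {
			memcpy(req->page, tmp, TOY_PAGE_SIZE);
		}
		req->done = 1;
	}
	toy_workq_len = 0;
}

/* Returns 0 when the deferred path produced the expected page. */
int toy_demo(void)
{
	static const unsigned char compressed[] = { 8, 'A', 8, 'B' };
	struct toy_slot slot = { sizeof(compressed), 1 };
	struct toy_read_req req = { 0 };

	req.slot = &slot;
	memcpy(req.page, compressed, sizeof(compressed));

	toy_read_endio(&req);		/* "IRQ context": nothing decompressed yet */
	if (req.done)
		return -1;

	toy_run_deferred_work();	/* "worker context": decompression happens */
	if (!req.done || req.error)
		return -1;
	return memcmp(req.page, "AAAAAAAABBBBBBBB", TOY_PAGE_SIZE) ? -1 : 0;
}
```

Calling toy_demo() from a main() exercises both halves. The sketch only
works because comp_size is kept with the slot after writeback, which is why
the patch saves and restores the object size, priority and ZRAM_HUGE flag
around zram_free_page().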