From: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
	yosry.ahmed@linux.dev, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net, clabbe@baylibre.com,
	ardb@kernel.org, ebiggers@google.com, surenb@google.com,
	kristen.c.accardi@intel.com
Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com,
	kanchana.p.sridhar@intel.com
Subject: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression
	with 2M folios.
Date: Wed, 5 Feb 2025 23:21:02 -0800
Message-Id: <20250206072102.29045-17-kanchana.p.sridhar@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20250206072102.29045-1-kanchana.p.sridhar@intel.com>
References: <20250206072102.29045-1-kanchana.p.sridhar@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

With the previous patch that enables support for batch compressions in
zswap_compress_folio(), a 6.2% throughput regression was seen with zstd
and 2M folios, using vm-scalability/usemem.

For compressors that don't support batching, this was root-caused to the
following zswap_store_folio() structure:

 Batched stores:
 ---------------
 - Allocate all entries,
 - Compress all entries,
 - Store all entries in xarray/LRU.

Hence, the above structure is maintained only for batched stores, and the
following structure is implemented for sequential stores of large folio
pages, which fixes the zstd regression while preserving common code paths
for batched and sequential stores of a folio:

 Sequential stores:
 ------------------
 For each page in folio:
 - allocate an entry,
 - compress the page,
 - store the entry in xarray/LRU.

This is submitted as a separate patch only for code review purposes. I will
squash it with the previous commit in subsequent versions of this patch
series. (A standalone sketch of the resulting store loop is appended after
the diff for reference.)

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 193 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 111 insertions(+), 82 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f1cba77eda62..7bfc720a6201 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1592,40 +1592,32 @@ static bool zswap_batch_compress(struct folio *folio,
 	return ret;
 }
 
-static bool zswap_compress_folio(struct folio *folio,
-				 struct zswap_entry *entries[],
-				 struct zswap_pool *pool)
+static __always_inline bool zswap_compress_folio(struct folio *folio,
+						 struct zswap_entry *entries[],
+						 long from_index,
+						 struct zswap_pool *pool,
+						 unsigned int batch_size,
+						 struct crypto_acomp_ctx *acomp_ctx)
 {
-	long index, nr_pages = folio_nr_pages(folio);
-	struct crypto_acomp_ctx *acomp_ctx;
-	unsigned int batch_size;
+	long index = from_index, nr_pages = folio_nr_pages(folio);
 	bool ret = true;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
-	batch_size = acomp_ctx->nr_reqs;
-
 	if ((batch_size > 1) && (nr_pages > 1)) {
-		for (index = 0; index < nr_pages; index += batch_size) {
+		for (; index < nr_pages; index += batch_size) {
 
 			if (!zswap_batch_compress(folio, index, batch_size,
 						  &entries[index], pool, acomp_ctx)) {
 				ret = false;
-				goto unlock_acomp_ctx;
+				break;
 			}
 		}
 	} else {
-		for (index = 0; index < nr_pages; ++index) {
-			struct page *page = folio_page(folio, index);
+		struct page *page = folio_page(folio, index);
 
-			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
-				ret = false;
-				goto unlock_acomp_ctx;
-			}
-		}
+		if (!zswap_compress(page, entries[index], pool, acomp_ctx))
+			ret = false;
 	}
 
-unlock_acomp_ctx:
-	acomp_ctx_put_unlock(acomp_ctx);
 	return ret;
 }
 
@@ -1637,92 +1629,128 @@ static bool zswap_compress_folio(struct folio *folio,
  * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
  * entry's handle is subsequently modified only upon a successful zpool_malloc()
  * after the page is compressed.
+ *
+ * For compressors that don't support batching, the following structure
+ * showed a performance regression with zstd/2M folios:
+ *
+ * Batched stores:
+ * ---------------
+ * - Allocate all entries,
+ * - Compress all entries,
+ * - Store all entries in xarray/LRU.
+ *
+ * Hence, the above structure is maintained only for batched stores, and the
+ * following structure is implemented for sequential stores of large folio pages,
+ * that fixes the regression, while preserving common code paths for batched
+ * and sequential stores of a folio:
+ *
+ * Sequential stores:
+ * ------------------
+ * For each page in folio:
+ * - allocate an entry,
+ * - compress the page,
+ * - store the entry in xarray/LRU.
  */
 static bool zswap_store_folio(struct folio *folio,
 			      struct obj_cgroup *objcg,
 			      struct zswap_pool *pool)
 {
-	long index, from_index = 0, nr_pages = folio_nr_pages(folio);
+	long index = 0, from_index = 0, nr_pages, nr_folio_pages = folio_nr_pages(folio);
 	struct zswap_entry **entries = NULL;
+	struct crypto_acomp_ctx *acomp_ctx;
 	int node_id = folio_nid(folio);
+	unsigned int batch_size;
 
-	entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
+	entries = kmalloc(nr_folio_pages * sizeof(*entries), GFP_KERNEL);
 	if (!entries)
 		return false;
 
-	for (index = 0; index < nr_pages; ++index) {
-		entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	batch_size = acomp_ctx->nr_reqs;
 
-		if (!entries[index]) {
-			zswap_reject_kmemcache_fail++;
-			nr_pages = index;
-			goto store_folio_failed;
+	nr_pages = (batch_size > 1) ? nr_folio_pages : 1;
+
+	while (1) {
+		for (index = from_index; index < nr_pages; ++index) {
+			entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+
+			if (!entries[index]) {
+				zswap_reject_kmemcache_fail++;
+				nr_pages = index;
+				goto store_folio_failed;
+			}
+
+			entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
 		}
 
-		entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
-	}
+		if (!zswap_compress_folio(folio, entries, from_index, pool, batch_size, acomp_ctx))
+			goto store_folio_failed;
 
-	if (!zswap_compress_folio(folio, entries, pool))
-		goto store_folio_failed;
+		for (index = from_index; index < nr_pages; ++index) {
+			swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
+			struct zswap_entry *old, *entry = entries[index];
 
-	for (index = 0; index < nr_pages; ++index) {
-		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
-		struct zswap_entry *old, *entry = entries[index];
+			old = xa_store(swap_zswap_tree(page_swpentry),
+				       swp_offset(page_swpentry),
+				       entry, GFP_KERNEL);
+			if (xa_is_err(old)) {
+				int err = xa_err(old);
 
-		old = xa_store(swap_zswap_tree(page_swpentry),
-			       swp_offset(page_swpentry),
-			       entry, GFP_KERNEL);
-		if (xa_is_err(old)) {
-			int err = xa_err(old);
+				WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+				zswap_reject_alloc_fail++;
+				from_index = index;
+				goto store_folio_failed;
+			}
 
-			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-			zswap_reject_alloc_fail++;
-			from_index = index;
-			goto store_folio_failed;
-		}
+			/*
+			 * We may have had an existing entry that became stale when
+			 * the folio was redirtied and now the new version is being
+			 * swapped out. Get rid of the old.
+			 */
+			if (old)
+				zswap_entry_free(old);
 
-		/*
-		 * We may have had an existing entry that became stale when
-		 * the folio was redirtied and now the new version is being
-		 * swapped out. Get rid of the old.
-		 */
-		if (old)
-			zswap_entry_free(old);
+			/*
+			 * The entry is successfully compressed and stored in the tree, there is
+			 * no further possibility of failure. Grab refs to the pool and objcg,
+			 * charge zswap memory, and increment zswap_stored_pages.
+			 * The opposite actions will be performed by zswap_entry_free()
+			 * when the entry is removed from the tree.
+			 */
+			zswap_pool_get(pool);
+			if (objcg) {
+				obj_cgroup_get(objcg);
+				obj_cgroup_charge_zswap(objcg, entry->length);
+			}
+			atomic_long_inc(&zswap_stored_pages);
 
-		/*
-		 * The entry is successfully compressed and stored in the tree, there is
-		 * no further possibility of failure. Grab refs to the pool and objcg,
-		 * charge zswap memory, and increment zswap_stored_pages.
-		 * The opposite actions will be performed by zswap_entry_free()
-		 * when the entry is removed from the tree.
-		 */
-		zswap_pool_get(pool);
-		if (objcg) {
-			obj_cgroup_get(objcg);
-			obj_cgroup_charge_zswap(objcg, entry->length);
+			/*
+			 * We finish initializing the entry while it's already in xarray.
+			 * This is safe because:
+			 *
+			 * 1. Concurrent stores and invalidations are excluded by folio lock.
+			 *
+			 * 2. Writeback is excluded by the entry not being on the LRU yet.
+			 *    The publishing order matters to prevent writeback from seeing
+			 *    an incoherent entry.
+			 */
+			entry->pool = pool;
+			entry->swpentry = page_swpentry;
+			entry->objcg = objcg;
+			entry->referenced = true;
+			if (entry->length) {
+				INIT_LIST_HEAD(&entry->lru);
+				zswap_lru_add(&zswap_list_lru, entry);
+			}
 		}
-		atomic_long_inc(&zswap_stored_pages);
 
-		/*
-		 * We finish initializing the entry while it's already in xarray.
-		 * This is safe because:
-		 *
-		 * 1. Concurrent stores and invalidations are excluded by folio lock.
-		 *
-		 * 2. Writeback is excluded by the entry not being on the LRU yet.
-		 *    The publishing order matters to prevent writeback from seeing
-		 *    an incoherent entry.
-		 */
-		entry->pool = pool;
-		entry->swpentry = page_swpentry;
-		entry->objcg = objcg;
-		entry->referenced = true;
-		if (entry->length) {
-			INIT_LIST_HEAD(&entry->lru);
-			zswap_lru_add(&zswap_list_lru, entry);
-		}
+		from_index = nr_pages++;
+
+		if (nr_pages > nr_folio_pages)
+			break;
 	}
 
+	acomp_ctx_put_unlock(acomp_ctx);
 	kfree(entries);
 	return true;
 
@@ -1734,6 +1762,7 @@ static bool zswap_store_folio(struct folio *folio,
 		zswap_entry_cache_free(entries[index]);
 	}
 
+	acomp_ctx_put_unlock(acomp_ctx);
 	kfree(entries);
 	return false;
 }
-- 
2.27.0
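
For reviewers who prefer to see the control flow in isolation, below is a small
standalone sketch (not mm/zswap.c itself) of the store loop this patch gives
zswap_store_folio(). The helpers alloc_entry(), compress_page() and store_entry()
are hypothetical stand-ins for the real entry allocation, compression and
xarray/LRU steps; locking, the acomp context and error unwinding are omitted.
With batch_size > 1 the whole folio is processed in one pass (allocate all,
compress all, store all); with batch_size == 1 each iteration allocates,
compresses and stores a single page, which is the ordering that recovers the
zstd throughput.

#include <stdbool.h>
#include <stdio.h>

#define FOLIO_PAGES 8	/* stand-in for folio_nr_pages() of a large folio */

static bool alloc_entry(long i)   { printf("alloc    page %ld\n", i); return true; }
static bool compress_page(long i) { printf("compress page %ld\n", i); return true; }
static void store_entry(long i)   { printf("store    page %ld\n", i); }

static bool store_folio(long nr_folio_pages, unsigned int batch_size)
{
	long index, from_index = 0;
	/* Batched: cover the whole folio in one pass; sequential: one page per pass. */
	long nr_pages = (batch_size > 1) ? nr_folio_pages : 1;

	while (1) {
		for (index = from_index; index < nr_pages; ++index)
			if (!alloc_entry(index))
				return false;

		for (index = from_index; index < nr_pages; ++index)
			if (!compress_page(index))
				return false;

		for (index = from_index; index < nr_pages; ++index)
			store_entry(index);

		/* Advance to the next page (sequential) or terminate (batched). */
		from_index = nr_pages++;
		if (nr_pages > nr_folio_pages)
			break;
	}
	return true;
}

int main(void)
{
	puts("batched store (batch_size = 8):");
	store_folio(FOLIO_PAGES, 8);
	puts("sequential store (batch_size = 1):");
	store_folio(FOLIO_PAGES, 1);
	return 0;
}

Note that in the actual patch the batched compress step is further chunked by
acomp_ctx->nr_reqs inside zswap_compress_folio(); the sketch only models the
whole-folio vs. per-page split that matters for non-batching compressors such
as zstd.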