From nobody Fri Nov 29 00:58:47 2024
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Andrew Morton
Cc: Johannes Weiner, Matthew Wilcox, Omar Sandoval, Chris Mason,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Meta kernel team
Subject: [PATCH v2 1/2] mm: optimize truncation of shadow entries
Date: Wed, 25 Sep 2024 15:47:15 -0700
Message-ID: <20240925224716.2904498-2-shakeel.butt@linux.dev>
In-Reply-To: <20240925224716.2904498-1-shakeel.butt@linux.dev>
References: <20240925224716.2904498-1-shakeel.butt@linux.dev>

The kernel truncates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache
tree for each individual entry to remove them. This patch optimizes
that by removing them in a single tree traversal.
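To make the access-pattern change concrete, here is an illustrative
sketch (not part of the patch): the old code restarts a tree walk from
the root for every shadow entry, while the new code walks the tree once
across the whole batch range. Locking, DAX handling and the workingset
node callback are deliberately omitted here.

	/* Old pattern: one full tree descent per shadow entry. */
	static void clear_shadows_per_entry(struct address_space *mapping,
					    pgoff_t *indices, int nr)
	{
		int i;

		for (i = 0; i < nr; i++) {
			XA_STATE(xas, &mapping->i_pages, indices[i]);

			if (xa_is_value(xas_load(&xas)))
				xas_store(&xas, NULL);
		}
	}

	/* New pattern: one walk covering [indices[0], indices[nr - 1]]. */
	static void clear_shadows_single_walk(struct address_space *mapping,
					      pgoff_t *indices, int nr)
	{
		XA_STATE(xas, &mapping->i_pages, indices[0]);
		void *entry;

		xas_for_each(&xas, entry, indices[nr - 1])
			if (xa_is_value(entry))
				xas_store(&xas, NULL);
	}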
On large machines in our production environment which run workloads
manipulating large amounts of data, we have observed that a significant
amount of CPU time is spent truncating very large files (100s of GiB in
size). More specifically, most of that time was spent cleaning up the
shadow entries, so optimizing the shadow entry cleanup, even a little
bit, has a good impact.

To evaluate the changes, we created a 200GiB file on a fuse filesystem
inside a memcg, generated shadow entries by triggering reclaim through
memory.reclaim in that memcg, and measured a simple truncation
operation.

 # time truncate -s 0 file

              time (sec)
 Without      5.164 +- 0.059
 With-patch   4.21  +- 0.066 (18.47% decrease)

Acked-by: Johannes Weiner
Signed-off-by: Shakeel Butt
---
Changes since v1:
- Added a comment on the assumption of indices array (Johannes)

 mm/truncate.c | 53 ++++++++++++++++++++++++++---------------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..1d51c023d9c5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -68,54 +68,53 @@ static void clear_shadow_entries(struct address_space *mapping,
  * Unconditionally remove exceptional entries. Usually called from truncate
  * path. Note that the folio_batch may be altered by this function by removing
  * exceptional entries similar to what folio_batch_remove_exceptionals() does.
+ * Please note that indices[] has entries in ascending order as guaranteed by
+ * either find_get_entries() or find_lock_entries().
  */
 static void truncate_folio_batch_exceptionals(struct address_space *mapping,
 			struct folio_batch *fbatch, pgoff_t *indices)
 {
+	XA_STATE(xas, &mapping->i_pages, indices[0]);
+	int nr = folio_batch_count(fbatch);
+	struct folio *folio;
 	int i, j;
-	bool dax;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
 
-	for (j = 0; j < folio_batch_count(fbatch); j++)
+	for (j = 0; j < nr; j++)
 		if (xa_is_value(fbatch->folios[j]))
 			break;
 
-	if (j == folio_batch_count(fbatch))
+	if (j == nr)
 		return;
 
-	dax = dax_mapping(mapping);
-	if (!dax) {
-		spin_lock(&mapping->host->i_lock);
-		xa_lock_irq(&mapping->i_pages);
+	if (dax_mapping(mapping)) {
+		for (i = j; i < nr; i++) {
+			if (xa_is_value(fbatch->folios[i]))
+				dax_delete_mapping_entry(mapping, indices[i]);
+		}
+		goto out;
 	}
 
-	for (i = j; i < folio_batch_count(fbatch); i++) {
-		struct folio *folio = fbatch->folios[i];
-		pgoff_t index = indices[i];
-
-		if (!xa_is_value(folio)) {
-			fbatch->folios[j++] = folio;
-			continue;
-		}
+	xas_set(&xas, indices[j]);
+	xas_set_update(&xas, workingset_update_node);
 
-		if (unlikely(dax)) {
-			dax_delete_mapping_entry(mapping, index);
-			continue;
-		}
+	spin_lock(&mapping->host->i_lock);
+	xas_lock_irq(&xas);
 
-		__clear_shadow_entry(mapping, index, folio);
+	xas_for_each(&xas, folio, indices[nr-1]) {
+		if (xa_is_value(folio))
+			xas_store(&xas, NULL);
 	}
 
-	if (!dax) {
-		xa_unlock_irq(&mapping->i_pages);
-		if (mapping_shrinkable(mapping))
-			inode_add_lru(mapping->host);
-		spin_unlock(&mapping->host->i_lock);
-	}
-	fbatch->nr = j;
+	xas_unlock_irq(&xas);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
+out:
+	folio_batch_remove_exceptionals(fbatch);
 }
 
 /**
-- 
2.43.5
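The comment added in the patch above documents an ordering assumption
that the single-pass walk depends on. Purely as an illustration (this
helper is hypothetical and not part of the patch), a paranoid caller
could assert it like this:

	/*
	 * Hypothetical debug check: the single traversal relies on
	 * indices[] being sorted ascending, which find_get_entries()
	 * and find_lock_entries() already guarantee.
	 */
	static inline void check_indices_ascending(pgoff_t *indices, int nr)
	{
		int i;

		for (i = 1; i < nr; i++)
			VM_WARN_ON_ONCE(indices[i - 1] >= indices[i]);
	}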
From nobody Fri Nov 29 00:58:47 2024
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Andrew Morton
Cc: Johannes Weiner, Matthew Wilcox, Omar Sandoval, Chris Mason,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Meta kernel team
Subject: [PATCH v2 2/2] mm: optimize invalidation of shadow entries
Date: Wed, 25 Sep 2024 15:47:16 -0700
Message-ID: <20240925224716.2904498-3-shakeel.butt@linux.dev>
In-Reply-To: <20240925224716.2904498-1-shakeel.butt@linux.dev>
References: <20240925224716.2904498-1-shakeel.butt@linux.dev>

The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache
tree for each individual entry to remove them. This patch optimizes
that by removing them in a single tree traversal.

To evaluate the changes, we created a 200GiB file on a fuse filesystem
inside a memcg, generated shadow entries by triggering reclaim through
memory.reclaim in that memcg, and measured a simple fadvise(DONTNEED)
operation.

 # time xfs_io -c 'fadvise -d 0 ${file_size}' file

              time (sec)
 Without      5.12 +- 0.061
 With-patch   4.19 +- 0.086 (18.16% decrease)
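For context, the measured operation boils down to a single fadvise
call. A minimal userspace sketch (illustrative only, not part of the
patch; the file path is a placeholder):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
		const char *path = argc > 1 ? argv[1] : "file";
		struct stat st;
		int err, fd = open(path, O_RDONLY);

		if (fd < 0 || fstat(fd, &st) < 0) {
			perror(path);
			return 1;
		}
		/* Same effect as: xfs_io -c 'fadvise -d 0 <size>' <path> */
		err = posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
		if (err)
			fprintf(stderr, "fadvise: %s\n", strerror(err));
		return err ? 1 : 0;
	}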
Signed-off-by: Shakeel Butt
---
Changes since v1:
- N/A

 mm/truncate.c | 46 ++++++++++++++++++----------------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 1d51c023d9c5..520c8cf8f58f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -23,42 +23,28 @@
 #include <linux/rmap.h>
 #include "internal.h"
 
-/*
- * Regular page slots are stabilized by the page lock even without the tree
- * itself locked. These unlocked entries need verification under the tree
- * lock.
- */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-
-	xas_set_update(&xas, workingset_update_node);
-	if (xas_load(&xas) != entry)
-		return;
-	xas_store(&xas, NULL);
-}
-
 static void clear_shadow_entries(struct address_space *mapping,
-		struct folio_batch *fbatch, pgoff_t *indices)
+		unsigned long start, unsigned long max)
 {
-	int i;
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct folio *folio;
 
 	/* Handled by shmem itself, or for DAX we do nothing. */
 	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return;
 
-	spin_lock(&mapping->host->i_lock);
-	xa_lock_irq(&mapping->i_pages);
+	xas_set_update(&xas, workingset_update_node);
 
-	for (i = 0; i < folio_batch_count(fbatch); i++) {
-		struct folio *folio = fbatch->folios[i];
+	spin_lock(&mapping->host->i_lock);
+	xas_lock_irq(&xas);
 
+	/* Clear all shadow entries from start to max */
+	xas_for_each(&xas, folio, max) {
 		if (xa_is_value(folio))
-			__clear_shadow_entry(mapping, indices[i], folio);
+			xas_store(&xas, NULL);
 	}
 
-	xa_unlock_irq(&mapping->i_pages);
+	xas_unlock_irq(&xas);
 	if (mapping_shrinkable(mapping))
 		inode_add_lru(mapping->host);
 	spin_unlock(&mapping->host->i_lock);
@@ -481,7 +467,9 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 
 	folio_batch_init(&fbatch);
 	while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+		int nr = folio_batch_count(&fbatch);
+
+		for (i = 0; i < nr; i++) {
 			struct folio *folio = fbatch.folios[i];
 
 			/* We rely upon deletion not changing folio->index */
@@ -508,7 +496,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 		}
 
 		if (xa_has_values)
-			clear_shadow_entries(mapping, &fbatch, indices);
+			clear_shadow_entries(mapping, indices[0], indices[nr-1]);
 
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
@@ -612,7 +600,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	folio_batch_init(&fbatch);
 	index = start;
 	while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+		int nr = folio_batch_count(&fbatch);
+
+		for (i = 0; i < nr; i++) {
 			struct folio *folio = fbatch.folios[i];
 
 			/* We rely upon deletion not changing folio->index */
@@ -658,7 +648,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 
 		if (xa_has_values)
-			clear_shadow_entries(mapping, &fbatch, indices);
+			clear_shadow_entries(mapping, indices[0], indices[nr-1]);
 
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-- 
2.43.5
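To summarize the new calling convention (an illustrative sketch, not a
verbatim kernel excerpt): both invalidation paths above now capture the
batch size up front and hand clear_shadow_entries() only the range
bounds, relying on indices[] being sorted in ascending order.

	/*
	 * Illustrative helper: how a caller drives the reworked
	 * clear_shadow_entries() for one folio_batch.
	 */
	static void drop_batch_shadows(struct address_space *mapping,
				       struct folio_batch *fbatch,
				       pgoff_t *indices, bool xa_has_values)
	{
		int nr = folio_batch_count(fbatch);

		if (xa_has_values && nr)
			clear_shadow_entries(mapping, indices[0],
					     indices[nr - 1]);

		folio_batch_remove_exceptionals(fbatch);
	}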