From nobody Tue Sep 9 18:11:27 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7EECC001E0 for ; Wed, 26 Jul 2023 08:24:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233221AbjGZIYk (ORCPT ); Wed, 26 Jul 2023 04:24:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35020 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232091AbjGZIX2 (ORCPT ); Wed, 26 Jul 2023 04:23:28 -0400 Received: from mail-pj1-x1034.google.com (mail-pj1-x1034.google.com [IPv6:2607:f8b0:4864:20::1034]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BDBEE72BC for ; Wed, 26 Jul 2023 01:10:52 -0700 (PDT) Received: by mail-pj1-x1034.google.com with SMTP id 98e67ed59e1d1-2680a031283so1983666a91.3 for ; Wed, 26 Jul 2023 01:10:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1690359052; x=1690963852; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=AgnuXatC2qbxaqbZ65LcEyHnR8tMExq1rwNTS9IK9fs=; b=kHc9/HMqaF9R8G/rXM59e0F3CbnZX/wdeNMwFdC4XbxVCwGzgEppqL5qnyB3vNmNVg c8cJwfifm5dQn0XdItECcYmL7213bc+olR8WIHI+oZFMOk9CMsY7zzqyGkHUyoCUbS5O dEHqxuqR01Hr6wkpXFSSeIiFRl0E2XvoCppdLpSz0jYGs9bX0Vjkjgcf/CoG7h0V+GT0 hgH6/mckjlLMtGPSbuo3RnGG1oclyUlcIARBc+DEe60f642YycPhYk3JYCpHJAImiCtD q04bJ/2nn8P7ybfvGd8hRKKdkHJCH61syEN3XyDj2vXfd5qQFV1ojWAT1xwPxp7AAMlx S/qw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690359052; x=1690963852; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=AgnuXatC2qbxaqbZ65LcEyHnR8tMExq1rwNTS9IK9fs=; b=E1CyTe1fmk5AAR8lwHcG4JV4ACgIkyI7hSfFI8kQHbszlchlMFHzU5Wk1b2Qb0+qsU yp9/dqBr+QRMRj5tiU4tHyKzjiQHky5F67Cs7C3As4inKRm6/gT8IHIdgBZlBdnVt6D0 hjpEZU0eAlNvaafQ11GLrrsMY4JxaVR8Png3qb8SaOSUk7CXBTspDIyRSaSSfWt3I54j nrL6SwnNcYVcwfN0CXF6ZoS70/T49Byx379U/DUIaHCWyFJrZvXHMQ5EZHmpimpN4F0M DJg1vuzFNkhEL1EsvUX7DitzjZ8qk4vPnBbNmzO+XYMVeTcNLrON/NKWTVpFxPO7foLM 97tQ== X-Gm-Message-State: ABy/qLYz8t0jJ1apN9yo2XgVzsDYRr3UWcBHEL6g607A071Mb9YV0PFr GdCqaIX0Lk/GDnCuYaDWiuYyrQ== X-Google-Smtp-Source: APBJJlE/k5FlCPp9wjkDmS2lNntkmIxAcglpNb71TQX77k742PkAt1Hf8XjFuPAOgyeUv3smQ4YSOQ== X-Received: by 2002:a17:90a:6344:b0:263:e423:5939 with SMTP id v4-20020a17090a634400b00263e4235939mr1066813pjs.28.1690359052249; Wed, 26 Jul 2023 01:10:52 -0700 (PDT) Received: from GL4FX4PXWL.bytedance.net ([203.208.167.147]) by smtp.gmail.com with ESMTPSA id gc17-20020a17090b311100b002680b2d2ab6sm756540pjb.19.2023.07.26.01.10.47 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Wed, 26 Jul 2023 01:10:52 -0700 (PDT) From: Peng Zhang To: Liam.Howlett@oracle.com, corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org, brauner@kernel.org, surenb@google.com, michael.christie@oracle.com, peterz@infradead.org, mathieu.desnoyers@efficios.com, npiggin@gmail.com, avagin@gmail.com Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Peng Zhang Subject: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Date: Wed, 26 Jul 2023 16:09:16 +0800 Message-Id: <20230726080916.17454-12-zhangpeng.00@bytedance.com> X-Mailer: git-send-email 2.37.0 (Apple Git-136) In-Reply-To: <20230726080916.17454-1-zhangpeng.00@bytedance.com> References: <20230726080916.17454-1-zhangpeng.00@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then directly modify the entries of VMAs in the new maple tree, which can get better performance. dup_mmap() is used by fork(), so this patch optimizes fork(). The optimization effect is proportional to the number of VMAs. Due to the introduction of this method, the optimization in (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer has an effect here, but it is also an optimization of the maple tree. There is a unixbench test suite[2] where 'spawn' is used to test fork(). 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a bit to use mmap() to control the number of VMAs. Therefore, the performance under different numbers of VMAs can be measured. Insert code like below into 'spawn': for (int i =3D 0; i < 200; ++i) { size_t size =3D 10 * getpagesize(); void *addr; if (i & 1) { addr =3D mmap(NULL, size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); } else { addr =3D mmap(NULL, size, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); } if (addr =3D=3D MAP_FAILED) ... } Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test 4 times in 30 seconds each time, and get the following numbers. These numbers are the number of fork() successes in 30s (average of the best 3 out of 4). By the way, based on next-20230725, I reverted [1], and tested it together as a comparison. In order to ensure the reliability of the test results, these tests were run on a physical machine. 23VMAs 223VMAs 4023VMAs revert [1]: 159104.00 73316.33 6787.00 +0.77% +0.42% +0.28% next-20230721: 160321.67 73624.67 6806.33 +2.77% +15.42% +29.86% apply this: 164751.67 84980.33 8838.67 It can be seen that the performance improvement is proportional to the number of VMAs. With 23 VMAs, performance improves by about 3%, with 223 VMAs, performance improves by about 15%, and with 4023 VMAs, performance improves by about 30%. [1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@byteda= nce.com/ [2] https://github.com/kdlucas/byte-unixbench/tree/master Signed-off-by: Peng Zhang --- kernel/fork.c | 35 +++++++++++++++++++++++++++-------- mm/mmap.c | 14 ++++++++++++-- 2 files changed, 39 insertions(+), 10 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index f81149739eb9..ef80025b62d6 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *= mm, int retval; unsigned long charge =3D 0; LIST_HEAD(uf); - VMA_ITERATOR(old_vmi, oldmm, 0); VMA_ITERATOR(vmi, mm, 0); =20 uprobe_start_dup_mmap(); @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct= *mm, goto out; khugepaged_fork(mm, oldmm); =20 - retval =3D vma_iter_bulk_alloc(&vmi, oldmm->map_count); - if (retval) + /* Use __mt_dup() to efficiently build an identical maple tree. */ + retval =3D __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN); + if (unlikely(retval)) goto out; =20 mt_clear_in_rcu(vmi.mas.tree); - for_each_vma(old_vmi, mpnt) { + for_each_vma(vmi, mpnt) { struct file *file; =20 vma_start_write(mpnt); if (mpnt->vm_flags & VM_DONTCOPY) { vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); + + /* + * Since the new tree is exactly the same as the old one, + * we need to remove the unneeded VMAs. + */ + mas_store(&vmi.mas, NULL); + + /* + * Even removing an entry may require memory allocation, + * and if removal fails, we use XA_ZERO_ENTRY to mark + * from which VMA it failed. The case of encountering + * XA_ZERO_ENTRY will be handled in exit_mmap(). + */ + if (unlikely(mas_is_err(&vmi.mas))) { + retval =3D xa_err(vmi.mas.node); + mas_reset(&vmi.mas); + if (mas_find(&vmi.mas, ULONG_MAX)) + mas_replace_entry(&vmi.mas, + XA_ZERO_ENTRY); + goto loop_out; + } + continue; } charge =3D 0; @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *= mm, hugetlb_dup_vma_private(tmp); =20 /* Link the vma into the MT */ - if (vma_iter_bulk_store(&vmi, tmp)) - goto fail_nomem_vmi_store; + mas_replace_entry(&vmi.mas, tmp); =20 mm->map_count++; if (!(tmp->vm_flags & VM_WIPEONFORK)) @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *= mm, uprobe_end_dup_mmap(); return retval; =20 -fail_nomem_vmi_store: - unlink_anon_vmas(tmp); fail_nomem_anon_vma_fork: mpol_put(vma_policy(tmp)); fail_nomem_policy: diff --git a/mm/mmap.c b/mm/mmap.c index bc91d91261ab..5bfba2fb0e39 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm) arch_exit_mmap(mm); =20 vma =3D mas_find(&mas, ULONG_MAX); - if (!vma) { + /* + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY, + * xa_is_zero(vma) may be true. + */ + if (!vma || xa_is_zero(vma)) { /* Can happen if dup_mmap() received an OOM */ mmap_read_unlock(mm); return; @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm) remove_vma(vma, true); count++; cond_resched(); - } while ((vma =3D mas_find(&mas, ULONG_MAX)) !=3D NULL); + vma =3D mas_find(&mas, ULONG_MAX); + /* + * If xa_is_zero(vma) is true, it means that subsequent VMAs + * donot need to be removed. Can happen if dup_mmap() fails to + * remove a VMA marked VM_DONTCOPY. + */ + } while (vma !=3D NULL && !xa_is_zero(vma)); =20 BUG_ON(count !=3D mm->map_count); =20 --=20 2.20.1