From: Chih-En Lin
To: Andrew Morton, linux-mm@kvack.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Christian Brauner,
    "Matthew Wilcox (Oracle)", Vlastimil Babka, William Kucharski,
    John Hubbard, Yunsheng Lin, Arnd Bergmann, Suren Baghdasaryan,
    Chih-En Lin, Colin Cross, Feng Tang, "Eric W. Biederman",
    Mike Rapoport, Geert Uytterhoeven, Anshuman Khandual,
    "Aneesh Kumar K.V", Daniel Axtens, Jonathan Marek,
    Christophe Leroy, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
    Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
    Fenghua Yu, David Hildenbrand, linux-kernel@vger.kernel.org,
    Kaiyang Zhao, Huichun Feng, Jim Huang
Subject: [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table
Date: Fri, 20 May 2022 02:31:22 +0800
Message-Id: <20220519183127.3909598-2-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>
References: <20220519183127.3909598-1-shiyn.lin@gmail.com>

Add the MMF_COW_PGTABLE flag to prepare for the subsequent
implementation of copy-on-write page tables.

Signed-off-by: Chih-En Lin
---
 include/linux/sched/coredump.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d9e3a656875..19e9f2b71398 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -83,7 +83,10 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
+#define MMF_COW_PGTABLE		29
+#define MMF_COW_PGTABLE_MASK	(1 << MMF_COW_PGTABLE)
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_COW_PGTABLE_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
-- 
2.36.1

From: Chih-En Lin
Subject: [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag
Date: Fri, 20 May 2022 02:31:23 +0800
Message-Id: <20220519183127.3909598-3-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>

Add the CLONE_COW_PGTABLE flag to the clone3() system call to enable
the Copy-On-Write (COW) mechanism on the page table.
Signed-off-by: Chih-En Lin
---
 include/uapi/linux/sched.h | 1 +
 kernel/fork.c              | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..3b92ff589e0f 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
 /* Flags for the clone3() syscall. */
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_COW_PGTABLE 0x400000000ULL /* Copy-On-Write for page table */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index 35a3beff140b..08cf95201333 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2636,6 +2636,9 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		trace = 0;
 	}
 
+	if (clone_flags & CLONE_COW_PGTABLE)
+		set_bit(MMF_COW_PGTABLE, &current->mm->flags);
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
@@ -2860,7 +2863,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+	      CLONE_COW_PGTABLE))
 		return false;
 
 	/*
-- 
2.36.1

From: Chih-En Lin
Subject: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
Date: Fri, 20 May 2022 02:31:24 +0800
Message-Id: <20220519183127.3909598-4-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>

Introduce ownership for the PTE table to prepare for the following
Copy-On-Write (COW) page table patch. The address of the PMD entry is
recorded as the owner, identifying which process may update the page
table state of a COW'd PTE table.
Signed-off-by: Chih-En Lin
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  1 +
 include/linux/pgtable.h  | 14 ++++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f44254af8ce..221926a3d818 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2328,6 +2328,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 		return false;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
+	page->cow_pte_owner = NULL;
 	return true;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..5dcbd7f6c361 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,7 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+	pmd_t *cow_pte_owner;	/* cow pte: pmd */
 } _struct_page_alignment;
 
 /**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..faca57af332e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -590,6 +590,20 @@ static inline int pte_unused(pte_t pte)
 }
 #endif
 
+static inline bool set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
+{
+	struct page *page = pmd_page(*pmd);
+
+	smp_store_release(&page->cow_pte_owner, owner);
+	return true;
+}
+
+static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
+{
+	return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
+		true : false;
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
-- 
2.36.1

From: Chih-En Lin
Subject: [RFC PATCH 4/6] mm: Add COW PTE fallback function
Date: Fri, 20 May 2022 02:31:25 +0800
Message-Id: <20220519183127.3909598-5-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>

The lifetime of a COW PTE table is handled by ownership and a
reference count. When a process wants to write to a COW PTE table
whose reference count is 1, it reuses that table instead of copying
it and then freeing it. Only the owner updates the RSS state and the
record of page table bytes allocated, so we need to handle the case
where a non-owner process ends up with the fallback COW PTE table.
This commit prepares for the following implementation of the
reference count for COW PTE.
Signed-off-by: Chih-En Lin
---
 mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..dcb678cbb051 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1000,6 +1000,34 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
 	return new_page;
 }
 
+static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmdp, unsigned long addr, unsigned long end, bool inc_dec)
+{
+	int rss[NR_MM_COUNTERS];
+	pte_t *orig_ptep, *ptep;
+	struct page *page;
+
+	init_rss_vec(rss);
+
+	ptep = pte_offset_map(pmdp, addr);
+	orig_ptep = ptep;
+	arch_enter_lazy_mmu_mode();
+	do {
+		if (pte_none(*ptep) || pte_special(*ptep))
+			continue;
+
+		page = vm_normal_page(vma, addr, *ptep);
+		if (page) {
+			if (inc_dec)
+				rss[mm_counter(page)]++;
+			else
+				rss[mm_counter(page)]--;
+		}
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
+	add_mm_rss_vec(mm, rss);
+}
+
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -4554,6 +4582,44 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+/* COW PTE fallback to normal PTE:
+ * - two states here
+ *   - After break child : [parent, rss=1, ref=1, write=NO , owner=parent]
+ *                      to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *   - After break parent: [child , rss=0, ref=1, write=NO , owner=NULL  ]
+ *                      to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ */
+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+	pmd_t new;
+
+	BUG_ON(pmd_write(*pmd));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/* If pmd is not the owner, it needs to increase the rss,
+	 * since only the owner holds the RSS state for the COW PTE.
+	 */
+	if (!cow_pte_owner_is_same(pmd, pmd)) {
+		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
+		mm_inc_nr_ptes(mm);
+		smp_wmb();
+		pmd_populate(mm, pmd, pmd_page(*pmd));
+	}
+
+	/* Reuse the pte page */
+	set_cow_pte_owner(pmd, NULL);
+	new = pmd_mkwrite(*pmd);
+	set_pmd_at(mm, addr, pmd, new);
+
+	BUG_ON(!pmd_write(*pmd));
+	BUG_ON(pmd_page(*pmd)->cow_pte_owner);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
-- 
2.36.1

From: Chih-En Lin
Subject: [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
Date: Fri, 20 May 2022 02:31:26 +0800
Message-Id: <20220519183127.3909598-6-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>

Add the reference counter cow_pgtable_refcount to track the number of
processes referencing a COW PTE table. Before decreasing the reference
count, it checks whether the counter is one; if so, the COW PTE table
is reused instead of freed.
Signed-off-by: Chih-En Lin
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  1 +
 include/linux/pgtable.h  | 27 +++++++++++++++++++++++++++
 mm/memory.c              |  1 +
 4 files changed, 30 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 221926a3d818..e48bb3fbc33c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2329,6 +2329,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	page->cow_pte_owner = NULL;
+	atomic_set(&page->cow_pgtable_refcount, 1);
 	return true;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5dcbd7f6c361..984d81e47d53 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,7 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+	atomic_t cow_pgtable_refcount;	/* COW page table */
 	pmd_t *cow_pte_owner;	/* cow pte: pmd */
 } _struct_page_alignment;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index faca57af332e..33c01fec7b92 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -604,6 +604,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
 		true : false;
 }
 
+extern void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr);
+
+static inline int pmd_get_pte(pmd_t *pmd)
+{
+	return atomic_inc_return(&pmd_page(*pmd)->cow_pgtable_refcount);
+}
+
+/* If the COW PTE page->cow_pgtable_refcount is 1, instead of decreasing the
+ * counter, clear write protection of the corresponding PMD entry and reset
+ * the COW PTE owner to reuse the table.
+ */
+static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	if (!atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1)) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int cow_pte_refcount_read(pmd_t *pmd)
+{
+	return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/memory.c b/mm/memory.c
index dcb678cbb051..aa66af76e214 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4597,6 +4597,7 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t new;
 
 	BUG_ON(pmd_write(*pmd));
+	BUG_ON(cow_pte_refcount_read(pmd) != 1);
 
 	start = addr & PMD_MASK;
 	end = (addr + PMD_SIZE) & PMD_MASK;
-- 
2.36.1

From: Chih-En Lin
Subject: [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table
Date: Fri, 20 May 2022 02:31:27 +0800
Message-Id: <20220519183127.3909598-7-shiyn.lin@gmail.com>
In-Reply-To: <20220519183127.3909598-1-shiyn.lin@gmail.com>

This patch adds the Copy-On-Write (COW) mechanism to the PTE table. To
enable the COW page table, use the clone3() system call with the
CLONE_COW_PGTABLE flag, which sets the MMF_COW_PGTABLE flag on the
process. That flag distinguishes the default page table from the COW
one. Moreover, because it is difficult to tell when the entire page
table has left the COW state, the MMF_COW_PGTABLE flag is never
cleared once set.

Since the kernel-space page tables are distinct for each process, the
address of the PMD entry is used as the ownership token of the PTE
table, identifying which process needs to update the page table
state. In other words, only the owner updates COW PTE state such as
the RSS and pgtable_bytes.

The reference count controls the lifetime of the COW PTE table. When
someone breaks COW, it copies the COW PTE table and decreases the
reference count.
But if the reference count is already one before the COW break, it will
reuse the COW PTE table directly.

This patch modifies the page table copying path to do the basic COW.
For breaking COW, it modifies the page fault, page table zapping,
unmapping, and remapping paths.

Signed-off-by: Chih-En Lin
---
 include/linux/pgtable.h |   3 +
 mm/memory.c             | 262 ++++++++++++++++++++++++++++++++++++----
 mm/mmap.c               |   4 +
 mm/mremap.c             |   5 +
 4 files changed, 251 insertions(+), 23 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 33c01fec7b92..357ce3722ee8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -631,6 +631,9 @@ static inline int cow_pte_refcount_read(pmd_t *pmd)
 	return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
 }
 
+extern int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, bool alloc);
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/memory.c b/mm/memory.c
index aa66af76e214..ff3fcbe4dfb5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -247,6 +247,8 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		BUG_ON(cow_pte_refcount_read(pmd) != 1);
+		BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
 		free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
@@ -1031,7 +1033,7 @@ static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-	       unsigned long end)
+	       unsigned long end, bool is_src_pte_locked)
 {
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
 	struct mm_struct *src_mm = src_vma->vm_mm;
@@ -1053,8 +1055,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		goto out;
 	}
 	src_pte = pte_offset_map(src_pmd, addr);
-	src_ptl = pte_lockptr(src_mm, src_pmd);
-	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+	if (!is_src_pte_locked) {
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+	}
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
 	arch_enter_lazy_mmu_mode();
@@ -1067,7 +1071,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		if (progress >= 32) {
 			progress = 0;
 			if (need_resched() ||
-			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
+			    (!is_src_pte_locked && spin_needbreak(src_ptl)) ||
+			    spin_needbreak(dst_ptl))
 				break;
 		}
 		if (pte_none(*src_pte)) {
@@ -1118,7 +1123,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
-	spin_unlock(src_ptl);
+	if (!is_src_pte_locked)
+		spin_unlock(src_ptl);
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
@@ -1180,11 +1186,55 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			continue;
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
-		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
+
+		if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
+
+			if (pmd_none(*src_pmd))
+				continue;
+
+			/* XXX: Skip if the PTE already COW this time. */
+			if (!pmd_none(*dst_pmd) &&
+			    cow_pte_refcount_read(src_pmd) > 1)
+				continue;
+
+			/* If PTE doesn't have an owner, the parent needs to
+			 * take this PTE.
+			 */
+			if (cow_pte_owner_is_same(src_pmd, NULL)) {
+				set_cow_pte_owner(src_pmd, src_pmd);
+				/* XXX: The process may COW PTE fork two times.
+				 * But in some situations, owner has cleared.
+				 * Previously Child (This time is the parent)
+				 * COW PTE forking, but previously parent, owner,
+				 * break COW. So it needs to add back the RSS
+				 * state and pgtable bytes.
+				 */
+				if (!pmd_write(*src_pmd)) {
+					unsigned long pte_start =
+						addr & PMD_MASK;
+					unsigned long pte_end =
+						(addr + PMD_SIZE) & PMD_MASK;
+					cow_pte_rss(src_mm, src_vma, src_pmd,
+						pte_start, pte_end, true /* inc */);
+					mm_inc_nr_ptes(src_mm);
+					smp_wmb();
+					pmd_populate(src_mm, src_pmd,
+						pmd_page(*src_pmd));
+				}
+			}
+
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+
+			/* Child reference count */
+			pmd_get_pte(src_pmd);
+
+			/* COW for PTE table */
+			set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
+		} else if (!pmd_none_or_clear_bad(src_pmd) &&
+			   copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+					  addr, next, false)) {
 			return -ENOMEM;
+		}
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1336,6 +1386,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool cow_pte;			/* Do not free COW PTE */
 };
 
 /* Whether we should zap all COWed (private) pages too */
@@ -1398,8 +1449,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+			if (!details || !details->cow_pte)
+				ptent = ptep_get_and_clear_full(mm, addr, pte,
+								tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1413,8 +1465,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
+			if (!details || !details->cow_pte) {
+				rss[mm_counter(page)]--;
+				page_remove_rmap(page, vma, false);
+			} else
+				continue;
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1425,6 +1480,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
+		// TODO: Deal COW PTE with swap
+
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
@@ -1513,16 +1570,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			spin_unlock(ptl);
 		}
 
-		/*
-		 * Here there can be other concurrent MADV_DONTNEED or
-		 * trans huge page faults running, and if the pmd is
-		 * none or trans huge it can change under us. This is
-		 * because MADV_DONTNEED holds the mmap_lock in read
-		 * mode.
-		 */
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			goto next;
-		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+
+		if (test_bit(MMF_COW_PGTABLE, &tlb->mm->flags) &&
+		    !pmd_none(*pmd) && !pmd_write(*pmd)) {
+			struct zap_details cow_pte_details = {0};
+			if (details)
+				cow_pte_details = *details;
+			cow_pte_details.cow_pte = true;
+			/* Flush the TLB but do not free the COW PTE */
+			next = zap_pte_range(tlb, vma, pmd, addr,
+					     next, &cow_pte_details);
+			if (details)
+				*details = cow_pte_details;
+			handle_cow_pte(vma, pmd, addr, false);
+		} else {
+			if (details)
+				details->cow_pte = false;
+			/*
+			 * Here there can be other concurrent MADV_DONTNEED or
+			 * trans huge page faults running, and if the pmd is
+			 * none or trans huge it can change under us. This is
+			 * because MADV_DONTNEED holds the mmap_lock in read
+			 * mode.
+			 */
+			if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+				goto next;
+			next = zap_pte_range(tlb, vma, pmd, addr, next,
+					     details);
+		}
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -4621,6 +4696,134 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	BUG_ON(pmd_page(*pmd)->cow_pte_owner);
 }
 
+/* Break COW PTE:
+ * - two state here
+ *   - After fork :
+ *        [parent, rss=1, ref=2, write=NO , owner=parent]
+ *     to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *        COW PTE become [ref=1, write=NO , owner=NULL  ]
+ *        [child , rss=0, ref=2, write=NO , owner=parent]
+ *     to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ *        COW PTE become [ref=1, write=NO , owner=parent]
+ * NOTE
+ * - Copy the COW PTE to new PTE.
+ * - Clear the owner of COW PTE and set PMD entry writable when it is owner.
+ * - Increase RSS if it is not owner.
+ */
+static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+	pmd_t cowed_entry = *pmd;
+
+	if (cow_pte_refcount_read(&cowed_entry) == 1) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+
+	BUG_ON(pmd_write(cowed_entry));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	pmd_clear(pmd);
+	if (copy_pte_range(vma, vma, pmd, &cowed_entry,
+			   start, end, true))
+		return -ENOMEM;
+
+	/* Here, it is the owner, so clear the ownership. To keep RSS state and
+	 * page table bytes correct, it needs to decrease them.
+	 */
+	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
+		set_cow_pte_owner(&cowed_entry, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_put_pte(vma, &cowed_entry, addr);
+
+	BUG_ON(!pmd_write(*pmd));
+	BUG_ON(cow_pte_refcount_read(pmd) != 1);
+
+	return 0;
+}
+
+static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+
+	if (pmd_put_pte(vma, pmd, addr)) {
+		// fallback
+		return 1;
+	}
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/* If PMD entry is owner, clear the ownership, and decrease RSS state
+	 * and pgtable_bytes.
+	 */
+	if (cow_pte_owner_is_same(pmd, pmd)) {
+		set_cow_pte_owner(pmd, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_clear(pmd);
+	return 0;
+}
+
+/* If alloc set means it won't break COW. For this case, it will just decrease
+ * the reference count. The address needs to be at the beginning of the PTE page
+ * since COW PTE is copy-on-write the entire PTE.
+ * If pmd is NULL, it will get the pmd from vma and check it is cowing.
+ */
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, bool alloc)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+	spinlock_t *ptl = NULL;
+
+	if (!pmd) {
+		pgd = pgd_offset(mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return 0;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return 0;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return 0;
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd) || pmd_write(*pmd))
+			return 0;
+	}
+
+	// TODO: handle COW PTE with swap
+	BUG_ON(is_swap_pmd(*pmd));
+	BUG_ON(pmd_trans_huge(*pmd));
+	BUG_ON(pmd_devmap(*pmd));
+
+	BUG_ON(pmd_none(*pmd));
+	BUG_ON(pmd_write(*pmd));
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (!alloc)
+		ret = zap_cow_pte(vma, pmd, addr);
+	else
+		ret = break_cow_pte(vma, pmd, addr);
+	spin_unlock(ptl);
+
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4825,6 +5028,19 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 			}
 		}
+
+		/* When the PMD entry is set with write protection, it needs to
+		 * handle the on-demand PTE. It will allocate a new PTE and copy
+		 * the old one, then set this entry writeable and decrease the
+		 * reference count at COW PTE.
+		 */
+		if (test_bit(MMF_COW_PGTABLE, &mm->flags) &&
+		    !pmd_none(vmf.orig_pmd) && !pmd_write(vmf.orig_pmd)) {
+			if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
+			    (cow_pte_refcount_read(&vmf.orig_pmd) > 1) ?
+			     true : false) < 0)
+				return VM_FAULT_OOM;
+		}
 	}
 
 	return handle_pte_fault(&vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index 313b57d55a63..e3a9c38e87e8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2709,6 +2709,10 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 			return err;
 	}
 
+	if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
+	    handle_cow_pte(vma, NULL, addr, true) < 0)
+		return -ENOMEM;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index 303d3290b938..01aefdfc61b7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,6 +532,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+
+		if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
+		    !pmd_none(*old_pmd) && !pmd_write(*old_pmd))
+			handle_cow_pte(vma, old_pmd, old_addr, true);
+
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-- 
2.36.1