From nobody Mon Apr  6 11:51:38 2026
From: Chih-En Lin
To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox, Christophe Leroy
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Luis Chamberlain, Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski, "Kirill A. Shutemov", Peter Xu, Suren Baghdasaryan, Arnd Bergmann, Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit, Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin
Subject: [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table
Date: Wed, 28 Sep 2022 00:29:49 +0800
Message-Id: <20220927162957.270460-2-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>
References: <20220927162957.270460-1-shiyn.lin@gmail.com>

Add the MMF_COW_PTE and MMF_COW_PTE_READY flags to prepare for the
subsequent implementation of Copy-On-Write for the page table.

Signed-off-by: Chih-En Lin
---
 include/linux/sched/coredump.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d0a5be28b70f..f03ff69c90c8c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -84,7 +84,13 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)

+#define MMF_COW_PTE_READY	29
+#define MMF_COW_PTE_READY_MASK	(1 << MMF_COW_PTE_READY)
+
+#define MMF_COW_PTE		30
+#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_COW_PTE_MASK)

 #endif /* _LINUX_SCHED_COREDUMP_H */
-- 
2.37.3
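For reference, the two new flags follow the existing MMF_* convention: a bit number plus a `(1 << bit)` mask in `mm->flags`. A minimal userspace sketch (plain bit operations standing in for the kernel's `set_bit()`/`test_bit()`; not part of the patch):

```c
#include <assert.h>

/* Bit numbers from the patch; each mask is (1 << bit), matching the
 * MMF_* convention in coredump.h. */
#define MMF_COW_PTE_READY      29
#define MMF_COW_PTE_READY_MASK (1UL << MMF_COW_PTE_READY)
#define MMF_COW_PTE            30
#define MMF_COW_PTE_MASK       (1UL << MMF_COW_PTE)

/* Userspace stand-in for test_bit() on an mm->flags word. */
static inline int mmf_test(unsigned long flags, unsigned long mask)
{
	return (flags & mask) != 0;
}
```

Because the two flags occupy distinct bits, a task can be marked "ready" without yet being in the active COW PTE state.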
From: Chih-En Lin
Subject: [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
Date: Wed, 28 Sep 2022 00:29:50 +0800
Message-Id: <20220927162957.270460-3-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>

Add a new sysctl, vm.cow_pte, that sets the MMF_COW_PTE_READY flag to
enable copy-on-write (COW) of the PTE page table at the next fork.
Since there is a time gap between using the sysctl to enable COW PTE and
doing the fork, we use two states to distinguish a task that wants to do
COW PTE from one that is already doing it.

Signed-off-by: Chih-En Lin
---
 include/linux/pgtable.h |  6 ++++++
 kernel/fork.c           |  5 +++++
 kernel/sysctl.c         |  8 ++++++++
 mm/Makefile             |  2 +-
 mm/cow_pte.c            | 39 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 mm/cow_pte.c

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 014ee8f0fbaab..d03d01aefe989 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -937,6 +937,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
+		    size_t *lenp, loff_t *ppos);
+
+extern int sysctl_cow_pte_pid;
+
 #endif /* CONFIG_MMU */

 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 8a9e92068b150..6981944a7c6ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2671,6 +2671,11 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		trace = 0;
 	}

+	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+		set_bit(MMF_COW_PTE, &current->mm->flags);
+	}
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 205d605cacc5b..c4f54412ae3a9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2360,6 +2360,14 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= mmap_min_addr_handler,
 	},
+	{
+		.procname	= "cow_pte",
+		.data		= &sysctl_cow_pte_pid,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= cow_pte_handler,
+		.extra1		= SYSCTL_ZERO,
+	},
 #endif
 #ifdef CONFIG_NUMA
 	{
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f8364035..7a568d5066ee6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,7 +40,7 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
+			   pgtable-generic.o rmap.o vmalloc.o cow_pte.o


 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/cow_pte.c b/mm/cow_pte.c
new file mode 100644
index 0000000000000..4e50aa4294ce7
--- /dev/null
+++ b/mm/cow_pte.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+/* sysctl will write to this variable */
+int sysctl_cow_pte_pid = -1;
+
+static void set_cow_pte_task(void)
+{
+	struct pid *pid;
+	struct task_struct *task;
+
+	pid = find_get_pid(sysctl_cow_pte_pid);
+	if (!pid) {
+		pr_info("pid %d does not exist\n", sysctl_cow_pte_pid);
+		sysctl_cow_pte_pid = -1;
+		return;
+	}
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!test_bit(MMF_COW_PTE, &task->mm->flags))
+		set_bit(MMF_COW_PTE_READY, &task->mm->flags);
+	sysctl_cow_pte_pid = -1;
+}
+
+int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
+		    size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (write && !ret)
+		set_cow_pte_task();
+
+	return ret;
+}
-- 
2.37.3
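The kernel_clone() hunk above is what turns the "ready" state into the active state at the next fork. A userspace sketch of that transition (the helper name is ours, not the kernel's):

```c
#include <assert.h>

#define MMF_COW_PTE_READY 29
#define MMF_COW_PTE       30

/* Userspace mimic of the kernel_clone() hunk: if the task was marked
 * READY by the sysctl, clear READY and set MMF_COW_PTE so that this
 * fork actually shares the PTE tables. Later forks see MMF_COW_PTE
 * already set and leave the flags unchanged. */
static void cow_pte_transition_on_fork(unsigned long *flags)
{
	if (*flags & (1UL << MMF_COW_PTE_READY)) {
		*flags &= ~(1UL << MMF_COW_PTE_READY);
		*flags |= 1UL << MMF_COW_PTE;
	}
}
```

This is why the time gap mentioned in the commit message is harmless: the sysctl only arms the task, and the state change happens atomically with respect to the fork path itself.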
From: Chih-En Lin
Subject: [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table
Date: Wed, 28 Sep 2022 00:29:51 +0800
Message-Id: <20220927162957.270460-4-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>

Introduce ownership for the PTE table. The address of the PMD entry is
used to track the owner, so we can identify which process may update its
page table state from the COWed PTE table.
Signed-off-by: Chih-En Lin
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  5 ++++-
 include/linux/pgtable.h  | 10 ++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd3..965523dcca3b8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2289,6 +2289,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 		return false;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
+	page->cow_pte_owner = NULL;
 	return true;
 }

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf97f3884fda2..42798b59cec4e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,7 +152,10 @@ struct page {
 			struct list_head deferred_list;
 		};
 		struct {	/* Page table pages */
-			unsigned long _pt_pad_1;	/* compound_head */
+			union {
+				unsigned long _pt_pad_1;	/* compound_head */
+				pmd_t *cow_pte_owner;		/* cow pte: pmd */
+			};
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d03d01aefe989..9dca787a3f4dd 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -615,6 +615,16 @@ static inline int pte_unused(pte_t pte)
 }
 #endif

+static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
+{
+	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
+}
+
+static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
+{
+	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
-- 
2.37.3
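The owner field is published with release semantics and read with acquire semantics. A userspace analogue using C11 atomics as stand-ins for smp_store_release()/smp_load_acquire() (a sketch, not kernel code; the field here is a plain global rather than a struct page member):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the page's cow_pte_owner field. The release/acquire
 * pair guarantees that once a reader observes the owner pointer, it
 * also observes everything written before the owner was published. */
static _Atomic(void *) cow_pte_owner;

static void set_owner(void *owner)
{
	atomic_store_explicit(&cow_pte_owner, owner, memory_order_release);
}

static int owner_is_same(void *owner)
{
	return atomic_load_explicit(&cow_pte_owner, memory_order_acquire)
	       == owner;
}
```

In the patch, the owner token is the address of the PMD entry itself, which is why `cow_pte_owner_is_same(pmd, pmd)` reads naturally as "am I the owner of this table?".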
From: Chih-En Lin
Subject: [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions
Date: Wed, 28 Sep 2022 00:29:52 +0800
Message-Id: <20220927162957.270460-5-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>

The lifetime of a COWed PTE table is handled by a reference count. When
a process wants to write to a COWed PTE table whose refcount is one, it
reuses the shared PTE table. Since only the owner updates its page table
state, the fallback function also needs to handle the case of a
non-owner COWed PTE table falling back to a normal PTE table.

This commit prepares for the following implementation of the reference
count for COW PTE.

Signed-off-by: Chih-En Lin
---
 include/linux/pgtable.h |  3 ++
 mm/memory.c             | 93 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9dca787a3f4dd..25c1e5c42fdf3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -615,6 +615,9 @@ static inline int pte_unused(pte_t pte)
 }
 #endif

+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		      unsigned long addr);
+
 static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
 {
 	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
diff --git a/mm/memory.c b/mm/memory.c
index 4ba73f5aa8bb7..d29f84801f3cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -509,6 +509,37 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 			add_mm_counter(mm, i, rss[i]);
 }

+static void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
+			pmd_t *pmdp, unsigned long addr,
+			unsigned long end, bool inc_dec)
+{
+	int rss[NR_MM_COUNTERS];
+	spinlock_t *ptl;
+	pte_t *orig_ptep, *ptep;
+	struct page *page;
+
+	init_rss_vec(rss);
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	orig_ptep = ptep;
+	arch_enter_lazy_mmu_mode();
+	do {
+		if (pte_none(*ptep))
+			continue;
+
+		page = vm_normal_page(vma, addr, *ptep);
+		if (page) {
+			if (inc_dec)
+				rss[mm_counter(page)]++;
+			else
+				rss[mm_counter(page)]--;
+		}
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_ptep, ptl);
+	add_mm_rss_vec(mm, rss);
+}
+
 /*
  * This function is called to print an error when a bad pte
  * is found. For example, we might have a PFN-mapped pte in
@@ -2817,6 +2848,68 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_existing_page_range);

+/**
+ * cow_pte_fallback - reuse the shared PTE table
+ * @vma: vma that covers the shared PTE table
+ * @pmd: pmd index that maps to the shared PTE table
+ * @addr: the address that triggered the break COW
+ *
+ * Reuse the shared (COW) PTE table when the refcount is equal to one.
+ * @addr needs to be in the range of the shared PTE table that @vma and
+ * @pmd map to.
+ *
+ * COW PTE fallback to normal PTE:
+ * - two states here
+ * - After break child : [parent, rss=1, ref=1, write=NO , owner=parent]
+ *                    to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ * - After break parent: [child , rss=0, ref=1, write=NO , owner=NULL  ]
+ *                    to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ */
+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		      unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *prev = vma->vm_prev;
+	struct vm_area_struct *next = vma->vm_next;
+	unsigned long start, end;
+	pmd_t new;
+
+	VM_WARN_ON(pmd_write(*pmd));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/*
+	 * If pmd is not the owner, it needs to increase the rss,
+	 * since only the owner has the RSS state for the COW PTE.
+	 */
+	if (!cow_pte_owner_is_same(pmd, pmd)) {
+		/* The part of the address range covered by the previous vma. */
+		if (start < vma->vm_start && prev && start < prev->vm_end) {
+			cow_pte_rss(mm, prev, pmd,
+				    start, prev->vm_end, true /* inc */);
+			start = vma->vm_start;
+		}
+		/* The part of the address range covered by the next vma. */
+		if (end > vma->vm_end && next && end > next->vm_start) {
+			cow_pte_rss(mm, next, pmd,
+				    next->vm_start, end, true /* inc */);
+			end = vma->vm_end;
+		}
+		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
+
+		mm_inc_nr_ptes(mm);
+		/* Memory barrier here is the same as pmd_install(). */
+		smp_wmb();
+		pmd_populate(mm, pmd, pmd_page(*pmd));
+	}
+
+	/* Reuse the pte page */
+	set_cow_pte_owner(pmd, NULL);
+	new = pmd_mkwrite(*pmd);
+	set_pmd_at(mm, addr, pmd, new);
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically. Before making any commitment, on those architectures
-- 
2.37.3
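The two transitions in the kernel-doc state table can be modelled in userspace. This toy model (all names ours; the real code walks VMAs and per-counter RSS vectors) captures only the bookkeeping: the owner is cleared, the table becomes writable, and a non-owner, which carried no RSS for the shared table, gains it:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a COWed PTE table's bookkeeping state. */
struct cow_pte_table {
	int ref;	/* _refcount stand-in */
	int write;	/* PMD write permission */
	void *owner;	/* owning pmd entry, or NULL */
	int rss;	/* caller's RSS accounting for this table */
	int pages;	/* pages actually mapped by the table */
};

/* Model of cow_pte_fallback(): only reached when ref == 1. A caller
 * that is not the owner must add the table's pages to its RSS, since
 * only the owner carried the RSS state while the table was shared. */
static void fallback_model(struct cow_pte_table *t, void *me)
{
	assert(t->ref == 1);
	if (t->owner != me)
		t->rss += t->pages;
	t->owner = NULL;
	t->write = 1;
}
```

Running both rows of the state table through this model reproduces the documented before/after values: the parent (owner) keeps rss=1, while the child (non-owner) goes from rss=0 to rss=1.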
From: Chih-En Lin
Subject: [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table
Date: Wed, 28 Sep 2022 00:29:53 +0800
Message-Id: <20220927162957.270460-6-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>

Reuse the _refcount in struct page for page table pages to maintain the
number of processes referencing a COWed PTE table. Before decreasing the
refcount, check whether it is one; if so, the shared PTE table is reused
instead of being dropped.
Signed-off-by: Chih-En Lin --- include/linux/mm.h | 1 + include/linux/pgtable.h | 28 ++++++++++++++++++++++++++++ mm/memory.c | 1 + 3 files changed, 30 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 965523dcca3b8..bfe6a8c7ab9ed 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2290,6 +2290,7 @@ static inline bool pgtable_pte_page_ctor(struct page = *page) __SetPageTable(page); inc_lruvec_page_state(page, NR_PAGETABLE); page->cow_pte_owner =3D NULL; + set_page_count(page, 1); return true; } =20 diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 25c1e5c42fdf3..8b497d7d800ed 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -9,6 +9,7 @@ #ifdef CONFIG_MMU =20 #include +#include #include #include #include @@ -628,6 +629,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, p= md_t *owner) return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) =3D=3D owner; } =20 +static inline int pmd_get_pte(pmd_t *pmd) +{ + return page_ref_inc_return(pmd_page(*pmd)); +} + +/* + * If the COW PTE refcount is 1, instead of decreasing the counter, + * clear write protection of the corresponding PMD entry and reset + * the COW PTE owner to reuse the table. + * But if the reuse parameter is false, do not thing. This help us + * to handle the situation that PTE table we already handled. 
+ */
+static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, bool reuse)
+{
+	if (!page_ref_add_unless(pmd_page(*pmd), -1, 1) && reuse) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int cow_pte_count(pmd_t *pmd)
+{
+	return page_count(pmd_page(*pmd));
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))

diff --git a/mm/memory.c b/mm/memory.c
index d29f84801f3cd..3e66e229f4169 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2875,6 +2875,7 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t new;

 	VM_WARN_ON(pmd_write(*pmd));
+	VM_WARN_ON(cow_pte_count(pmd) != 1);

 	start = addr & PMD_MASK;
 	end = (addr + PMD_SIZE) & PMD_MASK;
-- 
2.37.3

From nobody Mon Apr 6 11:51:38 2026
From: Chih-En Lin
To: Andrew Morton , Qi Zheng , David Hildenbrand , Matthew Wilcox , Christophe Leroy
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A .
Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Nadav Amit , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng , Chih-En Lin
Subject: [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag
Date: Wed, 28 Sep 2022 00:29:54 +0800
Message-Id: <20220927162957.270460-7-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>
References: <20220927162957.270460-1-shiyn.lin@gmail.com>

Under the present COW logic (for physical pages), some pages cannot be
shared in certain situations (e.g., pinned pages). To keep COW PTE
consistent with that logic, introduce the COW_PTE_OWNER_EXCLUSIVE flag,
which prevents the PTE table from being COWed during fork().

Current users of the exclusive flag:

- GUP: pinning a page does not mix well with COW of the physical page,
  so GUP currently does not do COW for pinned pages. Follow the same
  rule here for the PTE table.

Signed-off-by: Chih-En Lin
---
 include/linux/pgtable.h | 18 ++++++++++++++++++
 mm/gup.c                | 13 +++++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b497d7d800ed..9b08a3361d490 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,6 +656,24 @@ static inline int cow_pte_count(pmd_t *pmd)
 	return page_count(pmd_page(*pmd));
 }

+/* Keep the first bit clear. See more detail in the comments of struct page.
 */
+#define COW_PTE_OWNER_EXCLUSIVE	((pmd_t *) 0x02UL)
+
+static inline void pmd_cow_pte_mkexclusive(pmd_t *pmd)
+{
+	set_cow_pte_owner(pmd, COW_PTE_OWNER_EXCLUSIVE);
+}
+
+static inline bool pmd_cow_pte_exclusive(pmd_t *pmd)
+{
+	return cow_pte_owner_is_same(pmd, COW_PTE_OWNER_EXCLUSIVE);
+}
+
+static inline void pmd_cow_pte_clear_mkexclusive(pmd_t *pmd)
+{
+	set_cow_pte_owner(pmd, NULL);
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))

diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf4874605..4949c8d42a400 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -634,6 +634,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		mark_page_accessed(page);
 	}
 out:
+	/*
+	 * We don't share the PTE when any other pinned page exists, and
+	 * the exclusive flag sticks around until the table is freed.
+	 */
+	pmd_cow_pte_mkexclusive(pmd);
 	pte_unmap_unlock(ptep, ptl);
 	return page;
 no_page:
@@ -932,6 +937,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map(pmd, address);
 	if (pte_none(*pte))
 		goto unmap;
+	pmd_cow_pte_clear_mkexclusive(pmd);
 	*vma = get_gate_vma(mm);
 	if (!page)
 		goto out;
@@ -2764,8 +2770,11 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, PMD_SHIFT,
 					 next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
-			return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
+				return 0;
+			pmd_cow_pte_mkexclusive(&pmd);
+		}
 	} while (pmdp++, addr = next, addr != end);

 	return 1;
-- 
2.37.3

From nobody Mon Apr 6 11:51:38 2026
From: Chih-En Lin
To: Andrew Morton , Qi Zheng , David Hildenbrand , Matthew Wilcox , Christophe Leroy
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A . Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Nadav Amit , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng , Chih-En Lin
Subject: [RFC PATCH v2 7/9] mm: Add the break COW PTE handler
Date: Wed, 28 Sep 2022 00:29:55 +0800
Message-Id: <20220927162957.270460-8-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>
References: <20220927162957.270460-1-shiyn.lin@gmail.com>

To handle a write fault on a COWed PTE table, introduce the helper
function handle_cow_pte(). It provides two behaviours: one breaks COW by
decreasing the refcount, pgtable_bytes, and RSS; the other copies all the
information in the shared PTE table using a wrapper around
copy_pte_range().
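A minimal userspace sketch of the two behaviours handle_cow_pte() provides (break-by-copy versus dereference). The struct and names here are hypothetical stand-ins for the kernel objects, not the actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of handle_cow_pte()'s two behaviours:
 * 'alloc' selects copying the shared table (break_cow_pte analogue)
 * versus just dropping our reference (zap_cow_pte analogue).
 * All names and fields are illustrative. */
struct cow_table {
    int refcount;
    bool owner_is_me;
    int copies_made;   /* counts copy-to-new-table events */
};

static int handle_cow(struct cow_table *t, bool alloc)
{
    if (t->refcount == 1)     /* sole user: reuse the table as-is */
        return 1;             /* fallback path */
    t->refcount--;            /* stop sharing the COWed table */
    if (alloc)
        t->copies_made++;     /* copy the shared entries to a new table */
    if (t->owner_is_me)
        t->owner_is_me = false;  /* hand off ownership; RSS fixed elsewhere */
    return 0;
}
```
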
Also, add wrapper functions to determine whether a PTE table is COWed or
COW-available.

Signed-off-by: Chih-En Lin
---
 include/linux/pgtable.h |  75 +++++++++++++++++
 mm/memory.c             | 179 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 254 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9b08a3361d490..85255f5223ae3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -10,6 +10,7 @@

 #include
 #include
+#include	/* For MMF_COW_PTE flag */
 #include
 #include
 #include
@@ -674,6 +675,42 @@ static inline void pmd_cow_pte_clear_mkexclusive(pmd_t *pmd)
 	set_cow_pte_owner(pmd, NULL);
 }

+static inline unsigned long get_pmd_start_edge(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	unsigned long start = addr & PMD_MASK;
+
+	if (start < vma->vm_start)
+		start = vma->vm_start;
+
+	return start;
+}
+
+static inline unsigned long get_pmd_end_edge(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	unsigned long end = (addr + PMD_SIZE) & PMD_MASK;
+
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	return end;
+}
+
+static inline bool is_cow_pte_available(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	if (!vma || !pmd)
+		return false;
+	if (!test_bit(MMF_COW_PTE, &vma->vm_mm->flags))
+		return false;
+	if (pmd_cow_pte_exclusive(pmd))
+		return false;
+	return true;
+}
+
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		bool alloc);
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
@@ -1002,6 +1039,44 @@ int cow_pte_handler(struct ctl_table *table, int write, void *buffer,

 extern int sysctl_cow_pte_pid;

+static inline bool __is_pte_table_cowing(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	if (!vma)
+		return false;
+	if (!pmd) {
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+
+		if (addr == 0)
+			return false;
+
+		pgd = pgd_offset(vma->vm_mm, addr);
+		if
 (pgd_none_or_clear_bad(pgd))
+			return false;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return false;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return false;
+		pmd = pmd_offset(pud, addr);
+	}
+	if (!test_bit(MMF_COW_PTE, &vma->vm_mm->flags))
+		return false;
+	if (pmd_none(*pmd) || pmd_write(*pmd))
+		return false;
+	if (pmd_cow_pte_exclusive(pmd))
+		return false;
+	return true;
+}
+
+static inline bool is_pte_table_cowing(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	return __is_pte_table_cowing(vma, pmd, 0UL);
+}
+
 #endif /* CONFIG_MMU */

 /*
diff --git a/mm/memory.c b/mm/memory.c
index 3e66e229f4169..4cf3f74fb183f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2911,6 +2911,185 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pmd_at(mm, addr, pmd, new);
 }

+static inline int copy_cow_pte_range(struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range;
+	int ret;
+	bool is_cow;
+
+	is_cow = is_cow_mapping(vma->vm_flags);
+	if (is_cow) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+				0, vma, mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+		mmap_assert_write_locked(mm);
+		raw_write_seqcount_begin(&mm->write_protect_seq);
+	}
+
+	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);
+
+	if (is_cow) {
+		raw_write_seqcount_end(&mm->write_protect_seq);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
+	return ret;
+}
+
+/*
+ * Break COW PTE, two states here:
+ * - After fork : [parent, rss=1, ref=2, write=NO , owner=parent]
+ *            to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *   COW PTE becomes [ref=1, write=NO , owner=NULL  ]
+ *               [child , rss=0, ref=2, write=NO , owner=parent]
+ *            to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ *   COW PTE becomes [ref=1, write=NO ,
owner=parent]
+ * NOTE
+ * - Copy the COW PTE to the new PTE.
+ * - Clear the owner of the COW PTE and set the PMD entry writable when
+ *   we are the owner.
+ * - Increase RSS if we are not the owner.
+ */
+static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long pte_start, pte_end;
+	unsigned long start, end;
+	struct vm_area_struct *prev = vma->vm_prev;
+	struct vm_area_struct *next = vma->vm_next;
+	pmd_t cowed_entry = *pmd;
+
+	if (cow_pte_count(&cowed_entry) == 1) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+
+	pte_start = start = addr & PMD_MASK;
+	pte_end = end = (addr + PMD_SIZE) & PMD_MASK;
+
+	pmd_clear(pmd);
+	/*
+	 * If the vma does not cover the entire address range of the PTE
+	 * table, check the previous and next vma as well.
+	 */
+	if (start < vma->vm_start && prev) {
+		/* Part of the address range is covered by the previous vma. */
+		if (start < prev->vm_end)
+			copy_cow_pte_range(prev, pmd, &cowed_entry,
+					start, prev->vm_end);
+		start = vma->vm_start;
+	}
+	if (end > vma->vm_end && next) {
+		/* Part of the address range is covered by the next vma. */
+		if (end > next->vm_start)
+			copy_cow_pte_range(next, pmd, &cowed_entry,
+					next->vm_start, end);
+		end = vma->vm_end;
+	}
+	if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end))
+		return -ENOMEM;
+
+	/*
+	 * Here we are the owner, so clear the ownership. To keep the RSS
+	 * state and page table bytes correct, decrease both of them.
+	 * Also handle the address range issue here.
+	 */
+	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
+		set_cow_pte_owner(&cowed_entry, NULL);
+		if (pte_start < vma->vm_start && prev &&
+		    pte_start < prev->vm_end)
+			cow_pte_rss(mm, vma->vm_prev, pmd,
+				    pte_start, prev->vm_end, false /* dec */);
+		if (pte_end > vma->vm_end && next &&
+		    pte_end > next->vm_start)
+			cow_pte_rss(mm, vma->vm_next, pmd,
+				    next->vm_start, pte_end, false /* dec */);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	/* Already handled it; don't reuse the cowed table. */
+	pmd_put_pte(vma, &cowed_entry, addr, false);
+
+	VM_BUG_ON(cow_pte_count(pmd) != 1);
+
+	return 0;
+}
+
+static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+
+	if (pmd_put_pte(vma, pmd, addr, true)) {
+		/* fallback, reuse pgtable */
+		return 1;
+	}
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/*
+	 * If the PMD entry is the owner, clear the ownership and
+	 * decrease the RSS state and pgtable_bytes.
+	 */
+	if (cow_pte_owner_is_same(pmd, pmd)) {
+		set_cow_pte_owner(pmd, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_clear(pmd);
+	return 0;
+}
+
+/**
+ * handle_cow_pte - Break COW PTE; copy or dereference the shared PTE table
+ * @vma: target vma that wants to break COW
+ * @pmd: pmd index that maps to the shared PTE table
+ * @addr: the address that triggered the break COW
+ * @alloc: copy the PTE table if alloc is true, otherwise dereference it
+ *
+ * The address needs to be in the range of the PTE table that the pmd index
+ * maps. If pmd is NULL, the pmd is looked up from the vma and checked for
+ * COWing.
+ */
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		bool alloc)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+
+	if (!pmd) {
+		pgd = pgd_offset(mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return 0;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return 0;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return 0;
+		pmd = pmd_offset(pud, addr);
+	}
+
+	if (!is_pte_table_cowing(vma, pmd))
+		return 0;
+
+	if (alloc)
+		ret = break_cow_pte(vma, pmd, addr);
+	else
+		ret = zap_cow_pte(vma, pmd, addr);
+
+	return ret;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically. Before making any commitment, on those architectures
-- 
2.37.3

From nobody Mon Apr 6 11:51:38 2026
From: Chih-En Lin
To: Andrew Morton , Qi Zheng , David Hildenbrand , Matthew Wilcox , Christophe Leroy
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A .
Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Nadav Amit , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng , Chih-En Lin
Subject: [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm
Date: Wed, 28 Sep 2022 00:29:56 +0800
Message-Id: <20220927162957.270460-9-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>
References: <20220927162957.270460-1-shiyn.lin@gmail.com>

To prevent the PFRA from reclaiming a page that resides in a COWed PTE
table, break COW when rmap is used to unmap the page from all processes.

Signed-off-by: Chih-En Lin
---
 include/linux/rmap.h | 2 ++
 mm/page_vma_mapped.c | 5 +++++
 mm/rmap.c            | 2 +-
 mm/swapfile.c        | 1 +
 mm/vmscan.c          | 1 +
 5 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b89b4b86951f8..5c7e3bedc068b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -312,6 +312,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Break COW PTE during the walk */
+#define PVMW_COW_PTE		(1 << 2)

 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535aa..5008957bbe4a7 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -251,6 +251,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
+
+		/* TODO: Is breaking COW PTE here correct?
 */
+		if (pvmw->flags & PVMW_COW_PTE)
+			handle_cow_pte(vma, pvmw->pmd, pvmw->address, false);
+
 		if (!map_pte(pvmw))
 			goto next_pte;
this_pte:
diff --git a/mm/rmap.c b/mm/rmap.c
index 93d5a6f793d20..8f737cb44e48a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1477,7 +1477,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_COW_PTE);
 	pte_t pteval;
 	struct page *subpage;
 	bool anon_exclusive, ret = true;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fdccd2f1422e..ef4d3d81a824b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1916,6 +1916,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+		handle_cow_pte(vma, pmd, addr, false);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2b1431352dcd..030fad3d310d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1822,6 +1822,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		/*
 		 * The folio is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
+		 * Unmapping writes to the page tables, so break COW PTE here.
 		 */
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
-- 
2.37.3

From nobody Mon Apr 6 11:51:38 2026
From: Chih-En Lin
To: Andrew Morton , Qi Zheng , David Hildenbrand , Matthew Wilcox , Christophe Leroy
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A .
Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Nadav Amit , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng , Chih-En Lin
Subject: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
Date: Wed, 28 Sep 2022 00:29:57 +0800
Message-Id: <20220927162957.270460-10-shiyn.lin@gmail.com>
In-Reply-To: <20220927162957.270460-1-shiyn.lin@gmail.com>
References: <20220927162957.270460-1-shiyn.lin@gmail.com>

This patch adds the Copy-On-Write (COW) mechanism to the PTE table. To
enable the COW page table, write the corresponding PID to the sysctl
vm.cow_pte file. This sets the MMF_COW_PTE_READY flag on the process,
enabling COW PTE at the next fork. The MMF_COW_PTE flag distinguishes
the normal page table from the COW one. Moreover, since it is difficult
to tell when the entire page table has left the COW state, the
MMF_COW_PTE flag is never cleared once set.

Since each process has its own page-table memory in kernel space, the
address of the PMD entry is used as the PTE table's ownership token to
identify which process needs to update the shared page-table state. In
other words, only the owner updates the shared (COWed) PTE table's
state, such as the RSS and pgtable_bytes.

Some PTE tables (e.g., those with pinned pages residing in them) still
need to be copied immediately for consistency with the current COW
logic. As a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether
a PTE table is exclusive (i.e., only one task owns it at a time), is
encoded in the table's owner pointer.
Every time a PTE table is copied during fork, the owner pointer (and
thus the exclusive flag) is checked to determine whether the PTE table
can be shared across processes.

A reference count tracks the lifetime of a COWed PTE table. Forking with
COW PTE increases the refcount, and writing to a COWed PTE table causes
a write fault that breaks COW. If the COWed PTE table's refcount is one,
the faulting process reuses the table. Otherwise, the process decreases
the refcount, copies the information to a new PTE table (or drops all
the references), and transfers ownership if it held it.

If COW were applied to the PTE table each time a PMD entry is touched,
the reference count of the COWed PTE table could not be kept correct:
because a VMA's address range may partially overlap a PTE table, the
copying function traverses the page table per VMA and could increase the
COWed table's refcount several times during one COW page-table fork,
whereas it should increase only once as the child takes its reference.
To solve this, before doing the COW, check that the destination PMD
entry already exists and that the source PTE table's refcount is greater
than one.

This patch modifies the page-table copying code to do the basic COW. To
break COW, it modifies the page-fault, page-table zapping, unmapping,
and remapping paths.
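The lifecycle described above — fork shares the table and bumps its refcount, and a write fault either reuses the table or copies it out — can be modelled in a few lines of userspace C. All names are illustrative; this is not the kernel code path:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the COW PTE table lifecycle:
 * cow_fork() shares the table (write-protect + refcount bump);
 * write_fault() breaks COW, reusing at refcount 1, copying otherwise. */
struct shared_pte {
    int refcount;
    bool writable;
};

static void cow_fork(struct shared_pte *t)
{
    t->writable = false;   /* pmdp_set_wrprotect() analogue */
    t->refcount++;         /* child now references the table */
}

/* Returns true if the faulting task switched to a private copy,
 * false if it reused the shared table in place. */
static bool write_fault(struct shared_pte *t)
{
    if (t->refcount == 1) {
        t->writable = true;   /* reuse: cow_pte_fallback() analogue */
        return false;
    }
    t->refcount--;            /* this task copies out to a fresh table */
    return true;
}
```
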
Signed-off-by: Chih-En Lin
---
 mm/memory.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/mmap.c   |  3 ++
 mm/mremap.c |  3 ++
 3 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4cf3f74fb183f..c532448b5e086 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -250,6 +250,9 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		VM_BUG_ON(cow_pte_count(pmd) != 1);
+		if (!pmd_cow_pte_exclusive(pmd))
+			VM_BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
 		free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
@@ -1006,7 +1009,12 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		/*
+		 * If the parent's PTE table is COWing, keep it as it is.
+		 * Don't set wrprotect on that table.
+		 */
+		if (!__is_pte_table_cowing(src_vma, NULL, addr))
+			ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 	VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
@@ -1197,11 +1205,64 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			continue;
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
-		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
-			return -ENOMEM;
+
+		if (is_cow_pte_available(src_vma, src_pmd)) {
+			/*
+			 * Setting wrprotect on the pmd entry will trigger
+			 * pmd_bad() for a normal PTE table. Skip the bad
+			 * checking here.
+			 */
+			if (pmd_none(*src_pmd))
+				continue;
+
+			/* Skip if the PTE table was already COWed this time. */
+			if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd))
+				continue;
+
+			/*
+			 * If the PTE table doesn't have an owner, the parent
+			 * needs to take this PTE table.
+			 */
+			if (cow_pte_owner_is_same(src_pmd, NULL)) {
+				set_cow_pte_owner(src_pmd, src_pmd);
+				/*
+				 * XXX: The process may COW PTE fork twice,
+				 * and in some situations the owner has been
+				 * cleared: in a previous COW PTE fork the
+				 * child (now acting as the parent) shared
+				 * the table, but the previous parent, the
+				 * owner, broke COW. So the RSS state and
+				 * pgtable bytes need to be added back.
+				 */
+				if (!pmd_write(*src_pmd)) {
+					cow_pte_rss(src_mm, src_vma, src_pmd,
+						    get_pmd_start_edge(src_vma,
+								       addr),
+						    get_pmd_end_edge(src_vma,
+								     addr),
+						    true /* inc */);
+					/* Do we need the pt lock here? */
+					mm_inc_nr_ptes(src_mm);
+					/* See the comments in pmd_install(). */
+					smp_wmb();
+					pmd_populate(src_mm, src_pmd,
+						     pmd_page(*src_pmd));
+				}
+			}
+
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+
+			/* Child reference count */
+			pmd_get_pte(src_pmd);
+
+			/* COW for PTE table */
+			set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
+		} else {
+			if (pmd_none_or_clear_bad(src_pmd))
+				continue;
+			if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+					   addr, next))
+				return -ENOMEM;
+		}
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1594,6 +1655,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			spin_unlock(ptl);
 	}
 
+	/* TODO: Does the TLB need to flush page info in the COWed table? */
+	if (is_pte_table_cowing(vma, pmd))
+		handle_cow_pte(vma, pmd, addr, false);
+
 	/*
 	 * Here there can be other concurrent MADV_DONTNEED or
 	 * trans huge page faults running, and if the pmd is
@@ -5321,6 +5386,16 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			return 0;
 		}
 	}
+
+	/*
+	 * When the PMD entry is set with write protection, handle the
+	 * on-demand PTE table: allocate a new PTE table, copy the old
+	 * one, then set this entry writable and decrease the reference
+	 * count of the COWed PTE table.
+	 */
+	if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
+			   cow_pte_count(&vmf.orig_pmd) > 1) < 0)
+		return VM_FAULT_OOM;
 	}
 
 	return handle_pte_fault(&vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3c..463359292f8a9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2685,6 +2685,9 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		return err;
 	}
 
+	if (handle_cow_pte(vma, NULL, addr, true) < 0)
+		return -ENOMEM;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index b522cd0259a0f..14f6ad250289c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,6 +532,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+
+		handle_cow_pte(vma, old_pmd, old_addr, true);
+
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-- 
2.37.3