From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:30:40 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Before putting them to use (several commits later), add rcu_read_lock() to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a separate commit, since it risks exposing imbalances: prior commits have fixed all the known imbalances, but we may find some have been missed. Signed-off-by: Hugh Dickins --- include/linux/pgtable.h | 4 ++-- mm/pgtable-generic.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..5134edcec668 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsign= ed long address) ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address))) #define pte_unmap(pte) do { \ kunmap_local((pte)); \ - /* rcu_read_unlock() to be added later */ \ + rcu_read_unlock(); \ } while (0) #else static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address) @@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned lon= g address) } static inline void pte_unmap(pte_t *pte) { - /* rcu_read_unlock() to be added later */ + rcu_read_unlock(); } #endif =20 diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 4d454953046f..400e5a045848 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr,= pmd_t *pmdvalp) { pmd_t pmdval; =20 - /* rcu_read_lock() to be added later */ + rcu_read_lock(); pmdval =3D pmdp_get_lockless(pmd); if (pmdvalp) *pmdvalp =3D pmdval; @@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr,= pmd_t *pmdvalp) } return __pte_map(&pmdval, addr); nomap: - /* rcu_read_unlock() to be added later */ + rcu_read_unlock(); return NULL; } =20 --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 128B6C001DC for ; Wed, 12 Jul 2023 04:32:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230402AbjGLEcP (ORCPT ); Wed, 12 Jul 2023 00:32:15 -0400 Received: from 
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:32:05 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map() In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: <3adcd8f-9191-2df1-d7ea-c4877698aad@google.com> References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is a faint risk that __pte_offset_map(), on a 32-bit architecture with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=3Dy, would succeed on a pmdval assembled from a pmd_low and a pmd_high which never belonged together: their combination not pointing to a page table at all, perhaps not even a valid pfn. pmdp_get_lockless() is not enough to prevent that. Guard against that (on such configs) by local_irq_save() blocking TLB flush between present updates, as linux/pgtable.h suggests. It's only needed around the pmdp_get_lockless() in __pte_offset_map(): a race when __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the lock, would just send it back to __pte_offset_map() again. Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(), used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync() synonym for tlb_remove_table_sync_one(): to send the necessary interrupt at the right moment on those configs which do not already send it. CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit. Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a little more work, to retry if pmd_low good for page table, but pmd_high non-zero from THP (and that might be making x86-specific assumptions). 
Signed-off-by: Hugh Dickins
---
 include/linux/pgtable.h |  4 ++++
 mm/pgtable-generic.c    | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 	return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */

@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
 	return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+	unsigned long irqflags;
+
+	local_irq_save(irqflags);
+	return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+	local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long irqflags;
 	pmd_t pmdval;

 	rcu_read_lock();
+	irqflags = pmdp_get_lockless_start();
 	pmdval = pmdp_get_lockless(pmd);
+	pmdp_get_lockless_end(irqflags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3

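A sketch of the writer-side pairing this scheme assumes, with a hypothetical
helper detach_and_sync() (not from this patch) and the surrounding locking
elided: once the pmd has been cleared of its page-table pointer,
pmdp_get_lockless_sync() sends the interrupt which cannot be serviced while
any lockless reader still holds interrupts off around pmdp_get_lockless().

/*
 * Hypothetical helper, for illustration only: real callers (khugepaged,
 * later in this series) hold the appropriate locks around the pmd clear.
 */
static void detach_and_sync(struct vm_area_struct *vma, pmd_t *pmd,
			    unsigned long addr)
{
	pmd_t pmdval;

	pmdval = pmdp_collapse_flush(vma, addr, pmd);	/* *pmd becomes none, TLB flushed */
	pmdp_get_lockless_sync();	/* IPI: waits out interrupts-off readers */
	/* only now is it safe to start freeing the table at pmd_pgtable(pmdval) */
}
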
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:33:08 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Subject: [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()
In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>
Message-ID: <4d5258bd-ffa0-018-253a-25f2c9b783f7@google.com>
References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
 	if (!pte)
 		return 0;

-	ptl = pte_lockptr(vma->vm_mm, pmd);
 	do_pte_lock(ptl);

 	ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3

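The calling pattern adopted here, shown as a hedged sketch with a
hypothetical touch_one_pte() rather than adjust_pte() itself:
pte_offset_map_nolock() maps the pte under RCU and reports the spinlock
guarding exactly that page table, leaving the locking to the caller.

/*
 * Sketch only: generic caller of pte_offset_map_nolock(), which may fail
 * if there is no page table (or a THP) at this pmd.
 */
static bool touch_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;		/* no page table here any more */

	spin_lock(ptl);
	/* ... examine or modify *pte while the pte lock is held ... */
	spin_unlock(ptl);
	pte_unmap(pte);
	return true;
}
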
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:34:25 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Subject: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>
References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins
---
 arch/powerpc/mm/pgtable.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;

 	if (mm == &init_mm)
 		return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
-	/*
-	 * khugepaged to collapse normal pages to hugepage, first set
-	 * pmd to none to force page fault/gup to take mmap_lock. After
-	 * pmd is set to none, we do a pte_clear which does this assertion
-	 * so if we find pmd none, return.
-	 */
-	if (pmd_none(*pmd))
-		return;
-	BUG_ON(!pmd_present(*pmd));
-	assert_spin_locked(pte_lockptr(mm, pmd));
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	BUG_ON(!pte);
+	assert_spin_locked(ptl);
+	pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */

-- 
2.35.3

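The same idea as a hedged, self-contained sketch (hypothetical
assert_addr_pte_locked(), not the powerpc code): pte_lockptr() would
re-dereference the pmd and so could return the lock of a page table
installed concurrently, whereas the ptl returned by
pte_offset_map_nolock() belongs to the page table actually mapped.

static void assert_addr_pte_locked(struct mm_struct *mm, pmd_t *pmd,
				   unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);

	BUG_ON(!pte);			/* stricter: a page table must be here */
	assert_spin_locked(ptl);	/* and its lock must already be held */
	pte_unmap(pte);
}
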
Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: <6e3ca5f1-334d-4b14-b92d-fc8e99914fcb@google.com> References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add powerpc-specific pte_free_defer(), to free table page via call_rcu(). pte_free_defer() will be called inside khugepaged's retract_page_tables() loop, where allocating extra memory cannot be relied upon. This precedes the generic version to avoid build breakage from incompatible pgtable_t. This is awkward because the struct page contains only one rcu_head, but that page may be shared between PTE_FRAG_NR pagetables, each wanting to use the rcu_head at the same time. But powerpc never reuses a fragment once it has been freed: so mark the page Active in pte_free_defer(), before calling pte_fragment_free() directly; and there call_rcu() to pte_free_now() when last fragment is freed and the page is PageActive. 
Suggested-by: Jason Gunthorpe
Signed-off-by: Hugh Dickins
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }

+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }

+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)

 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3

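A simplified model, not the powerpc code itself, of the rule described
above: with one rcu_head per page but PTE_FRAG_NR fragments sharing it,
pte_free_defer() only marks the page PageActive, and the final
fragment-free decides whether an RCU grace period is still owed
(pte_free_now() stands for the patch's RCU callback).

/* pte_free_now() is the patch's RCU callback: page table dtor + __free_page(). */
static void pte_free_now(struct rcu_head *head);

static void shared_page_put(struct page *page)
{
	if (!atomic_dec_and_test(&page->pt_frag_refcount))
		return;			/* other fragments of this page still in use */
	if (TestClearPageActive(page))
		call_rcu(&page->rcu_head, pte_free_now);	/* a deferral was requested */
	else
		pte_free_now(&page->rcu_head);			/* no deferral: free at once */
}
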
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:37:24 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Subject: [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t
In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>
References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.

So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins
---
 arch/sparc/include/asm/pgalloc_64.h |  4 ++++
 arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);

+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-- 
2.35.3

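The only sparc-specific wrinkle, restated as a sketch with hypothetical
helper names table_to_page()/page_to_table(): pgtable_t is a pte_t * on
sparc64, so the deferral borrows the struct page's rcu_head via
virt_to_page(), and the callback recovers the table with page_address().

static struct page *table_to_page(pgtable_t pgtable)
{
	return virt_to_page(pgtable);		/* pte_t * -> struct page * */
}

static pgtable_t page_to_table(struct page *page)
{
	return (pgtable_t)page_address(page);	/* struct page * -> pte_t * */
}
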
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:38:35 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Subject: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>
Message-ID: <94eccf5f-264c-8abe-4567-e77f4b4e14a@google.com>
References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use.  In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later in
__tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins
Reviewed-by: Gerald Schaefer
Acked-by: Alexander Gordeev
Tested-by: Alexander Gordeev
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 80 +++++++++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			table += PTRS_PER_PTE;
 			atomic_xor_bits(&page->_refcount,
 					0x01U << (bit + 24));
-			list_del(&page->lru);
+			list_del_init(&page->lru);
 		}
 	}
 	spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];

-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }

+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
 	unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if ((mask & 0x03U) && !PageActive(page)) {
+		/*
+		 * Other half is allocated, and neither half has had
+		 * its free deferred: add page to head of list, to make
+		 * this freed half available for immediate reuse.
+		 */
 		list_add(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	} else {
+		/* If page is on list, now remove it. */
+		list_del_init(&page->lru);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 	mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	}

 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }

 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if ((mask & 0x03U) && !PageActive(page)) {
+		/*
+		 * Other half is allocated, and neither half has had
+		 * its free deferred: add page to end of list, to make
+		 * this freed half available for reuse once its pending
+		 * bit has been cleared by __tlb_remove_table().
+		 */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	} else {
+		/* If page is on list, now remove it. */
+		list_del_init(&page->lru);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
@@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
 	}

 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
-- 
2.35.3

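The new list rule, condensed into a sketch predicate (hypothetical helper,
not the s390 code): a freed 2K half may stay on mm->context.pgtable_list
for reuse only while the other half is still allocated and no
pte_free_defer() has touched the page; PageActive marks that a deferral
happened.

static bool may_stay_on_pgtable_list(struct page *page, unsigned int mask)
{
	/* mask & 0x03U: the other 2K half of this 4K page is still allocated */
	return (mask & 0x03U) && !PageActive(page);
}
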
From nobody Sat Feb 7 18:20:51 2026
Date: Tue, 11 Jul 2023 21:39:48 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
Subject: [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page
In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>
Message-ID: <78e921b0-b681-a1b0-dc20-44c9efa4ef3c@google.com>
References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com>

Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins
---
 include/linux/mm_types.h |  4 ++++
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c     | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			/*
+			 * A PTE page table page might be freed by use of
+			 * rcu_head: which overlays those two fields above.
+			 */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif

+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
*/ #ifndef pmd_offset static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index b9a0c2137cc1..fa9d4d084291 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -13,6 +13,7 @@ #include #include #include +#include #include =20 /* @@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, = unsigned long address, return pmd; } #endif + +/* arch define pte_free_defer in asm/pgalloc.h for its own implementation = */ +#ifndef pte_free_defer +static void pte_free_now(struct rcu_head *head) +{ + struct page *page; + + page =3D container_of(head, struct page, rcu_head); + pte_free(NULL /* mm not passed and not used */, (pgtable_t)page); +} + +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) +{ + struct page *page; + + page =3D pgtable; + call_rcu(&page->rcu_head, pte_free_now); +} +#endif /* pte_free_defer */ #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \ --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 644FAEB64D9 for ; Wed, 12 Jul 2023 04:41:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230464AbjGLElQ (ORCPT ); Wed, 12 Jul 2023 00:41:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54708 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229843AbjGLElN (ORCPT ); Wed, 12 Jul 2023 00:41:13 -0400 Received: from mail-oi1-x233.google.com (mail-oi1-x233.google.com [IPv6:2607:f8b0:4864:20::233]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A6B5F195 for ; Tue, 11 Jul 2023 21:41:10 -0700 (PDT) Received: by mail-oi1-x233.google.com with SMTP id 5614622812f47-3a3b7f992e7so4539393b6e.2 for ; Tue, 11 Jul 2023 21:41:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689136870; x=1691728870; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=pdyvejeRbaj0R0BaC9Cc3znYRUv3s+ZMzSpoEQD9sfY=; b=61XLdsWLlwNNltnknx8ZjoR3EUlUK1Bj/tcmgFfsWXA5aEpxm/PqyvJFzxt5BuWIF5 zvkWTe7pBGsdce8DeFpFHH4Wv/K1wf5gUEvTUmFwZ+ZEj2uTMPjaV81PW6un2ztL4cJ5 9Ojyupv1pMoBHR+B3ZOxdkHaWiIu1oytKgz668XEKRlUKWdagmZpxzNOsz/QaT8WYz4C 9ahjf6cCwFgvF6+Dv8kjsNK8eR0ApMsjPotTekDn9sMm+XfiULYXYS62qM1PSv68kQHL u/ggVV8M7nVTjthYMrmq77CcFpe53snWzgS02pgsIvmeEXYcD9Ow8tPml9gf+ldm9rwg OqTA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689136870; x=1691728870; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=pdyvejeRbaj0R0BaC9Cc3znYRUv3s+ZMzSpoEQD9sfY=; b=kyeIYP+cETlcP/ipH6S1FM2trTbH+brkpeLOgScRBviL33H8HeXj9tY7tsfjNIVp6u l2qAe+U5lAqyBBb6mpCCxj1Qdh2ks1SymvyLL67jZE4ui0bkxlw/Jr5CMWP+JmNiEpp0 HM8y4CygRXJ1EcULV6q8f587AWLAlDfqEXts9eMDa9yoGTm1uCpI3Cad+Xpk5a4XbJGb zwOpMw+lmvd4oWnxJc44DPaB6CU7h1RyYIA2XlIHFO+DNfsMWQ/IV3xzmtD8b9+jPlQA k1lbkKwVk0xzK/nOHB3sQdfOTTqCDqtyNm3HkaZqOvL6UccH9lRQvXskQGzq7w4ERDDa /PPQ== X-Gm-Message-State: ABy/qLYR37uNPiJRFW/c+rr/uebnA5jwciWwLaWaYkWA5SpQc4piGvqo 1h8SB8HGEA6aiy7JOS/f2rIzFQ== X-Google-Smtp-Source: APBJJlGPp9vKO9NKxpObH1s2Z1lE514jxJK2RUhjTAY7j52rpXq2fwG1pZB3jtlZM66ZqR38ppt1QA== X-Received: by 
2002:a05:6808:1689:b0:3a3:64a3:b5a1 with SMTP id bb9-20020a056808168900b003a364a3b5a1mr17094671oib.7.1689136869723; Tue, 11 Jul 2023 21:41:09 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id u203-20020a8184d4000000b005772abf6234sm970493ywf.11.2023.07.11.21.41.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Jul 2023 21:41:09 -0700 (PDT) Date: Tue, 11 Jul 2023 21:41:04 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton cc: Mike Kravetz , Mike Rapoport , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Simplify shmem and file THP collapse's retract_page_tables(), and relax its locking: to improve its success rate and to lessen impact on others. Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of target_mm, leave that part of the work to madvise_collapse() calling collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result code to arrange for that. That spares retract_page_tables() four arguments; and since it will be successful in retracting all of the page tables expected of it, no need to track and return a result code itself. It needs i_mmap_lock_read(mapping) for traversing the vma interval tree, but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk() allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for THPs. retract_page_tables() just needs to use those same spinlocks to exclude it briefly, while transitioning pmd from page table to none: so restore its use of pmd_lock() inside of which pte lock is nested. Users of pte_offset_map_lock() etc all now allow for them to fail: so retract_page_tables() now has no use for mmap_write_trylock() or vma_try_start_write(). In common with rmap and page_vma_mapped_walk(), it does not even need the mmap_read_lock(). But those users do expect the page table to remain a good page table, until they unlock and rcu_read_unlock(): so the page table cannot be freed immediately, but rather by the recently added pte_free_defer(). 
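
For illustration only (not part of this patch), a minimal sketch of the lockless
reader pattern being relied on above, using the pte_offset_map()/pte_unmap()
interface from earlier in this series; the function name is hypothetical.
pte_offset_map() takes rcu_read_lock(), so even if a racing retract_page_tables()
has already cleared the pmd and handed the page table to pte_free_defer(), the
detached table stays a valid (though perhaps empty) table until pte_unmap()
drops the RCU read lock; only after that grace period does call_rcu() free it:

static bool example_pte_is_present(pmd_t *pmd, unsigned long addr)
{
        pte_t *pte;
        bool ret;

        pte = pte_offset_map(pmd, addr);        /* takes rcu_read_lock() */
        if (!pte)
                return false;   /* no page table here (any more): caller copes */
        ret = pte_present(ptep_get(pte));
        pte_unmap(pte);         /* rcu_read_unlock(): table may be freed after this */
        return ret;
}
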
Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt when PAE, and pmdp_collapse_flush() did not already do so: to make sure that the start,pmdp_get_lockless(),end sequence in __pte_offset_map() cannot pick up a pmd entry with mismatched pmd_low and pmd_high. retract_page_tables() can be enhanced to replace_page_tables(), which inserts the final huge pmd without mmap lock: going through an invalid state instead of pmd_none() followed by fault. But that enhancement does raise some more questions: leave it until a later release. Signed-off-by: Hugh Dickins --- mm/khugepaged.c | 184 ++++++++++++++++++++------------------------------ 1 file changed, 75 insertions(+), 109 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 78c8d5d8b628..3bb05147961b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, uns= igned long addr, break; case SCAN_PMD_NONE: /* - * In MADV_COLLAPSE path, possible race with khugepaged where - * all pte entries have been removed and pmd cleared. If so, - * skip all the pte checks and just update the pmd mapping. + * All pte entries have been removed and pmd cleared. + * Skip all the pte checks and just update the pmd mapping. */ goto maybe_install_pmd; default: @@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(st= ruct khugepaged_mm_slot *mm_sl mmap_write_unlock(mm); } =20 -static int retract_page_tables(struct address_space *mapping, pgoff_t pgof= f, - struct mm_struct *target_mm, - unsigned long target_addr, struct page *hpage, - struct collapse_control *cc) +static void retract_page_tables(struct address_space *mapping, pgoff_t pgo= ff) { struct vm_area_struct *vma; - int target_result =3D SCAN_FAIL; =20 - i_mmap_lock_write(mapping); + i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { - int result =3D SCAN_FAIL; - struct mm_struct *mm =3D NULL; - unsigned long addr =3D 0; - pmd_t *pmd; - bool is_target =3D false; + struct mmu_notifier_range range; + struct mm_struct *mm; + unsigned long addr; + pmd_t *pmd, pgt_pmd; + spinlock_t *pml; + spinlock_t *ptl; + bool skipped_uffd =3D false; =20 /* * Check vma->anon_vma to exclude MAP_PRIVATE mappings that - * got written to. These VMAs are likely not worth investing - * mmap_write_lock(mm) as PMD-mapping is likely to be split - * later. - * - * Note that vma->anon_vma check is racy: it can be set up after - * the check but before we took mmap_lock by the fault path. - * But page lock would prevent establishing any new ptes of the - * page, so we are safe. - * - * An alternative would be drop the check, but check that page - * table is clear before calling pmdp_collapse_flush() under - * ptl. It has higher chance to recover THP for the VMA, but - * has higher cost too. It would also probably require locking - * the anon_vma. + * got written to. These VMAs are likely not worth removing + * page tables from, as PMD-mapping is likely to be split later. 
*/ - if (READ_ONCE(vma->anon_vma)) { - result =3D SCAN_PAGE_ANON; - goto next; - } + if (READ_ONCE(vma->anon_vma)) + continue; + addr =3D vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); if (addr & ~HPAGE_PMD_MASK || - vma->vm_end < addr + HPAGE_PMD_SIZE) { - result =3D SCAN_VMA_CHECK; - goto next; - } - mm =3D vma->vm_mm; - is_target =3D mm =3D=3D target_mm && addr =3D=3D target_addr; - result =3D find_pmd_or_thp_or_none(mm, addr, &pmd); - if (result !=3D SCAN_SUCCEED) - goto next; - /* - * We need exclusive mmap_lock to retract page table. - * - * We use trylock due to lock inversion: we need to acquire - * mmap_lock while holding page lock. Fault path does it in - * reverse order. Trylock is a way to avoid deadlock. - * - * Also, it's not MADV_COLLAPSE's job to collapse other - * mappings - let khugepaged take care of them later. - */ - result =3D SCAN_PTE_MAPPED_HUGEPAGE; - if ((cc->is_khugepaged || is_target) && - mmap_write_trylock(mm)) { - /* trylock for the same lock inversion as above */ - if (!vma_try_start_write(vma)) - goto unlock_next; - - /* - * Re-check whether we have an ->anon_vma, because - * collapse_and_free_pmd() requires that either no - * ->anon_vma exists or the anon_vma is locked. - * We already checked ->anon_vma above, but that check - * is racy because ->anon_vma can be populated under the - * mmap lock in read mode. - */ - if (vma->anon_vma) { - result =3D SCAN_PAGE_ANON; - goto unlock_next; - } - /* - * When a vma is registered with uffd-wp, we can't - * recycle the pmd pgtable because there can be pte - * markers installed. Skip it only, so the rest mm/vma - * can still have the same file mapped hugely, however - * it'll always mapped in small page size for uffd-wp - * registered ranges. - */ - if (hpage_collapse_test_exit(mm)) { - result =3D SCAN_ANY_PROCESS; - goto unlock_next; - } - if (userfaultfd_wp(vma)) { - result =3D SCAN_PTE_UFFD_WP; - goto unlock_next; - } - collapse_and_free_pmd(mm, vma, addr, pmd); - if (!cc->is_khugepaged && is_target) - result =3D set_huge_pmd(vma, addr, pmd, hpage); - else - result =3D SCAN_SUCCEED; - -unlock_next: - mmap_write_unlock(mm); - goto next; - } - /* - * Calling context will handle target mm/addr. Otherwise, let - * khugepaged try again later. - */ - if (!is_target) { - khugepaged_add_pte_mapped_thp(mm, addr); + vma->vm_end < addr + HPAGE_PMD_SIZE) continue; + + mm =3D vma->vm_mm; + if (find_pmd_or_thp_or_none(mm, addr, &pmd) !=3D SCAN_SUCCEED) + continue; + + if (hpage_collapse_test_exit(mm)) + continue; + /* + * When a vma is registered with uffd-wp, we cannot recycle + * the page table because there may be pte markers installed. + * Other vmas can still have the same file mapped hugely, but + * skip this one: it will always be mapped in small page size + * for uffd-wp registered ranges. + */ + if (userfaultfd_wp(vma)) + continue; + + /* PTEs were notified when unmapped; but now for the PMD? */ + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, + addr, addr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + + pml =3D pmd_lock(mm, pmd); + ptl =3D pte_lockptr(mm, pmd); + if (ptl !=3D pml) + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + + /* + * Huge page lock is still held, so normally the page table + * must remain empty; and we have already skipped anon_vma + * and userfaultfd_wp() vmas. But since the mmap_lock is not + * held, it is still possible for a racing userfaultfd_ioctl() + * to have inserted ptes or markers. 
Now that we hold ptlock, + * repeating the anon_vma check protects from one category, + * and repeating the userfaultfd_wp() check from another. + */ + if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) { + skipped_uffd =3D true; + } else { + pgt_pmd =3D pmdp_collapse_flush(vma, addr, pmd); + pmdp_get_lockless_sync(); + } + + if (ptl !=3D pml) + spin_unlock(ptl); + spin_unlock(pml); + + mmu_notifier_invalidate_range_end(&range); + + if (!skipped_uffd) { + mm_dec_nr_ptes(mm); + page_table_check_pte_clear_range(mm, addr, pgt_pmd); + pte_free_defer(mm, pmd_pgtable(pgt_pmd)); } -next: - if (is_target) - target_result =3D result; } - i_mmap_unlock_write(mapping); - return target_result; + i_mmap_unlock_read(mapping); } =20 /** @@ -2259,9 +2223,11 @@ static int collapse_file(struct mm_struct *mm, unsig= ned long addr, =20 /* * Remove pte page tables, so we can re-fault the page as huge. + * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp(). */ - result =3D retract_page_tables(mapping, start, mm, addr, hpage, - cc); + retract_page_tables(mapping, start); + if (cc && !cc->is_khugepaged) + result =3D SCAN_PTE_MAPPED_HUGEPAGE; unlock_page(hpage); =20 /* --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AFE7DEB64D9 for ; Wed, 12 Jul 2023 04:42:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231620AbjGLEma (ORCPT ); Wed, 12 Jul 2023 00:42:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55160 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229843AbjGLEm1 (ORCPT ); Wed, 12 Jul 2023 00:42:27 -0400 Received: from mail-yb1-xb2c.google.com (mail-yb1-xb2c.google.com [IPv6:2607:f8b0:4864:20::b2c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2D0211731 for ; Tue, 11 Jul 2023 21:42:25 -0700 (PDT) Received: by mail-yb1-xb2c.google.com with SMTP id 3f1490d57ef6-c17534f4c63so7714780276.0 for ; Tue, 11 Jul 2023 21:42:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689136944; x=1691728944; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=Su/l5hM860ozuqpCkqwtzzC0FfEEdtSDi6e1iwuYpug=; b=GZoRZcbxUGx8JmnbydjZ65tUfTG/RCwN/jSORiifmfhsG46/OVpitJnVjI7Tvfrx66 yQKLrj9HSgcav8y1NjkbsLVKwlBLhiB07ffHvy0W5j9nFzsbypZd0htmKlFz47fp9Brd vb96MPWyt92vLrnMOzyhlT2f9c0j2h+mhC/orwGiKeqNf8uE52GrM39x9xkEKYzFJz/J Y7nqbs80ZE9xDlpAf1eoisZIqt1BHL1m+rcz6cqjFJmL7Dpw/Q2zTxvLuFLsfewvvmcc oBMm70r0MkpKOdO4rfwiGTYxRwmWgDWr98VsaQ1GlpPMjr10FdTViBtnADZdkEupuAWp FhZw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689136944; x=1691728944; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Su/l5hM860ozuqpCkqwtzzC0FfEEdtSDi6e1iwuYpug=; b=SWY6IWjEQUi6KQ0puzCyP0m1P0y7EY5axXmWfmNrlu4M0XW7t5FdMfskqQRFSwhOdY PstdY/+88taMXs92hRtU7t/CG+GhWRAZHs5Bo4b8JEKPk3TKa+w3ANzI+RJu6LlULff/ y3Tk+d8DU5JJbqWo/9Z5QWoh8XDdINAhEYpKzfzDlLQC1PrzctDBgQlT/Ff8Of6V/fsf CfDg33Q9l3ivrnDOF3ANrhobTAucqbYU1ozWQ7QcHdxTzRIoPEB6pCZCWUW0iyvfKgxT rnl5ofsH2aa2MHTnX443nZz4ARu21cIfd52zFbLzDXP1Qp2wlOoph2tIzAiBrbbllak6 6KlA== X-Gm-Message-State: 
ABy/qLaihqrI52CRHeRkPhKmjojlcXn8s80GMEsfSqgqDg5Kjg883qhY 225lWS+IWGBPpd6BCaJBJlb90g== X-Google-Smtp-Source: APBJJlFyWWTfqTiLrJedDsm7ZDw08POtyd5zLp0uOv6BSUnlUd1ofH5+6axcSto3EVAkbgdVFZKvLA== X-Received: by 2002:a25:c343:0:b0:c16:8d80:228b with SMTP id t64-20020a25c343000000b00c168d80228bmr14716106ybf.37.1689136944086; Tue, 11 Jul 2023 21:42:24 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id m9-20020a258c89000000b00c4ec3a3f695sm752090ybl.46.2023.07.11.21.42.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Jul 2023 21:42:23 -0700 (PDT) Date: Tue, 11 Jul 2023 21:42:19 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton cc: Mike Kravetz , Mike Rapoport , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp(). It does need mmap_read_lock(), but it does not need mmap_write_lock(), nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing paths are relying on pte_offset_map_lock() and pmd_lock(), so use those. Follow the pattern in retract_page_tables(); and using pte_free_defer() removes most of the need for tlb_remove_table_sync_one() here; but call pmdp_get_lockless_sync() to use it in the PAE case. First check the VMA, in case page tables are being torn down: from JannH. Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been acquired and the page looks suitable: from then on its state is stable. However, collapse_pte_mapped_thp() was doing something others don't: freeing a page table still containing "valid" entries. i_mmap lock did stop a racing truncate from double-freeing those pages, but we prefer collapse_pte_mapped_thp() to clear the entries as usual. Their TLB flush can wait until the pmdp_collapse_flush() which follows, but the mmu_notifier_invalidate_range_start() has to be done earlier. Do the "step 1" checking loop without mmu_notifier: it wouldn't be good for khugepaged to keep on repeatedly invalidating a range which is then found unsuitable e.g. contains COWs. "step 2", which does the clearing, must then be more careful (after dropping ptl to do mmu_notifier), with abort prepared to correct the accounting like "step 3". 
But with those entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept safe by the huge page lock, which stops new PTEs from being faulted in. Signed-off-by: Hugh Dickins Reviewed-by: Qi Zheng --- mm/khugepaged.c | 172 ++++++++++++++++++++++---------------------------- 1 file changed, 77 insertions(+), 95 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3bb05147961b..46986eb4eebb 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_s= truct *mm, return ret; } =20 -/* hpage must be locked, and mmap_lock must be held in write */ +/* hpage must be locked, and mmap_lock must be held */ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct page *hpage) { @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, u= nsigned long addr, }; =20 VM_BUG_ON(!PageTransHuge(hpage)); - mmap_assert_write_locked(vma->vm_mm); + mmap_assert_locked(vma->vm_mm); =20 if (do_set_pmd(&vmf, hpage)) return SCAN_FAIL; @@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, = unsigned long addr, return SCAN_SUCCEED; } =20 -/* - * A note about locking: - * Trying to take the page table spinlocks would be useless here because t= hose - * are only used to synchronize: - * - * - modifying terminal entries (ones that point to a data page, not to a= nother - * page table) - * - installing *new* non-terminal entries - * - * Instead, we need roughly the same kind of protection as free_pgtables()= or - * mm_take_all_locks() (but only for a single VMA): - * The mmap lock together with this VMA's rmap locks covers all paths towa= rds - * the page table entries we're messing with here, except for hardware page - * table walks and lockless_pages_from_mm(). - */ -static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_str= uct *vma, - unsigned long addr, pmd_t *pmdp) -{ - pmd_t pmd; - struct mmu_notifier_range range; - - mmap_assert_write_locked(mm); - if (vma->vm_file) - lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem); - /* - * All anon_vmas attached to the VMA have the same root and are - * therefore locked by the same lock. - */ - if (vma->anon_vma) - lockdep_assert_held_write(&vma->anon_vma->root->rwsem); - - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, - addr + HPAGE_PMD_SIZE); - mmu_notifier_invalidate_range_start(&range); - pmd =3D pmdp_collapse_flush(vma, addr, pmdp); - tlb_remove_table_sync_one(); - mmu_notifier_invalidate_range_end(&range); - mm_dec_nr_ptes(mm); - page_table_check_pte_clear_range(mm, addr, pmd); - pte_free(mm, pmd_pgtable(pmd)); -} - /** * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at * address haddr. 
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct = *mm, struct vm_area_struct *v int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, bool install_pmd) { + struct mmu_notifier_range range; + bool notified =3D false; unsigned long haddr =3D addr & HPAGE_PMD_MASK; struct vm_area_struct *vma =3D vma_lookup(mm, haddr); struct page *hpage; pte_t *start_pte, *pte; - pmd_t *pmd; - spinlock_t *ptl; - int count =3D 0, result =3D SCAN_FAIL; + pmd_t *pmd, pgt_pmd; + spinlock_t *pml, *ptl; + int nr_ptes =3D 0, result =3D SCAN_FAIL; int i; =20 - mmap_assert_write_locked(mm); + mmap_assert_locked(mm); + + /* First check VMA found, in case page tables are being torn down */ + if (!vma || !vma->vm_file || + !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) + return SCAN_VMA_CHECK; =20 /* Fast check before locking page if already PMD-mapped */ result =3D find_pmd_or_thp_or_none(mm, haddr, &pmd); if (result =3D=3D SCAN_PMD_MAPPED) return result; =20 - if (!vma || !vma->vm_file || - !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) - return SCAN_VMA_CHECK; - /* * If we are here, we've succeeded in replacing all the native pages * in the page cache with a single hugepage. If a mm were to fault-in @@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, uns= igned long addr, goto drop_hpage; } =20 + result =3D find_pmd_or_thp_or_none(mm, haddr, &pmd); switch (result) { case SCAN_SUCCEED: break; @@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, u= nsigned long addr, goto drop_hpage; } =20 - /* Lock the vma before taking i_mmap and page table locks */ - vma_start_write(vma); - - /* - * We need to lock the mapping so that from here on, only GUP-fast and - * hardware page walks can access the parts of the page tables that - * we're operating on. - * See collapse_and_free_pmd(). - */ - i_mmap_lock_write(vma->vm_file->f_mapping); - - /* - * This spinlock should be unnecessary: Nobody else should be accessing - * the page tables under spinlock protection here, only - * lockless_pages_from_mm() and the hardware page walker can access page - * tables while all the high-level locks are held in write mode. 
- */ result =3D SCAN_FAIL; start_pte =3D pte_offset_map_lock(mm, pmd, haddr, &ptl); - if (!start_pte) - goto drop_immap; + if (!start_pte) /* mmap_lock + page lock should prevent this */ + goto drop_hpage; =20 /* step 1: check all mapped PTEs are to the right huge page */ for (i =3D 0, addr =3D haddr, pte =3D start_pte; @@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, u= nsigned long addr, */ if (hpage + i !=3D page) goto abort; - count++; } =20 - /* step 2: adjust rmap */ + pte_unmap_unlock(start_pte, ptl); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, + haddr, haddr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + notified =3D true; + start_pte =3D pte_offset_map_lock(mm, pmd, haddr, &ptl); + if (!start_pte) /* mmap_lock + page lock should prevent this */ + goto abort; + + /* step 2: clear page table and adjust rmap */ for (i =3D 0, addr =3D haddr, pte =3D start_pte; i < HPAGE_PMD_NR; i++, addr +=3D PAGE_SIZE, pte++) { struct page *page; @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, u= nsigned long addr, =20 if (pte_none(ptent)) continue; - page =3D vm_normal_page(vma, addr, ptent); - if (WARN_ON_ONCE(page && is_zone_device_page(page))) + /* + * We dropped ptl after the first scan, to do the mmu_notifier: + * page lock stops more PTEs of the hpage being faulted in, but + * does not stop write faults COWing anon copies from existing + * PTEs; and does not stop those being swapped out or migrated. + */ + if (!pte_present(ptent)) { + result =3D SCAN_PTE_NON_PRESENT; goto abort; + } + page =3D vm_normal_page(vma, addr, ptent); + if (hpage + i !=3D page) + goto abort; + + /* + * Must clear entry, or a racing truncate may re-remove it. + * TLB flush can be left until pmdp_collapse_flush() does it. + * PTE dirty? Shmem page is already dirty; file is read-only. + */ + pte_clear(mm, addr, pte); page_remove_rmap(page, vma, false); + nr_ptes++; } =20 pte_unmap_unlock(start_pte, ptl); =20 /* step 3: set proper refcount and mm_counters. */ - if (count) { - page_ref_sub(hpage, count); - add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count); + if (nr_ptes) { + page_ref_sub(hpage, nr_ptes); + add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes); } =20 - /* step 4: remove pte entries */ - /* we make no change to anon, but protect concurrent anon page lookup */ - if (vma->anon_vma) - anon_vma_lock_write(vma->anon_vma); + /* step 4: remove page table */ =20 - collapse_and_free_pmd(mm, vma, haddr, pmd); + /* Huge page lock is still held, so page table must remain empty */ + pml =3D pmd_lock(mm, pmd); + if (ptl !=3D pml) + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + pgt_pmd =3D pmdp_collapse_flush(vma, haddr, pmd); + pmdp_get_lockless_sync(); + if (ptl !=3D pml) + spin_unlock(ptl); + spin_unlock(pml); =20 - if (vma->anon_vma) - anon_vma_unlock_write(vma->anon_vma); - i_mmap_unlock_write(vma->vm_file->f_mapping); + mmu_notifier_invalidate_range_end(&range); + + mm_dec_nr_ptes(mm); + page_table_check_pte_clear_range(mm, haddr, pgt_pmd); + pte_free_defer(mm, pmd_pgtable(pgt_pmd)); =20 maybe_install_pmd: /* step 5: install pmd entry */ result =3D install_pmd ? 
set_huge_pmd(vma, haddr, pmd, hpage) : SCAN_SUCCEED; - + goto drop_hpage; +abort: + if (nr_ptes) { + flush_tlb_mm(mm); + page_ref_sub(hpage, nr_ptes); + add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes); + } + if (start_pte) + pte_unmap_unlock(start_pte, ptl); + if (notified) + mmu_notifier_invalidate_range_end(&range); drop_hpage: unlock_page(hpage); put_page(hpage); return result; - -abort: - pte_unmap_unlock(start_pte, ptl); -drop_immap: - i_mmap_unlock_write(vma->vm_file->f_mapping); - goto drop_hpage; } =20 static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot = *mm_slot) @@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, stru= ct vm_area_struct **prev, case SCAN_PTE_MAPPED_HUGEPAGE: BUG_ON(mmap_locked); BUG_ON(*prev); - mmap_write_lock(mm); + mmap_read_lock(mm); result =3D collapse_pte_mapped_thp(mm, addr, true); - mmap_write_unlock(mm); + mmap_locked =3D true; goto handle_result; /* Whitelisted set of results where continuing OK */ case SCAN_PMD_NULL: --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B166EB64DD for ; Wed, 12 Jul 2023 04:43:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231669AbjGLEns (ORCPT ); Wed, 12 Jul 2023 00:43:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55562 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229843AbjGLEno (ORCPT ); Wed, 12 Jul 2023 00:43:44 -0400 Received: from mail-yw1-x1136.google.com (mail-yw1-x1136.google.com [IPv6:2607:f8b0:4864:20::1136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 681CD1736 for ; Tue, 11 Jul 2023 21:43:42 -0700 (PDT) Received: by mail-yw1-x1136.google.com with SMTP id 00721157ae682-56fff21c2ebso73766197b3.3 for ; Tue, 11 Jul 2023 21:43:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689137021; x=1691729021; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=32J2R02jsERy50AfXzN+Fv7mkbkTcayWxoVtpmMI0mo=; b=kQgLOireKqLbQKrB+SyHisW6tHedPZ4G73Quib/4dgQ29ZxZF0aUUC9/AB1YAXJTmW YfN2VLL4oxjBytasR0DDmdsA/5SwE00Qr2eUa6/BDHduLqwOHYcIxbv7cRXIJlWzEw2m uoqeOZ8DbWFaiY63frmb1j9LBXR1ks+8b73KgPMot86t+OVw+MvtSmilEn/5eKnnRo9K SAxgb45RUrxIr6smOzqkeoGVSl5EJPs5usxApXvZRRMwRcK/6dkO1aWOTsyUkX+UXkby QrSLYbHURK0PWj/JH/BKH9qrF8fgxguY0Wuc8PElVG57pF2x2Uxoa29tGTGp97vsnB2j jLjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689137021; x=1691729021; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=32J2R02jsERy50AfXzN+Fv7mkbkTcayWxoVtpmMI0mo=; b=lU2GiNzJgh0GOXk3I/7U2eRqI//rnYiTs8kAlnBzBPJOuaCVcKQNgyFbogCohm0JlP 95VKNDFI6xRNG+hU272csYx4cv1v6/E4p4pbTJCP9L2vIjvzvQUE9BZ132A4SdsDHZlF Y5059GPkkUxngajxHEYRt+BEtfpcJYVIBDE9RNyGd4plxgeWkHpibYvSF6F0TQ7+Lydy aLUJCgljgD5PGrhYuN5I63hiia6W8dfAOp2BGbbEqEBmXL+ChNIza7oZ9wfOlvg76jrw wDEsNKr4TrgqtT8zAFlD0h9xACVbwtukt/iVzbJ+RNxUIYUVlo7F0saBt34uNq7SlFxB 8eJw== X-Gm-Message-State: ABy/qLaxLNvsxsYeslWfP2I2VDWVGX3Ts+XzFkAa7uN+tvKyJfsQSPLA HbffveDLdlTC5L1rHv0WMKxagw== X-Google-Smtp-Source: APBJJlEKLBPT5DnCx42X0tK5DSIxj7Xpwg5AizJaj2C86j321LP85ClVc2RWHCxTTmThFbhjxFV/0w== X-Received: by 
2002:a81:4e46:0:b0:577:42be:1804 with SMTP id c67-20020a814e46000000b0057742be1804mr17031953ywb.29.1689137021419; Tue, 11 Jul 2023 21:43:41 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id h62-20020a815341000000b0057682d3f95fsm981159ywb.136.2023.07.11.21.43.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Jul 2023 21:43:41 -0700 (PDT) Date: Tue, 11 Jul 2023 21:43:36 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton cc: Mike Kravetz , Mike Rapoport , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Now that retract_page_tables() can retract page tables reliably, without depending on trylocks, delete all the apparatus for khugepaged to try again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the per-mm memory which was set aside for that in the khugepaged_mm_slot. But one part of that is worth keeping: when hpage_collapse_scan_file() found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to be tried for retraction later - catching, for example, page tables where a reversible mprotect() of a portion had required splitting the pmd, but now it can be recollapsed. Call collapse_pte_mapped_thp() directly in this case (why was it deferred before? I assume an issue with needing mmap_lock for write, but now it's only needed for read). 
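
In outline, the file/shmem scan loop now handles that result inline rather than
stashing the address for a later pass (a simplified sketch of the hunk applied
below, which additionally keeps mmap_locked in sync with the outer loop and maps
SCAN_PMD_MAPPED to SCAN_SUCCEED):

        *result = hpage_collapse_scan_file(mm, addr, file, pgoff, cc);
        if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
                mmap_read_lock(mm);
                if (!hpage_collapse_test_exit(mm))
                        *result = collapse_pte_mapped_thp(mm, addr, false);
                mmap_read_unlock(mm);
        }
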
Signed-off-by: Hugh Dickins --- mm/khugepaged.c | 125 +++++++------------------------------------------- 1 file changed, 16 insertions(+), 109 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 46986eb4eebb..7c7aaddbe130 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLO= TS_HASH_BITS); =20 static struct kmem_cache *mm_slot_cache __read_mostly; =20 -#define MAX_PTE_MAPPED_THP 8 - struct collapse_control { bool is_khugepaged; =20 @@ -107,15 +105,9 @@ struct collapse_control { /** * struct khugepaged_mm_slot - khugepaged information per mm that is being= scanned * @slot: hash lookup from mm to mm_slot - * @nr_pte_mapped_thp: number of pte mapped THP - * @pte_mapped_thp: address array corresponding pte mapped THP */ struct khugepaged_mm_slot { struct mm_slot slot; - - /* pte-mapped THP in this mm */ - int nr_pte_mapped_thp; - unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP]; }; =20 /** @@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slo= t *mm_slot) } =20 #ifdef CONFIG_SHMEM -/* - * Notify khugepaged that given addr of the mm is pte-mapped THP. Then - * khugepaged should try to collapse the page table. - * - * Note that following race exists: - * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struc= t A, - * emptying the A's ->pte_mapped_thp[] array. - * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, a= nd - * retract_page_tables() finds a VMA in mm_struct A mapping the same e= xtent - * (at virtual address X) and adds an entry (for X) into mm_struct A's - * ->pte-mapped_thp[] array. - * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at= X, - * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry - * (for X) into mm_struct A's ->pte-mapped_thp[] array. - * Thus, it's possible the same address is added multiple times for the sa= me - * mm_struct. Should this happen, we'll simply attempt - * collapse_pte_mapped_thp() multiple times for the same address, under th= e same - * exclusive mmap_lock, and assuming the first call is successful, subsequ= ent - * attempts will return quickly (without grabbing any additional locks) wh= en - * a huge pmd is found in find_pmd_or_thp_or_none(). Since this is a cheap - * check, and since this is a rare occurrence, the cost of preventing this - * "multiple-add" is thought to be more expensive than just handling it, s= hould - * it occur. 
- */ -static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr) -{ - struct khugepaged_mm_slot *mm_slot; - struct mm_slot *slot; - bool ret =3D false; - - VM_BUG_ON(addr & ~HPAGE_PMD_MASK); - - spin_lock(&khugepaged_mm_lock); - slot =3D mm_slot_lookup(mm_slots_hash, mm); - mm_slot =3D mm_slot_entry(slot, struct khugepaged_mm_slot, slot); - if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) { - mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] =3D addr; - ret =3D true; - } - spin_unlock(&khugepaged_mm_lock); - return ret; -} - /* hpage must be locked, and mmap_lock must be held */ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct page *hpage) @@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, un= signed long addr, return result; } =20 -static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot = *mm_slot) -{ - struct mm_slot *slot =3D &mm_slot->slot; - struct mm_struct *mm =3D slot->mm; - int i; - - if (likely(mm_slot->nr_pte_mapped_thp =3D=3D 0)) - return; - - if (!mmap_write_trylock(mm)) - return; - - if (unlikely(hpage_collapse_test_exit(mm))) - goto out; - - for (i =3D 0; i < mm_slot->nr_pte_mapped_thp; i++) - collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false); - -out: - mm_slot->nr_pte_mapped_thp =3D 0; - mmap_write_unlock(mm); -} - static void retract_page_tables(struct address_space *mapping, pgoff_t pgo= ff) { struct vm_area_struct *vma; @@ -2370,16 +2295,6 @@ static int hpage_collapse_scan_file(struct mm_struct= *mm, unsigned long addr, { BUILD_BUG(); } - -static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot = *mm_slot) -{ -} - -static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr) -{ - return false; -} #endif =20 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *resul= t, @@ -2409,7 +2324,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned = int pages, int *result, khugepaged_scan.mm_slot =3D mm_slot; } spin_unlock(&khugepaged_mm_lock); - khugepaged_collapse_pte_mapped_thps(mm_slot); =20 mm =3D slot->mm; /* @@ -2462,36 +2376,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigne= d int pages, int *result, khugepaged_scan.address); =20 mmap_read_unlock(mm); - *result =3D hpage_collapse_scan_file(mm, - khugepaged_scan.address, - file, pgoff, cc); mmap_locked =3D false; + *result =3D hpage_collapse_scan_file(mm, + khugepaged_scan.address, file, pgoff, cc); + if (*result =3D=3D SCAN_PTE_MAPPED_HUGEPAGE) { + mmap_read_lock(mm); + mmap_locked =3D true; + if (hpage_collapse_test_exit(mm)) { + fput(file); + goto breakouterloop; + } + *result =3D collapse_pte_mapped_thp(mm, + khugepaged_scan.address, false); + if (*result =3D=3D SCAN_PMD_MAPPED) + *result =3D SCAN_SUCCEED; + } fput(file); } else { *result =3D hpage_collapse_scan_pmd(mm, vma, - khugepaged_scan.address, - &mmap_locked, - cc); + khugepaged_scan.address, &mmap_locked, cc); } - switch (*result) { - case SCAN_PTE_MAPPED_HUGEPAGE: { - pmd_t *pmd; =20 - *result =3D find_pmd_or_thp_or_none(mm, - khugepaged_scan.address, - &pmd); - if (*result !=3D SCAN_SUCCEED) - break; - if (!khugepaged_add_pte_mapped_thp(mm, - khugepaged_scan.address)) - break; - } fallthrough; - case SCAN_SUCCEED: + if (*result =3D=3D SCAN_SUCCEED) ++khugepaged_pages_collapsed; - break; - default: - break; - } =20 /* move to next address */ khugepaged_scan.address +=3D HPAGE_PMD_SIZE; --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68350EB64DA for ; Wed, 12 Jul 2023 04:45:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231322AbjGLEpS (ORCPT ); Wed, 12 Jul 2023 00:45:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56648 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231758AbjGLEpI (ORCPT ); Wed, 12 Jul 2023 00:45:08 -0400 Received: from mail-yb1-xb31.google.com (mail-yb1-xb31.google.com [IPv6:2607:f8b0:4864:20::b31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8E3D01980 for ; Tue, 11 Jul 2023 21:45:03 -0700 (PDT) Received: by mail-yb1-xb31.google.com with SMTP id 3f1490d57ef6-c50c797c31bso7317863276.0 for ; Tue, 11 Jul 2023 21:45:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689137103; x=1691729103; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=FwWrXdLh4e+ZZMmHSG9GahXduxLzpg1IKzk9+Pois54=; b=gzISCkNIkoYstaiURssLgcAP0NHrER6I1tuZ8rc0Qvm4Mprsdiz1G+fVe2F78Gadly Zd6Zv7R4mMYit6gDmv3ASM0ewnC89JI2TxXgi3FWSUoDcd6h1cjVrlXHTbjoLAQWnXuN ZQUv1RJ1RAUpKOWClvwrVKEYf90QtijHOwpAofLxIipQ4ubW+8saarTIudeMK/r5lyew pHuqVs8/tIX7WT6ZYHIjSpcSZfFYwHc9ruaHZLElCwezBNkxHQx4zaVjKMhWpdolCgWJ CTIkHHX4v8I9yJRfGqJT3RXI/emCB7p8zIkR9VuDxyTUEDm4M6z0824h8U/TQtrRCylu LH5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689137103; x=1691729103; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=FwWrXdLh4e+ZZMmHSG9GahXduxLzpg1IKzk9+Pois54=; b=BSP5AA3FTFI9JsJDV8hgRjembLS+YSdbw4InQUQB3vKIVcdyFbRu/ll3yTfu5awhlg neK1B7YBrCMkXUTG2EUMINonmDqm5YSHEoKIclmHBxTs7UMMeI1ad6IxI5j6/gWR7HKX tDYe9LbOhlkVZWA+G2fNJIn5iNW+zCA2GIpiMg+qjeqNRnEqvzDTkz93s3TQOXAAjdb0 lkEegYzSnSpAg4nL7j/Ym6pkQBjLpzupM5m7W1wdTC6H/ci0n4ZhfcID9RpspJdFZaFZ veYFD1QARFNgReZNc/QjHa+4kDZ5QaRjjmXA/wpahLPAgWVsq4weMZti2rH2MHOXscH3 M2bA== X-Gm-Message-State: ABy/qLZRkV8NB8LMXaHffMo/Fckz+baciv3gw2nwvB4RB9fdJdp8M7YZ ZKjpztMI8O3PrhU6FSX0r05nSw== X-Google-Smtp-Source: APBJJlHRrkvIp6Bta0OiFd96ROrSVoBT2nJnQYkUAOeyOnPlbKa4aui5kARvVpJZZxshvdKBo9N7Xg== X-Received: by 2002:a81:ab51:0:b0:577:bc6:6f8c with SMTP id d17-20020a81ab51000000b005770bc66f8cmr17906536ywk.26.1689137102617; Tue, 11 Jul 2023 21:45:02 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id r11-20020a0de80b000000b00561949f713fsm993186ywe.39.2023.07.11.21.44.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Jul 2023 21:45:02 -0700 (PDT) Date: Tue, 11 Jul 2023 21:44:57 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton cc: Mike Kravetz , Mike Rapoport , "Kirill A. 
Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write() In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: <728dae79-5110-e3c4-df27-ce3df525aaef@google.com> References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" mmap_write_trylock() and vma_try_start_write() were added just for khugepaged, but now it has no use for them: delete. Signed-off-by: Hugh Dickins --- include/linux/mm.h | 17 ----------------- include/linux/mmap_lock.h | 10 ---------- 2 files changed, 27 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..b7b45be616ad 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_stru= ct *vma) up_write(&vma->vm_lock->lock); } =20 -static inline bool vma_try_start_write(struct vm_area_struct *vma) -{ - int mm_lock_seq; - - if (__is_vma_write_locked(vma, &mm_lock_seq)) - return true; - - if (!down_write_trylock(&vma->vm_lock->lock)) - return false; - - vma->vm_lock_seq =3D mm_lock_seq; - up_write(&vma->vm_lock->lock); - return true; -} - static inline void vma_assert_write_locked(struct vm_area_struct *vma) { int mm_lock_seq; @@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct= *vma) { return false; } static inline void vma_end_read(struct vm_area_struct *vma) {} static inline void vma_start_write(struct vm_area_struct *vma) {} -static inline bool vma_try_start_write(struct vm_area_struct *vma) - { return true; } static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) {} diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index aab8f1b28d26..d1191f02c7fa 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_s= truct *mm) return ret; } =20 -static inline bool mmap_write_trylock(struct mm_struct *mm) -{ - bool ret; - - __mmap_lock_trace_start_locking(mm, true); - ret =3D down_write_trylock(&mm->mmap_lock) !=3D 0; - __mmap_lock_trace_acquire_returned(mm, true, ret); - return ret; -} - static inline void mmap_write_unlock(struct mm_struct *mm) { __mmap_lock_trace_released(mm, true); --=20 2.35.3 From nobody Sat Feb 7 18:20:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org 
(vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80184EB64DD for ; Wed, 12 Jul 2023 04:46:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231532AbjGLEqg (ORCPT ); Wed, 12 Jul 2023 00:46:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230235AbjGLEqc (ORCPT ); Wed, 12 Jul 2023 00:46:32 -0400 Received: from mail-yw1-x112e.google.com (mail-yw1-x112e.google.com [IPv6:2607:f8b0:4864:20::112e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 27AC81733 for ; Tue, 11 Jul 2023 21:46:30 -0700 (PDT) Received: by mail-yw1-x112e.google.com with SMTP id 00721157ae682-579ed2829a8so68751737b3.1 for ; Tue, 11 Jul 2023 21:46:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689137189; x=1691729189; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=4LOZtvGEF5g0eQvRdR0fiwT6tMgm62bdW0dz/Hhyqlk=; b=J1OyC/PzRxAcAnxReOGyZoimdKHV4sne6EV+ZF5GnSzIGkj49DIxf8CJZXSiYaLo8O rQDVnXnu0y21FRcOCZP0OeEhb8w1Ae51u8RxjVwTlnPwlHyRZu5pHWxjcdG3sLbNOkgm l9DRu2V+r91JT6sg8uxZ0xdueMEWKHVeYHw/ucTn/rvDkwQeKDUMIXdd66/4pn+ZPDqc bbexzf/2xXArSFmFzEjH5dtsWbOgASWzeyhuG3Z+bgzkYYJreHWjygSN0MaZVhpwfXUc /vz1Ww9YOcG9dsJAekHocNmUdgaeY/Jq/K6dI5tFUgzSXd3eubGhE9SWhC2zJPkylOJH UPbw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689137189; x=1691729189; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4LOZtvGEF5g0eQvRdR0fiwT6tMgm62bdW0dz/Hhyqlk=; b=Gy2k/umRA/uSZTczpd1B7Y9Xg6E6JI1RON16q3XMRsjElIgcoPieT5Bxa+XMKtUmk3 qvndgF5mYoBxjgrnZd63IM+o0SJyVIJQrzzFKp2k6wCnJrD7ynUnRVLt16O8URZ6sfOM dXLAlTv/vUcsuuGyfb/bB1tJKyGAADiawKKfC2tuDz3ebDSkTHTzvL9mpyGJ7dDKMtaP MDnyX4nQffpLhlicqLSsC17wONygmQvhVj60E/dU+Fah6os3R10w0HQgRgvAMr2Mtqsn u9vDMBZh+JBbaOUo/5LJnimLxa5vusE/0Xw9cGgGcb1pCLvXRO479V7Tu4nYGtUGkYVC 0rAA== X-Gm-Message-State: ABy/qLZAJPf1+Dt3ylrErTNmNrkZnkUao96l2bbC20wT/XCnHZKzFkan HsRMrlOZTrbCMcSEDRVWTuekZQ== X-Google-Smtp-Source: APBJJlGajCIPfwdsjGp7sM9eUSR1neGCrgzZWT479LGOvJujy6hRuJqugeEnCaM3mBFakTQjRwJQ+w== X-Received: by 2002:a81:5dd6:0:b0:562:16d7:e6eb with SMTP id r205-20020a815dd6000000b0056216d7e6ebmr17894738ywb.40.1689137189022; Tue, 11 Jul 2023 21:46:29 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id j126-20020a0df984000000b005772e9388cdsm969335ywf.62.2023.07.11.21.46.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Jul 2023 21:46:28 -0700 (PDT) Date: Tue, 11 Jul 2023 21:46:23 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton cc: Mike Kravetz , Mike Rapoport , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Lorenzo Stoakes , Huang Ying , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. 
Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]() In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a block of comments on pte_offset_map_lock(), pte_offset_map() and pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them. Signed-off-by: Hugh Dickins --- mm/pgtable-generic.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index fa9d4d084291..4fcd959dcc4d 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd= _t *pmd, return pte; } =20 +/* + * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementati= on + * __pte_offset_map_lock() below, is usually called with the pmd pointer f= or + * addr, reached by walking down the mm's pgd, p4d, pud for addr: either w= hile + * holding mmap_lock or vma lock for read or for write; or in truncate or = rmap + * context, while holding file's i_mmap_lock or anon_vma lock for read (or= for + * write). In a few cases, it may be used with pmd pointing to a pmd_t alr= eady + * copied to or constructed on the stack. + * + * When successful, it returns the pte pointer for addr, with its page tab= le + * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurre= nt + * modification by software, with a pointer to that spinlock in ptlp (in s= ome + * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in tabl= e's + * struct page). pte_unmap_unlock(pte, ptl) to unlock and unmap afterward= s. + * + * But it is unsuccessful, returning NULL with *ptlp unchanged, if there i= s no + * page table at *pmd: if, for example, the page table has just been remov= ed, + * or replaced by the huge pmd of a THP. (When successful, *pmd is rechec= ked + * after acquiring the ptlock, and retried internally if it changed: so th= at a + * page table can be safely removed or replaced by THP while holding its l= ock.) + * + * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() a= bove, + * just returns the pte pointer for addr, its page table kmapped if necess= ary; + * or NULL if there is no page table at *pmd. It does not attempt to lock= the + * page table, so cannot normally be used when the page table is to be upd= ated, + * or when entries read must be stable. But it does take rcu_read_lock():= so + * that even when page table is racily removed, it remains a valid though = empty + * and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unloc= k()s + * afterwards. + * + * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_m= ap(); + * but when successful, it also outputs a pointer to the spinlock in ptlp = - as + * pte_offset_map_lock() does, but in this case without locking it. 
This = helps + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that = time + * act on a changed *pmd: pte_offset_map_nolock() provides the correct spi= nlock + * pointer for the page table that it returns. In principle, the caller s= hould + * recheck *pmd once the lock is taken; in practice, no callsite needs tha= t - + * either the mmap_lock for write, or pte_same() check on contents, is eno= ugh. + * + * Note that free_pgtables(), used after unmapping detached vmas, or when + * exiting the whole mm, does not take page table lock before freeing a pa= ge + * table, and may not use RCU at all: "outsiders" like khugepaged should a= void + * pte_offset_map() and co once the vma is detached from mm or mm_users is= zero. + */ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, spinlock_t **ptlp) { --=20 2.35.3