From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B179C2BA4C for ; Wed, 26 Jan 2022 09:57:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239450AbiAZJ5u (ORCPT ); Wed, 26 Jan 2022 04:57:50 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:25464 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239492AbiAZJ5q (ORCPT ); Wed, 26 Jan 2022 04:57:46 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191065; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=NAAIIFATEqd80W11R9L9MXwM4GCrPUNUZiPXyoAGr5M=; b=jMKqRpRmJE0M0ohH7TnybmXwLZcmtoGC/wCpoIK4wczvUXFsRXTbz4VMUTuuDLM7XzIyAd S2b4ZmR+J/1MhypSmVKL8nLjes0JMciMmc1uj9VYsgqz9cbS1WyUHiv+XHxL7dmNac1mcw C+2D06Ki61gO3jPv+2+hG4wBCsUtFA4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-672-B2JjIRltNDiDbXZcfa4JDQ-1; Wed, 26 Jan 2022 04:57:42 -0500 X-MC-Unique: B2JjIRltNDiDbXZcfa4JDQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id E5DE2100C62A; Wed, 26 Jan 2022 09:57:39 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id B3BA21F2FB; Wed, 26 Jan 2022 09:56:57 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand , Nadav Amit Subject: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache Date: Wed, 26 Jan 2022 10:55:49 +0100 Message-Id: <20220126095557.32392-2-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Liang Zhang reported [1] that the current COW logic in do_wp_page() is sub-optimal when it comes to swap+read fault+write fault of anonymous pages that have a single user, visible via a performance degradation in the redis benchmark. Something similar was previously reported [2] by Nadav with a simple reproducer. Let's optimize for pages that have been added to the swapcache but only have an exclusive owner. Try removing the swapcache reference if there is hope that we're the exclusive user. We will fail removing the swapcache reference in two scenarios: (1) There are additional swap entries referencing the page: copying instead of reusing is the right thing to do. (2) The page is under writeback: theoretically we might be able to reuse in some cases, however, we cannot remove the additional reference and will have to copy. Further, we might have additional references from the LRU pagevecs, which will force us to copy instead of being able to reuse. We'll try handling such references for some scenarios next. Concurrent writeback cannot be handled easily and we'll always have to copy. While at it, remove the superfluous page_mapcount() check: it's implicitly covered by the page_count() for ordinary anon pages. [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com Reported-by: Liang Zhang Reported-by: Nadav Amit Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Reviewed-by: Matthew Wilcox (Oracle) --- mm/memory.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index c125c4969913..bcd3b7c50891 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3291,19 +3291,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (PageAnon(vmf->page)) { struct page *page =3D vmf->page; =20 - /* PageKsm() doesn't necessarily raise the page refcount */ - if (PageKsm(page) || page_count(page) !=3D 1) + /* + * We have to verify under page lock: these early checks are + * just an optimization to avoid locking the page and freeing + * the swapcache if there is little hope that we can reuse. + * + * PageKsm() doesn't necessarily raise the page refcount. + */ + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page)) goto copy; if (!trylock_page(page)) goto copy; - if (PageKsm(page) || page_mapcount(page) !=3D 1 || page_count(page) !=3D= 1) { + if (PageSwapCache(page)) + try_to_free_swap(page); + if (PageKsm(page) || page_count(page) !=3D 1) { unlock_page(page); goto copy; } /* - * Ok, we've got the only map reference, and the only - * page count reference, and the page is locked, - * it's dark out, and we're wearing sunglasses. Hit it. + * Ok, we've got the only page reference from our mapping + * and the page is locked, it's dark out, and we're wearing + * sunglasses. Hit it. */ unlock_page(page); wp_page_reuse(vmf); --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80386C2BA4C for ; Wed, 26 Jan 2022 09:58:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239478AbiAZJ61 (ORCPT ); Wed, 26 Jan 2022 04:58:27 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:23140 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239456AbiAZJ60 (ORCPT ); Wed, 26 Jan 2022 04:58:26 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191106; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fIx2OqkV5g2J0YsQllaAKYz2gQUQCnh10CIndxq7uoU=; b=CHzekelLAM/jKHnnLo6LmUOwdeP2b2VU4AatQ50CDhbNLZeIoQ/Nc8iGbBkXj6J3Rb/Srn Tj5WxTxQ8y0l7pgbZWZ5TWl615HRtvFy9gUrUi02w2ERUCs6DRddFZB4PsSQXKbRA3v8G7 9x1lzZ+EOajt9O0aG72ZW+NyBmarIFM= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-422-6fqFOokFNRunatfjIhy5Bw-1; Wed, 26 Jan 2022 04:58:22 -0500 X-MC-Unique: 6fqFOokFNRunatfjIhy5Bw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id D6E481091DA5; Wed, 26 Jan 2022 09:58:19 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 684AE1F2E8; Wed, 26 Jan 2022 09:57:40 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs Date: Wed, 26 Jan 2022 10:55:50 +0100 Message-Id: <20220126095557.32392-3-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" For example, if a page just got swapped in via a read fault, the LRU pagevecs might still hold a reference to the page. If we trigger a write fault on such a page, the additional reference from the LRU pagevecs will prohibit reusing the page. Let's conditionally drain the local LRU pagevecs when we stumble over a !PageLRU() page. We cannot easily drain remote LRU pagevecs and it might not be desirable performance-wise. Consequently, this will only avoid copying in some cases. We cannot easily handle the following cases and we will always have to copy: (1) The page is referenced in the LRU pagevecs of other CPUs. We really would have to drain the LRU pagevecs of all CPUs -- most probably copying is much cheaper. (2) The page is already PageLRU() but is getting moved between LRU lists, for example, for activation (e.g., mark_page_accessed()), deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to drain mostly unconditionally, which might be bad performance-wise. Most probably this won't happen too often in practice. Note that there are other reasons why an anon page might temporarily not be PageLRU(): for example, compaction and migration have to isolate LRU pages from the LRU lists first (isolate_lru_page()), moving them to temporary local lists and clearing PageLRU() and holding an additional reference on the page. In that case, we'll always copy. This change seems to be fairly effective with the reproducer [1] shared by Nadav, as long as writeback is done synchronously, for example, using zram. However, with asynchronous writeback, we'll usually fail to free the swapcache because the page is still under writeback: something we cannot easily optimize for, and maybe it's not really relevant in practice. [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com Signed-off-by: David Hildenbrand --- mm/memory.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index bcd3b7c50891..61d67ceef734 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * * PageKsm() doesn't necessarily raise the page refcount. */ - if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page)) + if (PageKsm(page)) + goto copy; + if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page)) + goto copy; + if (!PageLRU(page)) + /* + * Note: We cannot easily detect+handle references from + * remote LRU pagevecs or references to PageLRU() pages. + */ + lru_add_drain(); + if (page_count(page) > 1 + PageSwapCache(page)) goto copy; if (!trylock_page(page)) goto copy; --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E400C2BA4C for ; Wed, 26 Jan 2022 09:59:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239499AbiAZJ7M (ORCPT ); Wed, 26 Jan 2022 04:59:12 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:31260 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239488AbiAZJ7J (ORCPT ); Wed, 26 Jan 2022 04:59:09 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191149; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rwvlrvac61lF3gQ6F6uZri0xKYAPflsQBOJgJdyNeH0=; b=UTXMtdGSs/8bh4nKI94w0xAW1AB8363Ein9qtOysvfyfQEeAeCB+KsAeHO7hpMfqAqyGAX gfKBHQ4TL8ZlYksLbhSJHJ8vCxkFQxkGscUWjU4Z4il7ZsIl9MgeBE6uMeZiWYZqEhWCEY 2YHNWK/LYkbXDLUofk0yk9HjigPGEw4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-416-_NkElsNjPvOI0gEAXXjxxg-1; Wed, 26 Jan 2022 04:59:05 -0500 X-MC-Unique: _NkElsNjPvOI0gEAXXjxxg-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 2F163100C663; Wed, 26 Jan 2022 09:59:03 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3E2211F2FD; Wed, 26 Jan 2022 09:58:20 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() Date: Wed, 26 Jan 2022 10:55:51 +0100 Message-Id: <20220126095557.32392-4-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Let's make it clearer that KSM might only have to copy a page in case we have a page in the swapcache, not if we allocated a fresh page and bypassed the swapcache. While at it, add a comment why this is usually necessary and merge the two swapcache conditions. Signed-off-by: David Hildenbrand --- mm/memory.c | 38 +++++++++++++++++++++++--------------- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 61d67ceef734..ab3153252cfe 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3617,21 +3617,29 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_release; } =20 - /* - * Make sure try_to_free_swap or reuse_swap_page or swapoff did not - * release the swapcache from under us. The page pin, and pte_same - * test below, are not enough to exclude that. Even if it is still - * swapcache, we need to check that the page's swap has not changed. - */ - if (unlikely((!PageSwapCache(page) || - page_private(page) !=3D entry.val)) && swapcache) - goto out_page; - - page =3D ksm_might_need_to_copy(page, vma, vmf->address); - if (unlikely(!page)) { - ret =3D VM_FAULT_OOM; - page =3D swapcache; - goto out_page; + if (swapcache) { + /* + * Make sure try_to_free_swap or reuse_swap_page or swapoff did + * not release the swapcache from under us. The page pin, and + * pte_same test below, are not enough to exclude that. Even if + * it is still swapcache, we need to check that the page's swap + * has not changed. + */ + if (unlikely(!PageSwapCache(page) || + page_private(page) !=3D entry.val)) + goto out_page; + + /* + * KSM sometimes has to copy on read faults, for example, if + * page->index of !PageKSM() pages would be nonlinear inside the + * anon VMA -- PageKSM() is lost on actual swapout. + */ + page =3D ksm_might_need_to_copy(page, vma, vmf->address); + if (unlikely(!page)) { + ret =3D VM_FAULT_OOM; + page =3D swapcache; + goto out_page; + } } =20 cgroup_throttle_swaprate(page, GFP_KERNEL); --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5A01BC28CF5 for ; Wed, 26 Jan 2022 10:00:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239506AbiAZKAH (ORCPT ); Wed, 26 Jan 2022 05:00:07 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:39241 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239497AbiAZKAF (ORCPT ); Wed, 26 Jan 2022 05:00:05 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191205; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8lZCUoGF7vMZxd2uBv8vuAjoIcLxZX5AOgXwi6hd1oU=; b=GMm8vbJC0EmQPTqdRU4trHaxAYPXV2x+8ToJFMVIm7QnmPEbdDga321YDc7g/2q6VEXskT /eAl8P6BE+slS/o5FYXhFjcAsZC82HKeuJEazjV5JFb14DxJiYj3DIoAWe81c2HtRQcfSf vqn6d8Qz35diB0CkUVk8Xoh8R5a/Cxs= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-49-LrhBAHJkN12DwCXYCzWIZA-1; Wed, 26 Jan 2022 05:00:03 -0500 X-MC-Unique: LrhBAHJkN12DwCXYCzWIZA-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id B88B286A061; Wed, 26 Jan 2022 10:00:00 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8E0E31F2E8; Wed, 26 Jan 2022 09:59:03 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 4/9] mm: streamline COW logic in do_swap_page() Date: Wed, 26 Jan 2022 10:55:52 +0100 Message-Id: <20220126095557.32392-5-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Currently we have a different COW logic when: * triggering a read-fault to swapin first and then trigger a write-fault -> do_swap_page() + do_wp_page() * triggering a write-fault to swapin -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page() The COW logic in do_swap_page() is different than our reuse logic in do_wp_page(). The COW logic in do_wp_page() -- page_count() =3D=3D 1 -- ma= kes currently sure that we certainly don't have a remaining reference, e.g., via GUP, on the target page we want to reuse: if there is any unexpected reference, we have to copy to avoid information leaks. As do_swap_page() behaves differently, in environments with swap enabled we can currently have an unintended information leak from the parent to the child, similar as known from CVE-2020-29374: 1. Parent writes to anonymous page -> Page is mapped writable and modified 2. Page is swapped out -> Page is unmapped and replaced by swap entry 3. fork() -> Swap entries are copied to child 4. Child pins page R/O -> Page is mapped R/O into child 5. Child unmaps page -> Child still holds GUP reference 6. Parent writes to page -> Page is reused in do_swap_page() -> Child can observe changes Exchanging 2. and 3. should have the same effect. Let's apply the same COW logic as in do_wp_page(), conditionally trying to remove the page from the swapcache after freeing the swap entry, however, before actually mapping our page. We can change the order now that we use try_to_free_swap(), which doesn't care about the mapcount, instead of reuse_swap_page(). To handle references from the LRU pagevecs, conditionally drain the local LRU pagevecs when required, however, don't consider the page_count() when deciding whether to drain to keep it simple for now. Signed-off-by: David Hildenbrand --- mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 43 insertions(+), 12 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index ab3153252cfe..ba23d13b8410 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3499,6 +3499,25 @@ static vm_fault_t remove_device_exclusive_entry(stru= ct vm_fault *vmf) return 0; } =20 +static inline bool should_try_to_free_swap(struct page *page, + struct vm_area_struct *vma, + unsigned int fault_flags) +{ + if (!PageSwapCache(page)) + return false; + if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) || + PageMlocked(page)) + return true; + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * owner. Try freeing the swapcache to get rid of the swapcache + * reference in case it's likely that we will succeed. + */ + return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) && + page_count(page) =3D=3D 2; +} + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -3640,6 +3659,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page =3D swapcache; goto out_page; } + + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * owner. Try removing the extra reference from the local LRU + * pagevecs if required. + */ + if ((vmf->flags & FAULT_FLAG_WRITE) && page =3D=3D swapcache && + !PageKsm(page) && !PageLRU(page)) + lru_add_drain(); } =20 cgroup_throttle_swaprate(page, GFP_KERNEL); @@ -3658,19 +3687,25 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } =20 /* - * The page isn't present yet, go ahead with the fault. - * - * Be careful about the sequence of operations here. - * To get its accounting right, reuse_swap_page() must be called - * while the page is counted on swap but not yet in mapcount i.e. - * before page_add_anon_rmap() and swap_free(); try_to_free_swap() - * must be called after the swap_free(), or it will never succeed. + * Remove the swap entry and conditionally try to free up the swapcache. + * We're already holding a reference on the page but haven't mapped it + * yet. */ + swap_free(entry); + if (should_try_to_free_swap(page, vma, vmf->flags)) + try_to_free_swap(page); =20 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS); pte =3D mk_pte(page, vma->vm_page_prot); - if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) { + + /* + * Same logic as in do_wp_page(); however, optimize for fresh pages + * that are certainly not shared because we just allocated them without + * exposing them to the swapcache. + */ + if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && + (page !=3D swapcache || page_count(page) =3D=3D 1)) { pte =3D maybe_mkwrite(pte_mkdirty(pte), vma); vmf->flags &=3D ~FAULT_FLAG_WRITE; ret |=3D VM_FAULT_WRITE; @@ -3696,10 +3731,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); =20 - swap_free(entry); - if (mem_cgroup_swap_full(page) || - (vma->vm_flags & VM_LOCKED) || PageMlocked(page)) - try_to_free_swap(page); unlock_page(page); if (page !=3D swapcache && swapcache) { /* --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 610CAC2BA4C for ; Wed, 26 Jan 2022 10:00:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239525AbiAZKAu (ORCPT ); Wed, 26 Jan 2022 05:00:50 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:22092 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233121AbiAZKAt (ORCPT ); Wed, 26 Jan 2022 05:00:49 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191249; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ZADr9hVxfhKENIr/VRsnRFwtDwaPlig2N4uKOY4hLWc=; b=Ao5saezab47NldOsG1OkzQ9dQzNbdlM4Y6LFfWGKE96ZuG39cuOCMS7UOMUrNg1OpKxzaS 4+x+1Spn7ZRGBoj33XRDx3ASRE6K8e8PqZ6hrJUWwrrTxZb5GmSgzFVkeYq+XmVfmzQXdk waoUtxaunqLu1MnYta5oM0hRRlfyjcA= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-589-doMA_xzJMmKguE09hrMqCQ-1; Wed, 26 Jan 2022 05:00:45 -0500 X-MC-Unique: doMA_xzJMmKguE09hrMqCQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 9B87486A060; Wed, 26 Jan 2022 10:00:42 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 5324D1F2E8; Wed, 26 Jan 2022 10:00:01 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() Date: Wed, 26 Jan 2022 10:55:53 +0100 Message-Id: <20220126095557.32392-6-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" We currently have a different COW logic for anon THP than we have for ordinary anon pages in do_wp_page(): the effect is that the issue reported in CVE-2020-29374 is currently still possible for anon THP: an unintended information leak from the parent to the child. Let's apply the same logic (page_count() =3D=3D 1), with similar optimizations to remove additional references first as we really want to avoid PTE-mapping the THP and copying individual pages best we can. If we end up with a page that has page_count() !=3D 1, we'll have to PTE-map the THP and fallback to do_wp_page(), which will always copy the page. Note that KSM does not apply to THP. I. Interaction with the swapcache and writeback While a THP is in the swapcache, the swapcache holds one reference on each subpage of the THP. So with PageSwapCache() set, we expect as many additional references as we have subpages. If we manage to remove the THP from the swapcache, all these references will be gone. Usually, a THP is not split when entered into the swapcache and stays a compound page. However, try_to_unmap() will PTE-map the THP and use PTE swap entries. There are no PMD swap entries for that purpose, consequently, we always only swapin subpages into PTEs. Removing a page from the swapcache can fail either when there are remaining swap entries (in which case COW is the right thing to do) or if the page is currently under writeback. Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be possible only in corner cases, for example, if try_to_unmap() failed after adding the page to the swapcache. However, it's comparatively easy to handle. As we have to fully unmap a THP before starting writeback, and swapin is always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the swapcache that is under writeback. This should at least leave writeback out of the picture. II. Interaction with GUP references Having a R/O PMD-mapped THP with GUP references (i.e., R/O references) will result in PTE-mapping the THP on a write fault. Similar to ordinary anon pages, do_wp_page() will have to copy sub-pages and result in a disconnect between the GUP references and the pages actually mapped into the page tables. To improve the situation in the future, we'll need additional handling to mark anonymous pages as definitely exclusive to a single process, only allow GUP pins on exclusive anon pages, and disallow sharing of exclusive anon pages with GUP pins e.g., during fork(). III. Interaction with references from LRU pagevecs Similar to ordinary anon pages, we can have LRU pagevecs referencing our THP. Reliably removing such references requires draining LRU pagevecs on all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can sleep. For now, similar do do_wp_page(), let's conditionally drain the local LRU pagevecs only if we detect !PageLRU(). IV. Interaction with speculative/temporary references Similar to ordinary anon pages, other speculative/temporary references on the THP, for example, from the pagecache or page migration code, will disallow exclusive reuse of the page. We'll have to PTE-map the THP. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..b6ba88a98266 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) struct page *page; unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; pmd_t orig_pmd =3D vmf->orig_pmd; + int swapcache_refs =3D 0; =20 vmf->ptl =3D pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page =3D pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); =20 - /* Lock page for reuse_swap_page() */ if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) } =20 /* - * We can only reuse the page if nobody else maps the huge page or it's - * part. + * See do_wp_page(): we can only map the page writable if there are + * no additional references. */ - if (reuse_swap_page(page)) { + if (PageSwapCache(page)) + swapcache_refs =3D thp_nr_pages(page); + if (page_count(page) > 1 + swapcache_refs + !PageLRU(page)) + goto unlock_fallback; + if (!PageLRU(page)) + lru_add_drain(); + if (page_count(page) > 1 + swapcache_refs) + goto unlock_fallback; + if (swapcache_refs) + try_to_free_swap(page); + if (page_count(page) =3D=3D 1) { pmd_t entry; entry =3D pmd_mkyoung(orig_pmd); entry =3D maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); @@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) return VM_FAULT_WRITE; } =20 +unlock_fallback: unlock_page(page); spin_unlock(vmf->ptl); fallback: --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4480FC63682 for ; Wed, 26 Jan 2022 10:00:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239541AbiAZKA6 (ORCPT ); Wed, 26 Jan 2022 05:00:58 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:20522 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233121AbiAZKA4 (ORCPT ); Wed, 26 Jan 2022 05:00:56 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191256; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=uz7WrEefhkRzVFGmcsfP6Zp/Ap9shXYEfGhZft5IAs0=; b=YqvlLzUfQHsqzFehari2wAbKLXh9PLhDQPwTNpJ6ZQTdMhYcVPBUiAy4ek+qYzE33mdBZJ PjjeXqtAhUjcp9Lsb0TT57ICNWxf7O0dg1wQhlERCx6LtZKzNvrFY4daqSVKk2hnhw38Wm DtaEoJf1OoCyuTvBKkvSP8WC7K9WHTU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-455-Z5P5QTJXMmC-UmZSLEVHQg-1; Wed, 26 Jan 2022 05:00:52 -0500 X-MC-Unique: Z5P5QTJXMmC-UmZSLEVHQg-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 914831006AA6; Wed, 26 Jan 2022 10:00:49 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 199521F2F3; Wed, 26 Jan 2022 10:00:42 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage Date: Wed, 26 Jan 2022 10:55:54 +0100 Message-Id: <20220126095557.32392-7-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" reuse_swap_page() currently indicates if we can write to an anon page without COW. A COW is required if the page is shared by multiple processes (either already mapped or via swap entries) or if there is concurrent writeback that cannot tolerate concurrent page modifications. reuse_swap_page() doesn't check for pending references from other processes that already unmapped the page, however, is_refcount_suitable() essentially does the same thing in the context of khugepaged. khugepaged is the last remaining user of reuse_swap_page() and we want to remove that function. In the context of khugepaged, we are not actually going to write to the page and we don't really care about other processes mapping the page: for example, without swap, we don't care about shared pages at all. The current logic seems to be: * Writable: -> Not shared, but might be in the swapcache. Nobody can fault it in from the swapcache as there are no other swap entries. * Readable and not in the swapcache: Might be shared (but nobody can fault it in from the swapcache). * Readable and in the swapcache: Might be shared and someone might be able to fault it in from the swapcache. Make sure we're the exclusive owner via reuse_swap_page(). Having to guess due to lack of comments and documentation, the current logic really only wants to make sure that a page that might be shared cannot be faulted in from the swapcache while khugepaged is active. It's hard to guess why that is that case and if it's really still required, but let's try keeping that logic unmodified. Instead of relying on reuse_swap_page(), let's unconditionally try_to_free_swap(), special casing PageKsm(). try_to_free_swap() will fail if there are still swap entries targeting the page or if the page is under writeback. After a successful try_to_free_swap() that page cannot be readded to the swapcache because we're keeping the page locked and removed from the LRU until we actually perform the copy. So once we succeeded removing a page from the swapcache, it cannot be re-added until we're done copying. Add a comment stating that. Signed-off-by: David Hildenbrand --- mm/khugepaged.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 35f14d0a00a6..bc0ff598e98f 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_are= a_struct *vma, goto out; } if (!pte_write(pteval) && PageSwapCache(page) && - !reuse_swap_page(page)) { + (PageKsm(page) || !try_to_free_swap(page))) { /* - * Page is in the swap cache and cannot be re-used. - * It cannot be collapsed into a THP. + * Possibly shared page cannot be removed from the + * swapache. It cannot be collapsed into a THP. */ unlock_page(page); result =3D SCAN_SWAP_CACHE_PAGE; @@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area= _struct *vma, result =3D SCAN_DEL_PAGE_LRU; goto out; } + + /* + * We're holding the page lock and removed the page from the + * LRU. Once done copying, we'll unlock and readd to the + * LRU via release_pte_page(). If the page is still in the + * swapcache, we're the exclusive owner. Due to the page lock + * the page cannot be added to the swapcache until we're done + * and consequently it cannot be faulted in from the swapcache + * into another process. + */ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page)); --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A21F5C28CF5 for ; Wed, 26 Jan 2022 10:01:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239537AbiAZKBN (ORCPT ); Wed, 26 Jan 2022 05:01:13 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:20761 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233121AbiAZKBL (ORCPT ); Wed, 26 Jan 2022 05:01:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191271; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=KaiTUjRYTOXVgHeAUn6VE+wj9NpgU3CxyxXikV/W8rE=; b=UnSn/G1JB5WpDhxS1MoUOlOGP2wY/C+JGqT4xAvXs0yO8j72cejYU+mJjYLPi7Ffdd36fQ VkwWgBKf4tCdbWxCRavTK0fVbv0cXFMxBP2hy7suGs/X64zRRyEkpKSw5K7M9CJui6153j EcfeqSpjCR5vuG9FH9Y8ZofKZ8cF9hU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-214-uqqGwpPKPKy7Aa-6tyVRdw-1; Wed, 26 Jan 2022 05:00:59 -0500 X-MC-Unique: uqqGwpPKPKy7Aa-6tyVRdw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id AB08D51090; Wed, 26 Jan 2022 10:00:56 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 2B4391F2F8; Wed, 26 Jan 2022 10:00:49 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() Date: Wed, 26 Jan 2022 10:55:55 +0100 Message-Id: <20220126095557.32392-8-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" All users are gone, let's remove it. We'll let SWP_STABLE_WRITES stick around for now, as it might come in handy in the near future. Signed-off-by: David Hildenbrand --- include/linux/swap.h | 4 -- mm/swapfile.c | 104 ------------------------------------------- 2 files changed, 108 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1d38d9475c4d..b546e4bd5c5a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -514,7 +514,6 @@ extern int __swp_swapcount(swp_entry_t entry); extern int swp_swapcount(swp_entry_t entry); extern struct swap_info_struct *page_swap_info(struct page *); extern struct swap_info_struct *swp_swap_info(swp_entry_t entry); -extern bool reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; extern int init_swap_address_space(unsigned int type, unsigned long nr_pag= es); @@ -680,9 +679,6 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } =20 -#define reuse_swap_page(page) \ - (page_trans_huge_mapcount(page) =3D=3D 1) - static inline int try_to_free_swap(struct page *page) { return 0; diff --git a/mm/swapfile.c b/mm/swapfile.c index bf0df7aa7158..a5183315dc58 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1167,16 +1167,6 @@ static struct swap_info_struct *_swap_info_get(swp_e= ntry_t entry) return NULL; } =20 -static struct swap_info_struct *swap_info_get(swp_entry_t entry) -{ - struct swap_info_struct *p; - - p =3D _swap_info_get(entry); - if (p) - spin_lock(&p->lock); - return p; -} - static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry, struct swap_info_struct *q) { @@ -1601,100 +1591,6 @@ static bool page_swapped(struct page *page) return false; } =20 -static int page_trans_huge_map_swapcount(struct page *page, - int *total_swapcount) -{ - int i, map_swapcount, _total_swapcount; - unsigned long offset =3D 0; - struct swap_info_struct *si; - struct swap_cluster_info *ci =3D NULL; - unsigned char *map =3D NULL; - int swapcount =3D 0; - - /* hugetlbfs shouldn't call it */ - VM_BUG_ON_PAGE(PageHuge(page), page); - - if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) { - if (PageSwapCache(page)) - swapcount =3D page_swapcount(page); - if (total_swapcount) - *total_swapcount =3D swapcount; - return swapcount + page_trans_huge_mapcount(page); - } - - page =3D compound_head(page); - - _total_swapcount =3D map_swapcount =3D 0; - if (PageSwapCache(page)) { - swp_entry_t entry; - - entry.val =3D page_private(page); - si =3D _swap_info_get(entry); - if (si) { - map =3D si->swap_map; - offset =3D swp_offset(entry); - } - } - if (map) - ci =3D lock_cluster(si, offset); - for (i =3D 0; i < HPAGE_PMD_NR; i++) { - int mapcount =3D atomic_read(&page[i]._mapcount) + 1; - if (map) { - swapcount =3D swap_count(map[offset + i]); - _total_swapcount +=3D swapcount; - } - map_swapcount =3D max(map_swapcount, mapcount + swapcount); - } - unlock_cluster(ci); - - if (PageDoubleMap(page)) - map_swapcount -=3D 1; - - if (total_swapcount) - *total_swapcount =3D _total_swapcount; - - return map_swapcount + compound_mapcount(page); -} - -/* - * We can write to an anon page without COW if there are no other referenc= es - * to it. And as a side-effect, free up its swap: because the old content - * on disk will never be read, and seeking back there to write new content - * later would only waste time away from clustering. - */ -bool reuse_swap_page(struct page *page) -{ - int count, total_swapcount; - - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (unlikely(PageKsm(page))) - return false; - count =3D page_trans_huge_map_swapcount(page, &total_swapcount); - if (count =3D=3D 1 && PageSwapCache(page) && - (likely(!PageTransCompound(page)) || - /* The remaining swap count will be freed soon */ - total_swapcount =3D=3D page_swapcount(page))) { - if (!PageWriteback(page)) { - page =3D compound_head(page); - delete_from_swap_cache(page); - SetPageDirty(page); - } else { - swp_entry_t entry; - struct swap_info_struct *p; - - entry.val =3D page_private(page); - p =3D swap_info_get(entry); - if (p->flags & SWP_STABLE_WRITES) { - spin_unlock(&p->lock); - return false; - } - spin_unlock(&p->lock); - } - } - - return count <=3D 1; -} - /* * If swap is getting full, or if there are no more mappings of this page, * then try_to_free_swap is called to free its swap space. --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95311C28CF5 for ; Wed, 26 Jan 2022 10:01:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239546AbiAZKBR (ORCPT ); Wed, 26 Jan 2022 05:01:17 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:33748 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233121AbiAZKBO (ORCPT ); Wed, 26 Jan 2022 05:01:14 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191273; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=TtFF3IJUOAziUOkk7Dy8cWYxEVf/NaN8ObVsm7wL2Kc=; b=Wu3+V49BpiFo2g1/HJ2b9jY3yVuLp46afO2I6qb0jkvNXilQMBaKbbsSG5j4bNv0uJ4Djs aPFyO+rUBhR0G60JYedE5o4IbOrIhbit4xs7xaY+2eBApZfuZz6KRX3IhPqC4q7kPPi+Yv V+K/CC0sI8Kj0bY3sYlashv7Rvk1K8A= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-452-DT7c-FHEOW-IwDRh5--odA-1; Wed, 26 Jan 2022 05:01:07 -0500 X-MC-Unique: DT7c-FHEOW-IwDRh5--odA-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 1F173190B2A0; Wed, 26 Jan 2022 10:01:04 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 00BCCE2CF; Wed, 26 Jan 2022 10:00:56 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() Date: Wed, 26 Jan 2022 10:55:56 +0100 Message-Id: <20220126095557.32392-9-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" All users are gone, let's remove it. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 5 ----- mm/huge_memory.c | 48 ---------------------------------------------- 2 files changed, 53 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e1a84b1e6787..df6b34d66a9b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -820,16 +820,11 @@ static inline int page_mapcount(struct page *page) =20 #ifdef CONFIG_TRANSPARENT_HUGEPAGE int total_mapcount(struct page *page); -int page_trans_huge_mapcount(struct page *page); #else static inline int total_mapcount(struct page *page) { return page_mapcount(page); } -static inline int page_trans_huge_mapcount(struct page *page) -{ - return page_mapcount(page); -} #endif =20 static inline struct page *virt_to_head_page(const void *x) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b6ba88a98266..863c933b3b1e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2529,54 +2529,6 @@ int total_mapcount(struct page *page) return ret; } =20 -/* - * This calculates accurately how many mappings a transparent hugepage - * has (unlike page_mapcount() which isn't fully accurate). This full - * accuracy is primarily needed to know if copy-on-write faults can - * reuse the page and change the mapping to read-write instead of - * copying them. At the same time this returns the total_mapcount too. - * - * The function returns the highest mapcount any one of the subpages - * has. If the return value is one, even if different processes are - * mapping different subpages of the transparent hugepage, they can - * all reuse it, because each process is reusing a different subpage. - * - * The total_mapcount is instead counting all virtual mappings of the - * subpages. If the total_mapcount is equal to "one", it tells the - * caller all mappings belong to the same "mm" and in turn the - * anon_vma of the transparent hugepage can become the vma->anon_vma - * local one as no other process may be mapping any of the subpages. - * - * It would be more accurate to replace page_mapcount() with - * page_trans_huge_mapcount(), however we only use - * page_trans_huge_mapcount() in the copy-on-write faults where we - * need full accuracy to avoid breaking page pinning, because - * page_trans_huge_mapcount() is slower than page_mapcount(). - */ -int page_trans_huge_mapcount(struct page *page) -{ - int i, ret; - - /* hugetlbfs shouldn't call it */ - VM_BUG_ON_PAGE(PageHuge(page), page); - - if (likely(!PageTransCompound(page))) - return atomic_read(&page->_mapcount) + 1; - - page =3D compound_head(page); - - ret =3D 0; - for (i =3D 0; i < thp_nr_pages(page); i++) { - int mapcount =3D atomic_read(&page[i]._mapcount) + 1; - ret =3D max(ret, mapcount); - } - - if (PageDoubleMap(page)) - ret -=3D 1; - - return ret + compound_mapcount(page); -} - /* Racy check whether the huge page can be split */ bool can_split_huge_page(struct page *page, int *pextra_pins) { --=20 2.34.1 From nobody Tue Jun 30 03:36:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E0E5DC28CF5 for ; Wed, 26 Jan 2022 10:01:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239557AbiAZKBV (ORCPT ); Wed, 26 Jan 2022 05:01:21 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:38391 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239545AbiAZKBR (ORCPT ); Wed, 26 Jan 2022 05:01:17 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643191276; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=WDmpWapjP9Wkuqng2NIRJk/6UxFibcIxNF2eScAVq98=; b=MBHl5fsjl8hdT8EsqBKz9X0XDBVir4mTVYQbpAXp5u931yyyOVZQSVvgpqkWMC/DgZBwIh yvHpYbzBHHjFewFj62OqPyMzetTMNyy7XaM5fjz7VA1osg1tzlP+vG6W6XzaF9uoYKCtTo btTZ9mQUJq7UTtbnsqx+oChGs/aPS7g= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-458-Vz9PWdCbMzKfu4dycLa_hQ-1; Wed, 26 Jan 2022 05:01:13 -0500 X-MC-Unique: Vz9PWdCbMzKfu4dycLa_hQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id A63A9190B2AE; Wed, 26 Jan 2022 10:01:10 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.194.241]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7BF5D1F2F3; Wed, 26 Jan 2022 10:01:04 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() Date: Wed, 26 Jan 2022 10:55:57 +0100 Message-Id: <20220126095557.32392-10-david@redhat.com> In-Reply-To: <20220126095557.32392-1-david@redhat.com> References: <20220126095557.32392-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Let's remove the stale logic that was required for reuse_swap_page(). Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 32 +------------------------------- 1 file changed, 1 insertion(+), 31 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 863c933b3b1e..5cc438f92548 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2158,8 +2158,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd= _t *pmd, { spinlock_t *ptl; struct mmu_notifier_range range; - bool do_unlock_page =3D false; - pmd_t _pmd; =20 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address & HPAGE_PMD_MASK, @@ -2178,35 +2176,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pm= d_t *pmd, goto out; } =20 -repeat: if (pmd_trans_huge(*pmd)) { - if (!page) { + if (!page) page =3D pmd_page(*pmd); - /* - * An anonymous page must be locked, to ensure that a - * concurrent reuse_swap_page() sees stable mapcount; - * but reuse_swap_page() is not used on shmem or file, - * and page lock must not be taken when zap_pmd_range() - * calls __split_huge_pmd() while i_mmap_lock is held. - */ - if (PageAnon(page)) { - if (unlikely(!trylock_page(page))) { - get_page(page); - _pmd =3D *pmd; - spin_unlock(ptl); - lock_page(page); - spin_lock(ptl); - if (unlikely(!pmd_same(*pmd, _pmd))) { - unlock_page(page); - put_page(page); - page =3D NULL; - goto repeat; - } - put_page(page); - } - do_unlock_page =3D true; - } - } if (PageMlocked(page)) clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) @@ -2214,8 +2186,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd= _t *pmd, __split_huge_pmd_locked(vma, pmd, range.start, freeze); out: spin_unlock(ptl); - if (do_unlock_page) - unlock_page(page); /* * No need to double call mmu_notifier->invalidate_range() callback. * They are 3 cases to consider inside __split_huge_pmd_locked(): --=20 2.34.1