From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call
Date: Fri, 29 Apr 2022 21:35:35 +0800
Message-Id: <20220429133552.33768-2-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

The paired pte_unmap() call is missing before sme_populate_pgd()
returns. Although this code only runs under CONFIG_X86_64, the paired
pte_unmap() call should still be added for the correctness of the code
semantics.

Signed-off-by: Qi Zheng
---
 arch/x86/mm/mem_encrypt_identity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index b43bc24d2bb6..6d323230320a 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -190,6 +190,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 	pte = pte_offset_map(pmd, ppd->vaddr);
 	if (pte_none(*pte))
 		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
+	pte_unmap(pte);
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns
Date: Fri, 29 Apr 2022 21:35:36 +0800
Message-Id: <20220429133552.33768-3-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

In percpu_ref_call_confirm_rcu(), we call wake_up_all() before calling
percpu_ref_put(), which causes the value of the percpu_ref to be
unstable when percpu_ref_switch_to_atomic_sync() returns.

CPU0                                    CPU1

percpu_ref_switch_to_atomic_sync(&ref)
--> percpu_ref_switch_to_atomic(&ref)
    --> percpu_ref_get(ref);
        /* put after confirmation */
        call_rcu(&ref->data->rcu,
                 percpu_ref_switch_to_atomic_rcu);

                                        percpu_ref_switch_to_atomic_rcu
                                        --> percpu_ref_call_confirm_rcu
                                            --> data->confirm_switch = NULL;
                                                wake_up_all(&percpu_ref_switch_waitq);

/* here waiting to wake up */
wait_event(percpu_ref_switch_waitq,
           !ref->data->confirm_switch);
                                                (A) percpu_ref_put(ref);
/* The value of &ref is unstable! */
percpu_ref_is_zero(&ref)
                                                (B) percpu_ref_put(ref);

As shown above, assuming that the counts on each cpu add up to 0 before
calling percpu_ref_switch_to_atomic_sync(), we expect that
percpu_ref_is_zero() returns true after the switch to atomic mode. But
it actually returns different values in cases A and B, which is not
what we expect.

There are currently two users of percpu_ref_switch_to_atomic_sync() in
the kernel:

i.  mddev->writes_pending in drivers/md/md.c
ii. q->q_usage_counter in block/blk-pm.c

Both are used as shown above. In the worst case, percpu_ref_is_zero()
may never hold because of case B. While this is unlikely to occur in a
production environment, it is still a problem.

This patch moves percpu_ref_put() out of the RCU handler and calls it
after wait_event(), which makes the ref stable after
percpu_ref_switch_to_atomic_sync() returns. In the example above,
percpu_ref_is_zero() then sees a steady 0 value, which is what we
expect.

Signed-off-by: Qi Zheng
---
 include/linux/percpu-refcount.h |  4 ++-
 lib/percpu-refcount.c           | 56 +++++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..75844939a965 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -98,6 +98,7 @@ struct percpu_ref_data {
 	percpu_ref_func_t *confirm_switch;
 	bool force_atomic:1;
 	bool allow_reinit:1;
+	bool sync:1;
 	struct rcu_head rcu;
 	struct percpu_ref *ref;
 };
@@ -123,7 +124,8 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch);
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync);
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index af9302141bcf..3a8906715e09 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -99,6 +99,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	data->release = release;
 	data->confirm_switch = NULL;
 	data->ref = ref;
+	data->sync = false;
 	ref->data = data;
 	return 0;
 }
@@ -146,21 +147,33 @@ void percpu_ref_exit(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
 
+static inline void percpu_ref_switch_to_atomic_post(struct percpu_ref *ref)
+{
+	struct percpu_ref_data *data = ref->data;
+
+	if (!data->allow_reinit)
+		__percpu_ref_exit(ref);
+
+	/* drop ref from percpu_ref_switch_to_atomic() */
+	percpu_ref_put(ref);
+}
+
 static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref_data *data = container_of(rcu,
 			struct percpu_ref_data, rcu);
 	struct percpu_ref *ref = data->ref;
+	bool need_put = true;
+
+	if (data->sync)
+		need_put = data->sync = false;
 
 	data->confirm_switch(ref);
 	data->confirm_switch = NULL;
 	wake_up_all(&percpu_ref_switch_waitq);
 
-	if (!data->allow_reinit)
-		__percpu_ref_exit(ref);
-
-	/* drop ref from percpu_ref_switch_to_atomic() */
-	percpu_ref_put(ref);
+	if (need_put)
+		percpu_ref_switch_to_atomic_post(ref);
 }
 
 static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
@@ -210,14 +223,19 @@ static void percpu_ref_noop_confirm_switch(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-					  percpu_ref_func_t *confirm_switch)
+					  percpu_ref_func_t *confirm_switch,
+					  bool sync)
 {
 	if (ref->percpu_count_ptr & __PERCPU_REF_ATOMIC) {
 		if (confirm_switch)
 			confirm_switch(ref);
+		if (sync)
+			percpu_ref_get(ref);
 		return;
 	}
 
+	ref->data->sync = sync;
+
 	/* switching from percpu to atomic */
 	ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
 
@@ -232,13 +250,16 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);
 }
 
-static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
+static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref, bool sync)
 {
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
 	int cpu;
 
 	BUG_ON(!percpu_count);
 
+	if (sync)
+		percpu_ref_get(ref);
+
 	if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC))
 		return;
 
@@ -261,7 +282,8 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_mode(struct percpu_ref *ref,
-				     percpu_ref_func_t *confirm_switch)
+				     percpu_ref_func_t *confirm_switch,
+				     bool sync)
 {
 	struct percpu_ref_data *data = ref->data;
 
@@ -276,9 +298,9 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 			    percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
-		__percpu_ref_switch_to_atomic(ref, confirm_switch);
+		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
 	else
-		__percpu_ref_switch_to_percpu(ref);
+		__percpu_ref_switch_to_percpu(ref, sync);
 }
 
 /**
@@ -302,14 +324,15 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 * switching to atomic mode, this function can be called from any context.
 */
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch)
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
-	__percpu_ref_switch_mode(ref, confirm_switch);
+	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -325,8 +348,9 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
 */
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
 {
-	percpu_ref_switch_to_atomic(ref, NULL);
+	percpu_ref_switch_to_atomic(ref, NULL, true);
 	wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+	percpu_ref_switch_to_atomic_post(ref);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
@@ -355,7 +379,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -390,7 +414,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 			  ref->data->release);
 
 	ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
-	__percpu_ref_switch_mode(ref, confirm_kill);
+	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
@@ -470,7 +494,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref
Date: Fri, 29 Apr 2022 21:35:37 +0800
Message-Id: <20220429133552.33768-4-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

Currently, percpu_ref uses the global percpu_ref_switch_lock to
protect the mode switching operation. When multiple percpu_refs
perform mode switching at the same time, this lock may become a
performance bottleneck.

This patch introduces a per-percpu_ref percpu_ref_switch_lock to fix
this situation.

Signed-off-by: Qi Zheng
---
 include/linux/percpu-refcount.h |  2 ++
 lib/percpu-refcount.c           | 30 +++++++++++++++---------------
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 75844939a965..eb8695e578fd 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -110,6 +110,8 @@ struct percpu_ref {
 	 */
 	unsigned long percpu_count_ptr;
 
+	spinlock_t percpu_ref_switch_lock;
+
 	/*
 	 * 'percpu_ref' is often embedded into user structure, and only
 	 * 'percpu_count_ptr' is required in fast path, move other fields
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 3a8906715e09..4336fd1bd77a 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -36,7 +36,6 @@
 
 #define PERCPU_COUNT_BIAS	(1LU << (BITS_PER_LONG - 1))
 
-static DEFINE_SPINLOCK(percpu_ref_switch_lock);
 static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
 
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
@@ -95,6 +94,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 		start_count++;
 
 	atomic_long_set(&data->count, start_count);
+	spin_lock_init(&ref->percpu_ref_switch_lock);
 
 	data->release = release;
 	data->confirm_switch = NULL;
@@ -137,11 +137,11 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	if (!data)
 		return;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	ref->percpu_count_ptr |= atomic_long_read(&ref->data->count) <<
 		__PERCPU_REF_FLAG_BITS;
 	ref->data = NULL;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	kfree(data);
 }
@@ -287,7 +287,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 {
 	struct percpu_ref_data *data = ref->data;
 
-	lockdep_assert_held(&percpu_ref_switch_lock);
+	lockdep_assert_held(&ref->percpu_ref_switch_lock);
 
 	/*
 	 * If the previous ATOMIC switching hasn't finished yet, wait for
@@ -295,7 +295,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 	 * isn't in progress, this function can be called from any context.
 	 */
 	wait_event_lock_irq(percpu_ref_switch_waitq, !data->confirm_switch,
-			    percpu_ref_switch_lock);
+			    ref->percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
 		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
@@ -329,12 +329,12 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
 	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
 
@@ -376,12 +376,12 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
 
@@ -407,7 +407,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ONCE(percpu_ref_is_dying(ref),
 		  "%s called more than once on %ps!", __func__,
@@ -417,7 +417,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
 
@@ -438,12 +438,12 @@ bool percpu_ref_is_zero(struct percpu_ref *ref)
 		return false;
 
 	/* protect us from being destroyed */
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	if (ref->data)
 		count = atomic_long_read(&ref->data->count);
 	else
 		count = ref->percpu_count_ptr >> __PERCPU_REF_FLAG_BITS;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	return count == 0;
 }
@@ -487,7 +487,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	unsigned long __percpu *percpu_count;
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ON_ONCE(!percpu_ref_is_dying(ref));
 	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
@@ -496,6 +496,6 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	percpu_ref_get(ref);
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full()
Date: Fri, 29 Apr 2022 21:35:38 +0800
Message-Id: <20220429133552.33768-5-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

After commit 08d5b29eac7d ("mm: ptep_clear() page table helper"),
ptep_clear() can be used to track the clearing of PTE page table
entries, but pte_clear_not_present_full() is not covered. So convert
it to use ptep_clear() as well; we will need this call in subsequent
patches.

Signed-off-by: Qi Zheng
---
 include/linux/pgtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..bed9a559d45b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -423,7 +423,7 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
 					      pte_t *ptep,
 					      int full)
 {
-	pte_clear(mm, address, ptep);
+	ptep_clear(mm, address, ptep);
 }
 #endif
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
A5cJG+jlDrM3qMTFyQNF8Hrw3Hjkd3C/gEP++G4BVKJAYOU6+6IajYkXQhkcpIwumF8Z nJ9Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=6HmhPOYQfQyw96XMQOEpsURE5/Z+FdPGWunRugIZ8bE=; b=2wE2dqj7c0Pm73Gf090BIjX1cU7ydKXDpNT55OWmm9XNQNRwoyDZi5+JXjsGEa/2Dt KBdyOWNlsQyIfffj0Orb3nSsAG/ezRhhFoW6C3tx8e9zq52yxEn/vKLiSn1c2lmHFHIB 9xAa16R+6e4a6ESe/sc6HjWaP3zQvGbSnkqMuVCDHO4uTmfyAsnsfZlIphQaYJAk7w9G weFYn3lYnjwM73e3zrKZH1XEcfXhgIjt1mgtreHa/V85hE2k+6D4aMOc8a6ocywBTzHu AnLIUxuTjQBOo/pIK76z/2/O9CxLzwpvR3ALme6bxOWBLk8GPrWh00LjGBdJ6j9XU/ab c6zg== X-Gm-Message-State: AOAM530tubnKgV+4xDdbnx94CkzKzeRtwoJwMHUmwGL3XzHSyk+jvKU3 4rjfR4dUQIdTUH6OGi9S7tZTNA== X-Google-Smtp-Source: ABdhPJzqvZjOyyOerClbeLgfXGriI0EREvRRrSKg/2B9wt0ANwQMK0Q+vOCP30x0IB5WyOyWJThPqA== X-Received: by 2002:a05:6a00:21c7:b0:4fd:f89f:ec17 with SMTP id t7-20020a056a0021c700b004fdf89fec17mr40167798pfj.72.1651239406601; Fri, 29 Apr 2022 06:36:46 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.36.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:36:46 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Date: Fri, 29 Apr 2022 21:35:39 +0800 Message-Id: <20220429133552.33768-6-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: 
	<20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

The pte_offset_map_lock() and its friend pte_offset_map() live in mm.h
and pgtable.h respectively; it would be better to have them in one file.
Since they are all helper functions related to page tables, move
pte_offset_map_lock() to pgtable.h. The pte_lockptr() is required by
pte_offset_map_lock(), so move it and its friends {pmd,pud}_lockptr()
to pgtable.h together.

Signed-off-by: Qi Zheng
---
 include/linux/mm.h      | 149 ----------------------------------------
 include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..0afd3b097e90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2252,70 +2252,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
-	return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
-	/*
-	 * prep_new_page() initialize page->private (and therefore page->ptl)
-	 * with 0. Make sure nobody took it in use in between.
-	 *
-	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache, which share storage with page->ptl.
-	 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-	if (!ptlock_alloc(page))
-		return false;
-	spin_lock_init(ptlock_ptr(page));
-	return true;
-}
-
-#else /* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2338,20 +2274,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
-
-#define pte_unmap_unlock(pte, ptl)	do {		\
-	spin_unlock(ptl);				\
-	pte_unmap(pte);					\
-} while (0)
-
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
@@ -2365,58 +2287,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))?	\
 		NULL: pte_offset_kernel(pmd, address))
 
-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
-	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-	return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
-#endif
-	return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
-	ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
-	spinlock_t *ptl = pmd_lockptr(mm, pmd);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
@@ -2433,25 +2303,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to.  The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
-	return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
-	spinlock_t *ptl = pud_lockptr(mm, pud);
-
-	spin_lock(ptl);
-	return ptl;
-}
-
 extern void __init pagecache_init(void);
 extern void free_initmem(void);
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bed9a559d45b..0928acca6b48 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -85,6 +85,141 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+	return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	if (!ptlock_alloc(page))
+		return false;
+	spin_lock_init(ptlock_ptr(page));
+	return true;
+}
+
+#else /* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+	return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page->pmd_huge_pte = NULL;
+#endif
+	return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+	ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else /* !USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif /* USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to.  The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+	return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+	spinlock_t *ptl = pud_lockptr(mm, pud);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -103,6 +238,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
+	pte_t *__pte = pte_offset_map(pmd, address);	\
+	*(ptlp) = __ptl;				\
+	spin_lock(__ptl);				\
+	__pte;						\
+})
+
+#define pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	pte_unmap(pte);					\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
	dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE
Date: Fri, 29 Apr 2022 21:35:40 +0800
Message-Id: <20220429133552.33768-7-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

This configuration variable will be used to build the code needed to
free user PTE page table pages.

The PTE page table setting and clearing functions (such as set_pte_at())
live in each architecture's files, and these functions will be hooked to
implement FREE_USER_PTE, so architecture support is needed.

Signed-off-by: Qi Zheng
---
 mm/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..af99ed626732 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,16 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config ARCH_SUPPORTS_FREE_USER_PTE
+	def_bool n
+
+config FREE_USER_PTE
+	bool "Free user PTE page tables"
+	default y
+	depends on ARCH_SUPPORTS_FREE_USER_PTE && MMU && SMP
+	help
+	  Try to free a user PTE page table page when all of its entries
+	  are none.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
	dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 07/18] mm: add pte_to_page() helper
Date: Fri, 29 Apr 2022 21:35:41 +0800
Message-Id: <20220429133552.33768-8-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

Add a pte_to_page() helper similar to pmd_to_page(), which will be used
to get the struct page of a PTE page table.
Signed-off-by: Qi Zheng
---
 include/linux/pgtable.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0928acca6b48..d1218cb1013e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -85,6 +85,14 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#ifdef CONFIG_FREE_USER_PTE
+static inline struct page *pte_to_page(pte_t *pte)
+{
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+	return virt_to_page((void *)((unsigned long) pte & mask));
+}
+#endif
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
	dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 08/18] mm:
	introduce percpu_ref for user PTE page table page
Date: Fri, 29 Apr 2022 21:35:42 +0800
Message-Id: <20220429133552.33768-9-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical
memory, for the following reasons:

First, we should hold the write lock of mmap_lock as little as possible,
since the mmap_lock semaphore has long been a contention point in the
memory management subsystem. mmap()/munmap() take the write lock, while
madvise(MADV_DONTNEED or MADV_FREE) takes the read lock, so using
madvise() instead of munmap() to release physical memory reduces
contention on the mmap_lock.

Second, after using madvise() to release physical memory, there is no
need to build the vma and allocate page tables again when the same
virtual address is accessed again, which also saves some time.

The following is the largest amount of user PTE page table memory that
can be allocated by a single user process on a 32-bit and a 64-bit
system:

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(For 32-bit, take a 3 GiB user address space and 4 KiB pages as an
example; for 64-bit, take a 48-bit address width and 4 KiB pages as an
example.)

After using madvise(), everything looks good, but as the table above
shows, a single process can create a large number of PTE page tables on
a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page
table memory. And before the process exits or calls munmap(), the kernel
cannot reclaim these pages even if the PTE page tables map nothing.

To fix this situation, this patchset introduces a percpu_ref for each
user PTE page table page. The following hold a percpu_ref:

 - any !pte_none() entry, such as a regular page table entry that maps
   a physical page, or a swap entry, or a migration entry, etc.;
 - any visitor to the PTE page table entries, such as a page table
   walker.

Any !pte_none() entry and any visitor can be regarded as a user of its
PTE page table page. When the percpu_ref drops to 0 (it must be switched
to atomic mode first to check this), no one is using the PTE page table
page any more, and the free PTE page table page can be reclaimed at this
time.
Signed-off-by: Qi Zheng
---
 include/linux/mm.h       |  9 +++++++-
 include/linux/mm_types.h |  1 +
 include/linux/pte_ref.h  | 29 +++++++++++++++++++++++++
 mm/Makefile              |  2 +-
 mm/pte_ref.c             | 47 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afd3b097e90..1a6bc79c351b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 
 struct mempolicy;
 struct anon_vma;
@@ -2260,11 +2261,16 @@ static inline void pgtable_init(void)
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-	if (!ptlock_init(page))
+	if (!pte_ref_init(page))
 		return false;
+	if (!ptlock_init(page))
+		goto free_pte_ref;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	return true;
+free_pte_ref:
+	pte_ref_free(page);
+	return false;
 }
 
 static inline void pgtable_pte_page_dtor(struct page *page)
@@ -2272,6 +2278,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	ptlock_free(page);
 	__ClearPageTable(page);
 	dec_lruvec_page_state(page, NR_PAGETABLE);
+	pte_ref_free(page);
 }
 
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..650bfb22b0e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -153,6 +153,7 @@ struct page {
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				struct percpu_ref *pte_ref; /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..d3963a151ca5
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng
+ */
+
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#ifdef CONFIG_FREE_USER_PTE
+
+bool pte_ref_init(pgtable_t pte);
+void pte_ref_free(pgtable_t pte);
+
+#else /* !CONFIG_FREE_USER_PTE */
+
+static inline bool pte_ref_init(pgtable_t pte)
+{
+	return true;
+}
+
+static inline void pte_ref_free(pgtable_t pte)
+{
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
+
+#endif /* _LINUX_PTE_REF_H */
diff --git a/mm/Makefile b/mm/Makefile
index 4cc13f3179a5..b9711510f84f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o $(mmu-y) pte_ref.o
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..52e31be00de4
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng
+ */
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_FREE_USER_PTE
+
+static void no_op(struct percpu_ref *r) {}
+
+bool pte_ref_init(pgtable_t pte)
+{
+	struct percpu_ref *pte_ref;
+
+	pte_ref = kmalloc(sizeof(struct percpu_ref), GFP_KERNEL);
+	if (!pte_ref)
+		return false;
+	if (percpu_ref_init(pte_ref, no_op,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
+		goto free_ref;
+	/* We want to start with the refcount at zero */
+	percpu_ref_put(pte_ref);
+
+	pte->pte_ref = pte_ref;
+	return true;
+free_ref:
+	kfree(pte_ref);
+	return false;
+}
+
+void pte_ref_free(pgtable_t pte)
+{
+	struct percpu_ref *ref = pte->pte_ref;
+	if (!ref)
+		return;
+
+	pte->pte_ref = NULL;
+	percpu_ref_exit(ref);
+	kfree(ref);
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
	dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	songmuchun@bytedance.com,
    zhouchengming@bytedance.com, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper
Date: Fri, 29 Apr 2022 21:35:43 +0800
Message-Id: <20220429133552.33768-10-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The user PTE page table page may be freed when the last percpu_ref is
dropped, so we need to try to get its percpu_ref before accessing the
PTE page, to prevent it from being freed during the access.

This patch adds pte_tryget() and {__,}pte_put() to help us get and put
the percpu_ref of user PTE page table pages.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pte_ref.h | 23 ++++++++++++++++
 mm/pte_ref.c            | 58 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index d3963a151ca5..bfe620038699 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -12,6 +12,10 @@

 bool pte_ref_init(pgtable_t pte);
 void pte_ref_free(pgtable_t pte);
+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
+bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
+void __pte_put(pgtable_t page);
+void pte_put(pte_t *ptep);

 #else /* !CONFIG_FREE_USER_PTE */

@@ -24,6 +28,25 @@ static inline void pte_ref_free(pgtable_t pte)
 {
 }

+static inline void free_user_pte(struct mm_struct *mm, pmd_t *pmd,
+				 unsigned long addr)
+{
+}
+
+static inline bool pte_tryget(struct mm_struct *mm, pmd_t *pmd,
+			      unsigned long addr)
+{
+	return true;
+}
+
+static inline void __pte_put(pgtable_t page)
+{
+}
+
+static inline void pte_put(pte_t *ptep)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */

 #endif
/* _LINUX_PTE_REF_H */

diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 52e31be00de4..5b382445561e 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -44,4 +44,62 @@ void pte_ref_free(pgtable_t pte)
 	kfree(ref);
 }

+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {}
+
+/*
+ * pte_tryget - try to get the pte_ref of the user PTE page table page
+ * @mm: pointer to the target address space
+ * @pmd: pointer to a PMD.
+ * @addr: virtual address associated with pmd.
+ *
+ * Return: true if getting the pte_ref succeeded, false otherwise.
+ *
+ * Before accessing the user PTE page table, we need to hold a refcount
+ * to protect against the concurrent release of the PTE page table.
+ * But we will fail in the following cases:
+ * - the content mapped in @pmd is not a PTE page
+ * - the pte_ref is zero, so the page may already be being reclaimed
+ */
+bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	bool retval = true;
+	pmd_t pmdval;
+	pgtable_t pte;
+
+	rcu_read_lock();
+	pmdval = READ_ONCE(*pmd);
+	pte = pmd_pgtable(pmdval);
+	if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval))) {
+		retval = false;
+	} else if (!percpu_ref_tryget(pte->pte_ref)) {
+		rcu_read_unlock();
+		/*
+		 * Also do free_user_pte() here to prevent missed reclaim due
+		 * to race condition.
+		 */
+		free_user_pte(mm, pmd, addr & PMD_MASK);
+		return false;
+	}
+	rcu_read_unlock();
+
+	return retval;
+}
+
+void __pte_put(pgtable_t page)
+{
+	percpu_ref_put(page->pte_ref);
+}
+
+void pte_put(pte_t *ptep)
+{
+	pgtable_t page;
+
+	if (pte_huge(*ptep))
+		return;
+
+	page = pte_to_page(ptep);
+	__pte_put(page);
+}
+EXPORT_SYMBOL(pte_put);
+
 #endif /* CONFIG_FREE_USER_PTE */
--
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper
Date: Fri, 29 Apr 2022 21:35:44 +0800
Message-Id: <20220429133552.33768-11-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding:
quoted-printable
Content-Type: text/plain; charset="utf-8"

Now, we usually use pte_offset_map{_lock}() to get the pte_t pointer
before accessing the PTE page table page. After adding FREE_USER_PTE,
we also need to call pte_tryget() before calling
pte_offset_map{_lock}(), to try to get a reference on the PTE page
table page so that it cannot be freed during the access.

This patch adds pte_tryget_map{_lock}() to help us do that. A return
value of NULL indicates that we failed to get the percpu_ref, and that
a concurrent thread is releasing (or has already released) this PTE
page table page. It needs to be treated as the pte_none() case.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pgtable.h | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d1218cb1013e..6f205fee6348 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,6 +228,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }

+#include
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -240,12 +242,38 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_offset_map(dir, address)			\
 	((pte_t *)kmap_atomic(pmd_page(*(dir))) +	\
 	 pte_index((address)))
-#define pte_unmap(pte) kunmap_atomic((pte))
+#define __pte_unmap(pte) kunmap_atomic((pte))
 #else
 #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte))	/* NOP */
+#define __pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif

+#define pte_tryget_map(mm, pmd, address)		\
+({							\
+	pte_t *__pte = NULL;				\
+	if (pte_tryget(mm, pmd, address))		\
+		__pte = pte_offset_map(pmd, address);	\
+	__pte;						\
+})
+
+#define pte_unmap(pte) do {	\
	pte_put(pte);		\
+	__pte_unmap(pte);	\
+} while (0)
+
+#define pte_tryget_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = NULL;			\
+	pte_t *__pte = NULL;				\
+	if (pte_tryget(mm, pmd, address)) {		\
+		__ptl = pte_lockptr(mm, pmd);		\
+		__pte = pte_offset_map(pmd, address);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 #define pte_offset_map_lock(mm, pmd, address, ptlp)	\
 ({							\
 	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
@@ -260,6 +288,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	pte_unmap(pte);					\
 } while (0)

+#define __pte_unmap_unlock(pte, ptl) do {	\
+	spin_unlock(ptl);			\
+	__pte_unmap(pte);			\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
--
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com,
    zhouchengming@bytedance.com, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock()
Date: Fri, 29 Apr 2022 21:35:45 +0800
Message-Id: <20220429133552.33768-12-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Use pte_tryget_map_lock() to try to get the refcount of the PTE page
table page we want to access, which prevents the page from being freed
during the access.

In the following cases, the PTE page table page is already stable:
- we already got the refcount of the PTE page table page
- there are no concurrent threads (e.g. the write lock of mmap_lock is
  acquired)
- the PTE page table page is not yet visible to other threads
- local cpu interrupts are turned off or the rcu lock is held (e.g. the
  GUP fast path)
- the PTE page table page is a kernel PTE page table page

So in these cases we keep using pte_offset_map_lock(), and replace
pte_unmap_unlock() with __pte_unmap_unlock(), which doesn't reduce the
refcount.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c    |  16 ++++--
 include/linux/mm.h    |   2 +-
 mm/damon/vaddr.c      |  30 ++++++----
 mm/debug_vm_pgtable.c |   2 +-
 mm/filemap.c          |   4 +-
 mm/gup.c              |   4 +-
 mm/khugepaged.c       |  10 +++-
 mm/ksm.c              |   4 +-
 mm/madvise.c          |  30 +++++++---
 mm/memcontrol.c       |   8 ++-
 mm/memory-failure.c   |   4 +-
 mm/memory.c           | 125 +++++++++++++++++++++++++++++-------------
 mm/mempolicy.c        |   4 +-
 mm/migrate_device.c   |  22 +++++---
 mm/mincore.c          |   5 +-
 mm/mlock.c            |   5 +-
 mm/mprotect.c         |   4 +-
 mm/mremap.c           |   5 +-
 mm/pagewalk.c         |   4 +-
 mm/swapfile.c         |  13 +++--
 mm/userfaultfd.c      |  11 +++-
 21 files changed, 219 insertions(+), 93 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..5fff96659e4f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -625,7 +625,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	 * keeps khugepaged out of here and from collapsing things
 	 * in here.
 	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		goto out;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
@@ -1178,7 +1180,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;

-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;

@@ -1515,7 +1519,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	 * We can assume that @vma always points to a valid one and @end never
 	 * goes beyond vma->vm_end.
	 */
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmdp, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		pagemap_entry_t pme;

@@ -1849,7 +1855,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
 		if (!page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a6bc79c351b..04f7a6c36dc7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2288,7 +2288,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)

 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
 	(pte_alloc(mm, pmd) ?			\
-		NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
+		NULL : pte_tryget_map_lock(mm, pmd, address, ptlp))

 #define pte_alloc_kernel(pmd, address)			\
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))?	\
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b2ec0aa1ff45..4aa9e252c081 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -372,10 +372,13 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	pmd_t pmdval;

-	if (pmd_huge(*pmd)) {
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (pmd_huge(pmdval)) {
 		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_huge(*pmd)) {
+		if (pmd_huge(pmdval)) {
 			damon_pmdp_mkold(pmd, walk->mm, addr);
 			spin_unlock(ptl);
 			return 0;
@@ -383,9 +386,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 		spin_unlock(ptl);
 	}

-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
 		return 0;
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	damon_ptep_mkold(pte, walk->mm, addr);
@@ -499,18 +504,21 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 	struct damon_young_walk_private *priv = walk->private;
+	pmd_t pmdval;

+retry:
+	pmdval = READ_ONCE(*pmd);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(pmdval)) {
 		ptl = pmd_lock(walk->mm, pmd);
-		if (!pmd_huge(*pmd)) {
+		if (!pmd_huge(pmdval)) {
 			spin_unlock(ptl);
 			goto regular_page;
 		}
-		page = damon_get_page(pmd_pfn(*pmd));
+		page = damon_get_page(pmd_pfn(pmdval));
 		if (!page)
 			goto huge_out;
-		if (pmd_young(*pmd) || !page_is_idle(page) ||
+		if (pmd_young(pmdval) || !page_is_idle(page) ||
 		    mmu_notifier_test_young(walk->mm, addr)) {
 			*priv->page_sz = ((1UL) << HPAGE_PMD_SHIFT);
@@ -525,9 +533,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 regular_page:
 #endif	/* CONFIG_TRANSPARENT_HUGEPAGE */

-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
 		return -EINVAL;
-	pte =
 pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	page = damon_get_page(pte_pfn(*pte));
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index db2abd9e415b..91c4400ca13c 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1303,7 +1303,7 @@ static int __init debug_vm_pgtable(void)
 	 * proper page table lock.
 	 */

-	args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
+	args.ptep = pte_tryget_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
 	pte_clear_tests(&args);
 	pte_advanced_tests(&args);
 	pte_unmap_unlock(args.ptep, ptl);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3a5ffb5587cd..fc156922147b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3368,7 +3368,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	}

 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+	if (!vmf->pte)
+		goto out;
 	do {
 again:
 		page = folio_file_page(folio, xas.xa_index);
diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..d2c24181fb04 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -451,7 +451,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);

-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!ptep)
+		return no_page_table(vma, flags);
 	pte = *ptep;
 	if (!pte_present(pte)) {
 		swp_entry_t entry;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..3776cc315294 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1227,7 +1227,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}

 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
-	pte = pte_offset_map_lock(mm, pmd, address,
 &ptl);
+	pte = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!pte) {
+		result = SCAN_PMD_NULL;
+		goto out;
+	}
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
@@ -1505,7 +1509,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		page_remove_rmap(page, vma, false);
 	}

-	pte_unmap_unlock(start_pte, ptl);
+	__pte_unmap_unlock(start_pte, ptl);

 	/* step 3: set proper refcount and mm_counters. */
 	if (count) {
@@ -1521,7 +1525,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	return;

abort:
-	pte_unmap_unlock(start_pte, ptl);
+	__pte_unmap_unlock(start_pte, ptl);
 	goto drop_hpage;
 }

diff --git a/mm/ksm.c b/mm/ksm.c
index 063a48eeb5ee..64a5f965cfc5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1138,7 +1138,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 				addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);

-	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
 		goto out_mn;
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..8123397f14c8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -207,7 +207,9 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	struct page *page;
 	spinlock_t *ptl;

-	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	orig_pte = pte_tryget_map_lock(vma->vm_mm, pmd, start, &ptl);
+	if (!orig_pte)
+		break;
 	pte = *(orig_pte + ((index - start) / PAGE_SIZE));
 	pte_unmap_unlock(orig_pte, ptl);

@@ -400,7 +402,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if
 (!orig_pte)
+		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -432,12 +436,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				if (split_huge_page(page)) {
 					unlock_page(page);
 					put_page(page);
-					pte_offset_map_lock(mm, pmd, addr, &ptl);
+					orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
 					break;
 				}
 				unlock_page(page);
 				put_page(page);
-				pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+				orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+				if (!pte)
+					break;
 				pte--;
 				addr -= PAGE_SIZE;
 				continue;
@@ -477,7 +483,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}

 	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(orig_pte, ptl);
+	if (orig_pte)
+		pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -602,7 +609,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;

 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!orig_pte)
+		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -648,12 +657,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				if (split_huge_page(page)) {
 					unlock_page(page);
 					put_page(page);
-					pte_offset_map_lock(mm, pmd, addr, &ptl);
+					orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
 					goto out;
 				}
 				unlock_page(page);
 				put_page(page);
-				pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+				orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+				if (!pte)
+					goto out;
 				pte--;
 				addr -= PAGE_SIZE;
 				continue;
@@ -707,7 +718,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(orig_pte, ptl);
+	if (orig_pte)
+		pte_unmap_unlock(orig_pte,
 ptl);
 	cond_resched();
next:
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 725f76723220..ad51ec9043b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5736,7 +5736,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,

 	if (pmd_trans_unstable(pmd))
 		return 0;
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
 			mc.precharge++;	/* increment precharge temporarily */
@@ -5955,7 +5957,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
retry:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
 		bool device = false;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dcb6bb9cf731..5247932df3fa 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -637,8 +637,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	if (pmd_trans_unstable(pmdp))
 		goto out;

-	mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
+	mapped_pte = ptep = pte_tryget_map_lock(walk->vma->vm_mm, pmdp,
 						addr, &ptl);
+	if (!mapped_pte)
+		goto out;
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
 					     hwp->pfn, &hwp->tk);
diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..ca03006b32cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1352,7 +1352,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
again:
 	init_rss_vec(rss);
-	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	start_pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!start_pte)
+		return end;
 	pte = start_pte;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
@@ -1846,7 +1848,9 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 		int pte_idx = 0;
 		const int batch_size = min_t(int, pages_to_write_in_pmd, 8);

-		start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+		start_pte = pte_tryget_map_lock(mm, pmd, addr, &pte_lock);
+		if (!start_pte)
+			break;
 		for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
 			int err = insert_page_in_batch_locked(vma, pte,
 				addr, pages[curr_page_idx], prot);
@@ -2532,9 +2536,13 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		if (!pte)
 			return -ENOMEM;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (mm == &init_mm) {
+			mapped_pte = pte = pte_offset_kernel(pmd, addr);
+		} else {
+			mapped_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+			if (!mapped_pte)
+				return err;
+		}
 	}

 	BUG_ON(pmd_huge(*pmd));
@@ -2787,7 +2795,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;

-		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!vmf->pte) {
+			ret = false;
+			goto pte_unlock;
+		}
 		locked = true;
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/*
@@ -2815,7 +2827,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 			goto warn;

 		/* Re-validate under PTL if the page is still mapped */
-		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!vmf->pte) {
+			ret = false;
+			goto pte_unlock;
+		}
 		locked = true;
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/* The PTE changed under us, update local tlb */
@@ -3005,6 +3021,7 @@ static vm_fault_t wp_page_copy(struct
 vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	vm_fault_t ret = VM_FAULT_OOM;

 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -3048,7 +3065,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!vmf->pte) {
+		mmu_notifier_invalidate_range_only_end(&range);
+		ret = VM_FAULT_RETRY;
+		goto uncharge;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -3129,12 +3151,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		put_page(old_page);
 	}
 	return page_copied ? VM_FAULT_WRITE : 0;
+uncharge:
+	mem_cgroup_uncharge(page_folio(new_page));
oom_free_new:
 	put_page(new_page);
oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }

 /**
@@ -3156,8 +3180,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
 				       &vmf->ptl);
+	if (!vmf->pte)
+		return VM_FAULT_NOPAGE;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -3469,6 +3495,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
+	vm_fault_t ret = 0;

 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
 		return VM_FAULT_RETRY;
@@ -3477,16 +3504,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 				(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
 	mmu_notifier_invalidate_range_start(&range);

-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 				       &vmf->ptl);
+	if (!vmf->pte) {
+		ret = VM_FAULT_RETRY;
+		goto out;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		restore_exclusive_pte(vma, page, vmf->address, vmf->pte);

 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	unlock_page(page);

 	mmu_notifier_invalidate_range_end(&range);
-	return 0;
+	return ret;
 }

 static inline bool should_try_to_free_swap(struct page *page,
@@ -3599,8 +3631,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		 * Back out if somebody else faulted in this pte
 		 * while we released the pte lock.
 		 */
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+		vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd,
 				vmf->address, &vmf->ptl);
+		if (!vmf->pte) {
+			ret = VM_FAULT_OOM;
+			goto out;
+		}
 		if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 			ret = VM_FAULT_OOM;
 		goto unlock;
@@ -3666,8 +3702,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
*/ - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte =3D pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto out_page; if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) goto out_nomap; =20 @@ -3781,6 +3819,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *= vmf) if (vma->vm_flags & VM_SHARED) return VM_FAULT_SIGBUS; =20 +retry: /* * Use pte_alloc() instead of pte_alloc_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created @@ -3803,8 +3842,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault = *vmf) !mm_forbids_zeropage(vma->vm_mm)) { entry =3D pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), vma->vm_page_prot)); - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte =3D pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto retry; if (!pte_none(*vmf->pte)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto unlock; @@ -3843,8 +3884,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault = *vmf) if (vma->vm_flags & VM_WRITE) entry =3D pte_mkwrite(pte_mkdirty(entry)); =20 - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte =3D pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto uncharge; if (!pte_none(*vmf->pte)) { update_mmu_cache(vma, vmf->address, vmf->pte); goto release; @@ -3875,6 +3918,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *= vmf) release: put_page(page); goto unlock; +uncharge: + mem_cgroup_uncharge(page_folio(page)); oom_free_page: put_page(page); oom: @@ -4112,8 +4157,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf) if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; =20 - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte =3D pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + return 0; ret =3D 0; /* Re-check under ptl */ if (likely(pte_none(*vmf->pte))) @@ 
-4340,31 +4387,27 @@ static vm_fault_t do_fault(struct vm_fault *vmf) * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */ if (!vma->vm_ops->fault) { - /* - * If we find a migration pmd entry or a none pmd entry, which - * should never happen, return SIGBUS - */ - if (unlikely(!pmd_present(*vmf->pmd))) - ret =3D VM_FAULT_SIGBUS; - else { - vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, + vmf->pte =3D pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - /* - * Make sure this is not a temporary clearing of pte - * by holding ptl and checking again. A R/M/W update - * of pte involves: take ptl, clearing the pte so that - * we don't have concurrent modification by hardware - * followed by an update. - */ - if (unlikely(pte_none(*vmf->pte))) - ret =3D VM_FAULT_SIGBUS; - else - ret =3D VM_FAULT_NOPAGE; - - pte_unmap_unlock(vmf->pte, vmf->ptl); + if (!vmf->pte) { + ret =3D VM_FAULT_RETRY; + goto out; } + /* + * Make sure this is not a temporary clearing of pte + * by holding ptl and checking again. A R/M/W update + * of pte involves: take ptl, clearing the pte so that + * we don't have concurrent modification by hardware + * followed by an update. 
+ */ + if (unlikely(pte_none(*vmf->pte))) + ret =3D VM_FAULT_SIGBUS; + else + ret =3D VM_FAULT_NOPAGE; + + pte_unmap_unlock(vmf->pte, vmf->ptl); } else if (!(vmf->flags & FAULT_FLAG_WRITE)) ret =3D do_read_fault(vmf); else if (!(vma->vm_flags & VM_SHARED)) @@ -4372,6 +4415,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf) else ret =3D do_shared_fault(vmf); =20 +out: /* preallocated pagetable is unused: free it */ if (vmf->prealloc_pte) { pte_free(vm_mm, vmf->prealloc_pte); @@ -5003,13 +5047,16 @@ int follow_invalidate_pte(struct mm_struct *mm, uns= igned long address, (address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(range); } - ptep =3D pte_offset_map_lock(mm, pmd, address, ptlp); + ptep =3D pte_tryget_map_lock(mm, pmd, address, ptlp); + if (!ptep) + goto invalid; if (!pte_present(*ptep)) goto unlock; *ptepp =3D ptep; return 0; unlock: pte_unmap_unlock(ptep, *ptlp); +invalid: if (range) mmu_notifier_invalidate_range_end(range); out: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8c74107a2b15..a846666c64c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -523,7 +523,9 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned l= ong addr, if (pmd_trans_unstable(pmd)) return 0; =20 - mapped_pte =3D pte =3D pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + mapped_pte =3D pte =3D pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!mapped_pte) + return 0; for (; addr !=3D end; pte++, addr +=3D PAGE_SIZE) { if (!pte_present(*pte)) continue; diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 70c7dc05bbfc..260471f37470 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -64,21 +64,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, unsigned long addr =3D start, unmapped =3D 0; spinlock_t *ptl; pte_t *ptep; + pmd_t pmdval; =20 again: - if (pmd_none(*pmdp)) + pmdval =3D READ_ONCE(*pmdp); + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); =20 - if (pmd_trans_huge(*pmdp)) { + if 
(pmd_trans_huge(pmdval)) { struct page *page; =20 ptl =3D pmd_lock(mm, pmdp); - if (unlikely(!pmd_trans_huge(*pmdp))) { + if (unlikely(!pmd_trans_huge(pmdval))) { spin_unlock(ptl); goto again; } =20 - page =3D pmd_page(*pmdp); + page =3D pmd_page(pmdval); if (is_huge_zero_page(page)) { spin_unlock(ptl); split_huge_pmd(vma, pmdp, addr); @@ -99,16 +101,18 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (ret) return migrate_vma_collect_skip(start, end, walk); - if (pmd_none(*pmdp)) + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); } } =20 - if (unlikely(pmd_bad(*pmdp))) + if (unlikely(pmd_bad(pmdval))) return migrate_vma_collect_skip(start, end, walk); =20 - ptep =3D pte_offset_map_lock(mm, pmdp, addr, &ptl); + ptep =3D pte_tryget_map_lock(mm, pmdp, addr, &ptl); + if (!ptep) + goto again; arch_enter_lazy_mmu_mode(); =20 for (; addr < end; addr +=3D PAGE_SIZE, ptep++) { @@ -588,7 +592,9 @@ static void migrate_vma_insert_page(struct migrate_vma = *migrate, entry =3D pte_mkwrite(pte_mkdirty(entry)); } =20 - ptep =3D pte_offset_map_lock(mm, pmdp, addr, &ptl); + ptep =3D pte_tryget_map_lock(mm, pmdp, addr, &ptl); + if (!ptep) + goto abort; =20 if (check_stable_address_space(mm)) goto unlock_abort; diff --git a/mm/mincore.c b/mm/mincore.c index 9122676b54d6..337f8a45ded0 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -105,6 +105,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long = addr, unsigned long end, unsigned char *vec =3D walk->private; int nr =3D (end - addr) >> PAGE_SHIFT; =20 +again: ptl =3D pmd_trans_huge_lock(pmd, vma); if (ptl) { memset(vec, 1, nr); @@ -117,7 +118,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long = addr, unsigned long end, goto out; } =20 - ptep =3D pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + ptep =3D pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!ptep) + goto again; for (; addr !=3D end; ptep++, addr +=3D PAGE_SIZE) { pte_t pte =3D *ptep; =20 diff --git a/mm/mlock.c 
b/mm/mlock.c index 716caf851043..89f7de636efc 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -314,6 +314,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long ad= dr, pte_t *start_pte, *pte; struct page *page; =20 +again: ptl =3D pmd_trans_huge_lock(pmd, vma); if (ptl) { if (!pmd_present(*pmd)) @@ -328,7 +329,9 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long ad= dr, goto out; } =20 - start_pte =3D pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + start_pte =3D pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!start_pte) + goto again; for (pte =3D start_pte; addr !=3D end; pte++, addr +=3D PAGE_SIZE) { if (!pte_present(*pte)) continue; diff --git a/mm/mprotect.c b/mm/mprotect.c index b69ce7a7b2b7..aa09cd34ea30 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -63,7 +63,9 @@ static unsigned long change_pte_range(struct vm_area_stru= ct *vma, pmd_t *pmd, * from under us even if the mmap_lock is only hold for * reading. */ - pte =3D pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte =3D pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; =20 /* Get target node for single threaded private VMAs */ if (prot_numa && !(vma->vm_flags & VM_SHARED) && diff --git a/mm/mremap.c b/mm/mremap.c index 303d3290b938..d5ea5ce8a22a 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -167,7 +167,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t= *old_pmd, * We don't have to worry about the ordering of src and dst * pte locks because exclusive mmap_lock prevents deadlock. 
	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_pte)
+		goto drop_lock;
 	new_pte = pte_offset_map(new_pmd, new_addr);
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
@@ -206,6 +208,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+drop_lock:
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
 }
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..264b717e24ef 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -50,7 +50,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		err = walk_pte_range_inner(pte, addr, end, walk);
 		pte_unmap(pte);
 	} else {
-		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+		pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+		if (!pte)
+			return end;
 		err = walk_pte_range_inner(pte, addr, end, walk);
 		pte_unmap_unlock(pte, ptl);
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63c61f8b2611..710fbeec9e58 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1790,10 +1790,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte) {
+		ret = -EAGAIN;
+		goto out;
+	}
 	if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
 		ret = 0;
-		goto out;
+		goto unlock;
 	}
 
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
@@ -1808,8 +1812,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
-out:
+unlock:
 	pte_unmap_unlock(pte, ptl);
+out:
 	if (page != swapcache) {
 		unlock_page(page);
 		put_page(page);
@@ -1897,7 +1902,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
-		if (ret)
+		if (ret && ret != -EAGAIN)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0cb8e5ef1713..c1bce9cf5657 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -79,7 +79,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		_dst_pte = pte_mkwrite(_dst_pte);
 	}
 
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!dst_pte)
+		return -EAGAIN;
 
 	if (vma_is_shmem(dst_vma)) {
 		/* serialize against truncate with the page table lock */
@@ -194,7 +196,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
					 dst_vma->vm_page_prot));
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!dst_pte)
+		return -EAGAIN;
 	if (dst_vma->vm_file) {
 		/* the shmem MAP_PRIVATE case requires checking the i_size */
 		inode = dst_vma->vm_file->f_inode;
@@ -587,6 +591,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
+again:
 		dst_pmdval = pmd_read_atomic(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
@@ -612,6 +617,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
				       src_addr, &page, mcopy_mode, wp_copy);
+		if (err == -EAGAIN)
+			goto again;
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de,
    kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
    david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org,
    ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, songmuchun@bytedance.com,
    zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 12/18] mm: convert to use pte_tryget_map()
Date: Fri, 29 Apr 2022 21:35:46 +0800
Message-Id: <20220429133552.33768-13-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Use pte_tryget_map() to try to take a reference on the PTE page table
page we want to access, which prevents the page from being freed while
it is being accessed.

For unuse_pte_range(), pte_offset_map() is called in multiple places
and handling the error condition at each call site is inconvenient, so
we perform pte_tryget() in advance in unuse_pmd_range().

In the following cases, the PTE page table page is stable:

 - we already got the refcount of the PTE page table page
 - there are no concurrent threads (e.g. the write lock of mmap_lock
   is acquired)
 - the PTE page table page is not yet visible
 - local cpu interrupts are disabled or the rcu lock is held (e.g. the
   GUP fast path)
 - the PTE page table page is a kernel PTE page table page

In those cases we keep using pte_offset_map() and replace pte_unmap()
with __pte_unmap(), which does not drop the refcount.

Signed-off-by: Qi Zheng
---
 arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++---
 fs/userfaultfd.c                   | 10 +++++++---
 include/linux/mm.h                 |  2 +-
 include/linux/swapops.h            |  4 ++--
 kernel/events/core.c               |  5 ++++-
 mm/gup.c                           | 16 +++++++++++-----
 mm/hmm.c                           |  9 +++++++--
 mm/huge_memory.c                   |  4 ++--
 mm/khugepaged.c                    |  8 +++++---
 mm/memory-failure.c                | 11 ++++++++---
 mm/memory.c                        | 19 +++++++++++++------
 mm/migrate.c                       |  8 ++++++--
 mm/mremap.c                        |  5 ++++-
 mm/page_table_check.c              |  2 +-
 mm/page_vma_mapped.c               | 13 ++++++++++---
 mm/pagewalk.c                      |  2 +-
 mm/swap_state.c                    |  4 ++--
 mm/swapfile.c                      |  9 ++++++---
 mm/vmalloc.c                       |  2 +-
 19 files changed, 99 insertions(+), 45 deletions(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6d323230320a..37a3f4da7bd2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -171,26 +171,31 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 
 	pud = sme_prepare_pgd(ppd);
 	if (!pud)
 		return;
 
 	pmd = pmd_offset(pud, ppd->vaddr);
-	if (pmd_none(*pmd)) {
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (pmd_none(pmdval)) {
 		pte = ppd->pgtable_area;
 		memset(pte, 0, sizeof(*pte) * PTRS_PER_PTE);
 		ppd->pgtable_area += sizeof(*pte) * PTRS_PER_PTE;
 		set_pmd(pmd, __pmd(PMD_FLAGS | __pa(pte)));
 	}
 
-	if (pmd_large(*pmd))
+	if (pmd_large(pmdval))
 		return;
 
 	pte = pte_offset_map(pmd, ppd->vaddr);
+	if (!pte)
+		goto retry;
 	if (pte_none(*pte))
 		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
-	pte_unmap(pte);
+	__pte_unmap(pte);
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..c83fc73f29c0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -309,6 +309,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 * This is to deal with the instability (as in
 	 * pmd_trans_unstable) of the pmd.
 	 */
+retry:
 	_pmd = READ_ONCE(*pmd);
 	if (pmd_none(_pmd))
 		goto out;
@@ -324,10 +325,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	}
 
 	/*
-	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
-	 * and use the standard pte_offset_map() instead of parsing _pmd.
+	 * After we tryget successfully, the pmd is stable (as in
+	 * !pmd_trans_unstable) so we can re-read it and use the standard
+	 * pte_offset_map() instead of parsing _pmd.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(mm, pmd, address);
+	if (!pte)
+		goto retry;
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
 	 * changes under us.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04f7a6c36dc7..cc8fb009bab7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,7 +2284,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+	(pte_alloc(mm, pmd) ? NULL : pte_tryget_map(mm, pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
 	(pte_alloc(mm, pmd) ?				\
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..b671ecd6b5e7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -214,7 +214,7 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
					spinlock_t *ptl);
-extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+extern bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
					unsigned long address);
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
		struct mm_struct *mm, pte_t *pte);
@@ -236,7 +236,7 @@ static inline int is_migration_entry(swp_entry_t swp)
 
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
					 spinlock_t *ptl) { }
-static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+static inline bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
					 unsigned long address) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
		struct mm_struct *mm, pte_t *pte) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 23bb19716ad3..443b0af075e6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7215,6 +7215,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 		return pud_leaf_size(pud);
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
+retry:
 	pmd = READ_ONCE(*pmdp);
 	if (!pmd_present(pmd))
 		return 0;
@@ -7222,7 +7223,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 	if (pmd_leaf(pmd))
 		return pmd_leaf_size(pmd);
 
-	ptep = pte_offset_map(&pmd, addr);
+	ptep = pte_tryget_map(mm, &pmd, addr);
+	if (!ptep)
+		goto retry;
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
 		size = pte_leaf_size(pte);
diff --git a/mm/gup.c b/mm/gup.c
index d2c24181fb04..114a7e7f871b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -470,7 +470,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (!is_migration_entry(entry))
 			goto no_page;
 		pte_unmap_unlock(ptep, ptl);
-		migration_entry_wait(mm, pmd, address);
+		if (!migration_entry_wait(mm, pmd, address))
+			return no_page_table(vma, flags);
 		goto retry;
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
@@ -805,6 +806,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pmd_t *pmd;
 	pte_t *pte;
 	int ret = -EFAULT;
+	pmd_t pmdval;
 
 	/* user gate pages are read-only */
 	if (gup_flags & FOLL_WRITE)
@@ -822,10 +824,14 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	if (pud_none(*pud))
 		return -EFAULT;
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (!pmd_present(pmdval))
 		return -EFAULT;
-	VM_BUG_ON(pmd_trans_huge(*pmd));
-	pte = pte_offset_map(pmd, address);
+	VM_BUG_ON(pmd_trans_huge(pmdval));
+	pte = pte_tryget_map(mm, pmd, address);
+	if (!pte)
+		goto retry;
 	if (pte_none(*pte))
 		goto unmap;
 	*vma = get_gate_vma(mm);
@@ -2223,7 +2229,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 pte_unmap:
 	if (pgmap)
 		put_dev_pagemap(pgmap);
-	pte_unmap(ptem);
+	__pte_unmap(ptem);
 	return ret;
 }
 #else
diff --git a/mm/hmm.c b/mm/hmm.c
index af71aac3140e..0cf45092efca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -279,7 +279,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	if (is_migration_entry(entry)) {
 		pte_unmap(ptep);
 		hmm_vma_walk->last = addr;
-		migration_entry_wait(walk->mm, pmdp, addr);
+		if (!migration_entry_wait(walk->mm, pmdp, addr))
+			return -EAGAIN;
 		return -EBUSY;
 	}
 
@@ -384,12 +385,16 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	ptep = pte_offset_map(pmdp, addr);
+	ptep = pte_tryget_map(walk->mm, pmdp, addr);
+	if (!ptep)
+		goto again;
 	for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
 		int r;
 
 		r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, hmm_pfns);
 		if (r) {
+			if (r == -EAGAIN)
+				goto again;
 			/* hmm_vma_handle_pte() did pte_unmap() */
 			return r;
 		}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..73ac2e9c9193 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1932,7 +1932,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
@@ -2086,7 +2086,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		set_pte_at(mm, addr, pte, entry);
 		if (!pmd_migration)
 			atomic_inc(&page[i]._mapcount);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	}
 
 	if (!pmd_migration) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3776cc315294..f540d7983b2d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1003,7 +1003,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		.pmd = pmd,
 	};
 
-	vmf.pte = pte_offset_map(pmd, address);
+	vmf.pte = pte_tryget_map(mm, pmd, address);
+	if (!vmf.pte)
+		return false;
 	vmf.orig_pte = *vmf.pte;
 	if (!is_swap_pte(vmf.orig_pte)) {
 		pte_unmap(vmf.pte);
@@ -1145,7 +1147,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_unlock(pte_ptl);
 
 	if (unlikely(!isolated)) {
-		pte_unmap(pte);
+		__pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1168,7 +1170,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
			&compound_pagelist);
-	pte_unmap(pte);
+	__pte_unmap(pte);
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), but
 	 * the smp_wmb() inside __SetPageUptodate() can be reused to
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5247932df3fa..2a840ddfc34e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -304,6 +304,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 
 	VM_BUG_ON_VMA(address == -EFAULT, vma);
 	pgd = pgd_offset(vma->vm_mm, address);
@@ -318,11 +319,15 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 	if (pud_devmap(*pud))
 		return PUD_SHIFT;
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (!pmd_present(pmdval))
 		return 0;
-	if (pmd_devmap(*pmd))
+	if (pmd_devmap(pmdval))
 		return PMD_SHIFT;
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(vma->vm_mm, pmd, address);
+	if (!pte)
+		goto retry;
 	if (pte_present(*pte) && pte_devmap(*pte))
 		ret = PAGE_SHIFT;
 	pte_unmap(pte);
diff --git a/mm/memory.c b/mm/memory.c
index ca03006b32cb..aa2bac561d5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1091,7 +1091,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
-	pte_unmap(orig_src_pte);
+	__pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
@@ -3566,8 +3566,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(vma->vm_mm, vmf->pmd,
-					     vmf->address);
+			if (!migration_entry_wait(vma->vm_mm, vmf->pmd,
+						  vmf->address))
+				ret = VM_FAULT_RETRY;
 		} else if (is_device_exclusive_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
@@ -4507,7 +4508,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATED;
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_tryget_map(vma->vm_mm, vmf->pmd, vmf->address);
+		if (!vmf->pte)
+			return VM_FAULT_RETRY;
 		spin_lock(vmf->ptl);
 		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4617,7 +4620,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
 
-	if (unlikely(pmd_none(*vmf->pmd))) {
+retry:
+	if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) {
 		/*
 		 * Leave __pte_alloc() until later: because vm_ops->fault may
 		 * want to allocate huge page, and if we expose page table
@@ -4646,7 +4650,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * mmap_lock read mode and khugepaged takes it in write mode.
		 * So now it's safe to run pte_offset_map().
		 */
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_tryget_map(vmf->vma->vm_mm, vmf->pmd,
+					  vmf->address);
+		if (!vmf->pte)
+			goto retry;
 		vmf->orig_pte = *vmf->pte;
 
		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..125fbe300df2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,12 +301,16 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	pte_unmap_unlock(ptep, ptl);
 }
 
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
			  unsigned long address)
 {
 	spinlock_t *ptl = pte_lockptr(mm, pmd);
-	pte_t *ptep = pte_offset_map(pmd, address);
+	pte_t *ptep = pte_tryget_map(mm, pmd, address);
+	if (!ptep)
+		return false;
 	__migration_entry_wait(mm, ptep, ptl);
+
+	return true;
 }
 
 void migration_entry_wait_huge(struct vm_area_struct *vma,
diff --git a/mm/mremap.c b/mm/mremap.c
index d5ea5ce8a22a..71022d42f577 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,7 +170,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl);
 	if (!old_pte)
 		goto drop_lock;
-	new_pte = pte_offset_map(new_pmd, new_addr);
+	new_pte = pte_tryget_map(mm, new_pmd, new_addr);
+	if (!new_pte)
+		goto unmap_drop_lock;
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -207,6 +209,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
+unmap_drop_lock:
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 drop_lock:
 	if (need_rmap_locks)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 2458281bff89..185e84f22c6c 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -251,7 +251,7 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
 	pte_t *ptep = pte_offset_map(&pmd, addr);
 	unsigned long i;
 
-	pte_unmap(ptep);
+	__pte_unmap(ptep);
 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		__page_table_check_pte_clear(mm, addr, *ptep);
 		addr += PAGE_SIZE;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 14a5cda73dee..8ecf8fd7cf5e 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -15,7 +15,9 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 
 static bool map_pte(struct page_vma_mapped_walk *pvmw)
 {
-	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+	pvmw->pte = pte_tryget_map(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
+	if (!pvmw->pte)
+		return false;
 	if (!(pvmw->flags & PVMW_SYNC)) {
 		if (pvmw->flags & PVMW_MIGRATION) {
 			if (!is_swap_pte(*pvmw->pte))
@@ -203,6 +205,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	}
 
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
+retry:
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
	 * compiler and used as a stale value after we've observed a
@@ -251,8 +254,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!map_pte(pvmw))
-			goto next_pte;
+		if (!map_pte(pvmw)) {
+			if (!pvmw->pte)
+				goto retry;
+			else
+				goto next_pte;
+		}
 this_pte:
 		if (check_pte(pvmw))
 			return true;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 264b717e24ef..adb5dacbd537 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (walk->no_vma) {
 		pte = pte_offset_map(pmd, addr);
 		err = walk_pte_range_inner(pte, addr, end, walk);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	} else {
 		pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
 		if (!pte)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 013856004825..5b70c2c815ef 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -743,7 +743,7 @@ static void swap_ra_info(struct vm_fault *vmf,
					       SWAP_RA_VAL(faddr, win, 0));
 
 	if (win == 1) {
-		pte_unmap(orig_pte);
+		__pte_unmap(orig_pte);
 		return;
 	}
 
@@ -768,7 +768,7 @@ static void swap_ra_info(struct vm_fault *vmf,
 	for (pfn = start; pfn != end; pfn++)
 		*tpte++ = *pte++;
 #endif
-	pte_unmap(orig_pte);
+	__pte_unmap(orig_pte);
 }
 
 /**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 710fbeec9e58..f1c64fc15e24 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1845,7 +1845,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			continue;
 
 		offset = swp_offset(entry);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 		swap_map = &si->swap_map[offset];
 		page = lookup_swap_cache(entry, vma, addr);
 		if (!page) {
@@ -1880,7 +1880,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 try_next:
 		pte = pte_offset_map(pmd, addr);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap(pte - 1);
+	__pte_unmap(pte - 1);
 
 	ret = 0;
 out:
@@ -1901,8 +1901,11 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
+		if (!pte_tryget(vma->vm_mm, pmd, addr))
+			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
-		if (ret && ret != -EAGAIN)
+		__pte_put(pmd_pgtable(*pmd));
+		if (ret)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..080aa78bdaff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -694,7 +694,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	pte = *ptep;
 	if (pte_present(pte))
 		page = pte_page(pte);
-	pte_unmap(ptep);
+	__pte_unmap(ptep);
 
 	return page;
 }
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
Subject: [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper
Date: Fri, 29 Apr 2022 21:35:47 +0800
Message-Id: <20220429133552.33768-14-zhengqi.arch@bytedance.com>

Normally, the percpu_ref of the user PTE page table page is in percpu mode.
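For readers unfamiliar with percpu_ref, the percpu/atomic distinction that this patch relies on can be sketched with a simplified userspace analog (the `toy_ref` names below are illustrative, not the kernel API): in percpu mode, get/put only touch a per-CPU counter, so the true total is unknown; only after the per-CPU deltas are folded into one shared count, which is roughly what percpu_ref_switch_to_atomic_sync() does, can a zero check be trusted.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Single-threaded analog of a percpu_ref: per-CPU deltas in percpu
 * mode, one authoritative counter in atomic mode. */
struct toy_ref {
	long pcpu[TOY_NR_CPUS];	/* per-CPU deltas, valid in percpu mode */
	long atomic_count;	/* authoritative count in atomic mode */
	bool is_atomic;
};

static void toy_ref_init(struct toy_ref *r, long initial)
{
	for (int i = 0; i < TOY_NR_CPUS; i++)
		r->pcpu[i] = 0;
	r->atomic_count = initial;	/* the "allocation" reference */
	r->is_atomic = false;
}

static void toy_ref_get(struct toy_ref *r, int cpu)
{
	if (r->is_atomic)
		r->atomic_count++;
	else
		r->pcpu[cpu]++;		/* cheap: no shared cacheline */
}

static void toy_ref_put(struct toy_ref *r, int cpu)
{
	if (r->is_atomic)
		r->atomic_count--;
	else
		r->pcpu[cpu]--;
}

/* Analog of percpu_ref_switch_to_atomic_sync(): fold the per-CPU
 * deltas into the shared count so that it becomes exact. */
static void toy_ref_switch_to_atomic(struct toy_ref *r)
{
	for (int i = 0; i < TOY_NR_CPUS; i++) {
		r->atomic_count += r->pcpu[i];
		r->pcpu[i] = 0;
	}
	r->is_atomic = true;
}

/* A zero check is only meaningful once the ref is in atomic mode. */
static bool toy_ref_is_zero(const struct toy_ref *r)
{
	return r->is_atomic && r->atomic_count == 0;
}
```

This is why try_to_free_user_pte() must switch the ref to atomic mode before asking percpu_ref_is_zero(): in percpu mode the per-CPU counters cannot be summed race-free.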
This patch adds try_to_free_user_pte(), which switches the percpu_ref to
atomic mode and checks whether the count is 0. If the percpu_ref is 0, no
one is using the user PTE page table page, so we can safely reclaim it.

Signed-off-by: Qi Zheng
---
 include/linux/pte_ref.h |  7 +++
 mm/pte_ref.c            | 99 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index bfe620038699..379c3b45a6ab 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -16,6 +16,8 @@ void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 void __pte_put(pgtable_t page);
 void pte_put(pte_t *ptep);
+void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+			  bool switch_back);
 
 #else /* !CONFIG_FREE_USER_PTE */
 
@@ -47,6 +49,11 @@ static inline void pte_put(pte_t *ptep)
 {
 }
 
+static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd,
+					unsigned long addr, bool switch_back)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif /* _LINUX_PTE_REF_H */
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 5b382445561e..bf9629272c71 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -8,6 +8,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #ifdef CONFIG_FREE_USER_PTE
 
@@ -44,8 +47,6 @@ void pte_ref_free(pgtable_t pte)
 	kfree(ref);
 }
 
-void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {}
-
 /*
  * pte_tryget - try to get the pte_ref of the user PTE page table page
  * @mm: pointer the target address space
@@ -102,4 +103,98 @@ void pte_put(pte_t *ptep)
 }
 EXPORT_SYMBOL(pte_put);
 
+#ifdef CONFIG_DEBUG_VM
+void pte_free_debug(pmd_t pmd)
+{
+	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		BUG_ON(!pte_none(*ptep++));
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+static inline void pte_free_rcu(struct rcu_head *rcu)
+{
+	struct page *page = container_of(rcu, struct page, rcu_head);
+
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
+/*
+ * free_user_pte - free the user PTE page table page
+ * @mm: pointer the target address space
+ * @pmd: pointer to a PMD
+ * @addr: start address of the tlb range to be flushed
+ *
+ * Context: The pmd range has been unmapped and TLB purged. And the user PTE
+ *          page table page will be freed by rcu handler.
+ */
+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	spinlock_t *ptl;
+	pmd_t pmdval;
+
+	ptl = pmd_lock(mm, pmd);
+	pmdval = *pmd;
+	if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+		spin_unlock(ptl);
+		return;
+	}
+	pmd_clear(pmd);
+	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	spin_unlock(ptl);
+
+	pte_free_debug(pmdval);
+	mm_dec_nr_ptes(mm);
+	call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
+}
+
+/*
+ * try_to_free_user_pte - try to free the user PTE page table page
+ * @mm: pointer the target address space
+ * @pmd: pointer to a PMD
+ * @addr: virtual address associated with pmd
+ * @switch_back: indicates if switching back to percpu mode is required
+ */
+void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+			  bool switch_back)
+{
+	pgtable_t pte;
+
+	if (&init_mm == mm)
+		return;
+
+	if (!pte_tryget(mm, pmd, addr))
+		return;
+	pte = pmd_pgtable(*pmd);
+	percpu_ref_switch_to_atomic_sync(pte->pte_ref);
+	rcu_read_lock();
+	/*
+	 * Here we can safely put the pte_ref because we already hold the rcu
+	 * lock, which guarantees that the user PTE page table page will not
+	 * be released.
+	 */
+	__pte_put(pte);
+	if (percpu_ref_is_zero(pte->pte_ref)) {
+		rcu_read_unlock();
+		free_user_pte(mm, pmd, addr & PMD_MASK);
+		return;
+	}
+	rcu_read_unlock();
+
+	if (switch_back) {
+		if (pte_tryget(mm, pmd, addr)) {
+			percpu_ref_switch_to_percpu(pte->pte_ref);
+			__pte_put(pte);
+		}
+	}
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
Subject: [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case
Date: Fri, 29 Apr 2022 21:35:48 +0800
Message-Id: <20220429133552.33768-15-zhengqi.arch@bytedance.com>
Immediately after a successful MADV_DONTNEED operation, the physical page
is unmapped from the PTE page table entry. This is a good time to call
try_to_free_user_pte() to try to free the PTE page table page.

Signed-off-by: Qi Zheng
---
 mm/internal.h |  3 ++-
 mm/memory.c   | 43 +++++++++++++++++++++++++++++--------------
 mm/oom_kill.c |  3 ++-
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..f93a9170d2e3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -77,7 +77,8 @@ struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details);
+			     struct zap_details *details,
+			     bool free_pte);
 
 void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
 		unsigned int order);
diff --git a/mm/memory.c b/mm/memory.c
index aa2bac561d5e..75a0e16a095a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1339,7 +1339,8 @@ static inline bool should_zap_page(struct zap_details *details, struct page *page
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	struct mm_struct *mm = tlb->mm;
 	int force_flush = 0;
@@ -1348,6 +1349,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	unsigned long start = addr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1455,13 +1457,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto again;
 	}
 
+	if (free_pte)
+		try_to_free_user_pte(mm, pmd, start, true);
+
 	return addr;
 }
 
 static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -1496,7 +1502,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 */
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			goto next;
-		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		next = zap_pte_range(tlb, vma, pmd, addr, next, details,
+				     free_pte);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -1507,7 +1514,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -1525,7 +1533,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
+		next = zap_pmd_range(tlb, vma, pud, addr, next, details,
+				     free_pte);
 next:
 		cond_resched();
 	} while (pud++, addr = next, addr != end);
@@ -1536,7 +1545,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -1546,7 +1556,8 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
+		next = zap_pud_range(tlb, vma, p4d, addr, next, details,
+				     free_pte);
 	} while (p4d++, addr = next, addr != end);
 
 	return addr;
@@ -1555,7 +1566,8 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details)
+			     struct zap_details *details,
+			     bool free_pte)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -1567,7 +1579,8 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
+		next = zap_p4d_range(tlb, vma, pgd, addr, next, details,
+				     free_pte);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -1576,7 +1589,8 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr,
-		struct zap_details *details)
+		struct zap_details *details,
+		bool free_pte)
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
@@ -1612,7 +1626,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			i_mmap_unlock_write(vma->vm_file->f_mapping);
 		}
 	} else
-		unmap_page_range(tlb, vma, start, end, details);
+		unmap_page_range(tlb, vma, start, end, details,
+				 free_pte);
 }
 
@@ -1644,7 +1659,7 @@ void unmap_vmas(struct mmu_gather *tlb,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL, false);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -1669,7 +1684,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, NULL, true);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
@@ -1695,7 +1710,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
-	unmap_single_vma(&tlb, vma, address, range.end, details);
+	unmap_single_vma(&tlb, vma, address, range.end, details, true);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..c4c25a7add7b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			ret = false;
 			continue;
 		}
-		unmap_page_range(&tlb, vma, range.start, range.end, NULL);
+		unmap_page_range(&tlb, vma, range.start, range.end,
+				 NULL, false);
 		mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb);
 	}
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
Subject: [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case
Date: Fri, 29 Apr 2022 21:35:49 +0800
Message-Id: <20220429133552.33768-16-zhengqi.arch@bytedance.com>

Different from the MADV_DONTNEED case, MADV_FREE just marks the physical
page as lazyfree instead of unmapping it immediately, and the physical
page will not be unmapped until the system memory is tight. So we convert
the percpu_ref of the related user PTE page table page to atomic mode in
madvise_free_pte_range(), and then check whether it is 0 in
try_to_unmap_one(). If it is 0, we can safely reclaim the PTE page table
page at that point.

Signed-off-by: Qi Zheng
---
 include/linux/rmap.h |  2 ++
 mm/madvise.c         |  7 ++++++-
 mm/page_vma_mapped.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c            |  9 +++++++++
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 17230c458341..a3174d3bf118 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,6 +204,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Used for MADV_FREE page */
+#define PVMW_MADV_FREE		(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/mm/madvise.c b/mm/madvise.c
index 8123397f14c8..bd4bcaad5a9f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -598,7 +598,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte, *pte, ptent;
 	struct page *page;
 	int nr_swap = 0;
+	bool have_lazyfree = false;
 	unsigned long next;
+	unsigned long start = addr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -709,6 +711,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 		mark_page_lazyfree(page);
+		have_lazyfree = true;
 	}
 out:
 	if (nr_swap) {
@@ -718,8 +721,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
-	if (orig_pte)
+	if (orig_pte) {
 		pte_unmap_unlock(orig_pte, ptl);
+		try_to_free_user_pte(mm, pmd, start, !have_lazyfree);
+	}
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8ecf8fd7cf5e..00bc09f57f48 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -266,8 +266,30 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 next_pte:
 		do {
 			pvmw->address += PAGE_SIZE;
-			if (pvmw->address >= end)
-				return not_found(pvmw);
+			if (pvmw->address >= end) {
+				not_found(pvmw);
+
+				if (pvmw->flags & PVMW_MADV_FREE) {
+					pgtable_t pte;
+					pmd_t pmdval;
+
+					pvmw->flags &= ~PVMW_MADV_FREE;
+					rcu_read_lock();
+					pmdval = READ_ONCE(*pvmw->pmd);
+					if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+						rcu_read_unlock();
+						return false;
+					}
+					pte = pmd_pgtable(pmdval);
+					if (percpu_ref_is_zero(pte->pte_ref)) {
+						rcu_read_unlock();
+						free_user_pte(mm, pvmw->pmd, pvmw->address);
+					} else {
+						rcu_read_unlock();
+					}
+				}
+				return false;
+			}
 			/* Did we cross page table boundary?
 */
 			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
@@ -275,6 +297,26 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					pvmw->ptl = NULL;
 				}
 				pte_unmap(pvmw->pte);
+				if (pvmw->flags & PVMW_MADV_FREE) {
+					pgtable_t pte;
+					pmd_t pmdval;
+
+					pvmw->flags &= ~PVMW_MADV_FREE;
+					rcu_read_lock();
+					pmdval = READ_ONCE(*pvmw->pmd);
+					if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+						rcu_read_unlock();
+						pvmw->pte = NULL;
+						goto restart;
+					}
+					pte = pmd_pgtable(pmdval);
+					if (percpu_ref_is_zero(pte->pte_ref)) {
+						rcu_read_unlock();
+						free_user_pte(mm, pvmw->pmd, pvmw->address);
+					} else {
+						rcu_read_unlock();
+					}
+				}
 				pvmw->pte = NULL;
 				goto restart;
 			}
diff --git a/mm/rmap.c b/mm/rmap.c
index fedb82371efe..f978d324d4f9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1616,6 +1616,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
 			dec_mm_counter(mm, MM_ANONPAGES);
+			if (IS_ENABLED(CONFIG_FREE_USER_PTE))
+				pvmw.flags |= PVMW_MADV_FREE;
 			goto discard;
 		}
 
@@ -1627,6 +1629,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				folio_set_swapbacked(folio);
 			ret = false;
 			page_vma_mapped_walk_done(&pvmw);
+			if (IS_ENABLED(CONFIG_FREE_USER_PTE) &&
+			    pte_tryget(mm, pvmw.pmd, address)) {
+				pgtable_t pte_page = pmd_pgtable(*pvmw.pmd);
+
+				percpu_ref_switch_to_percpu(pte_page->pte_ref);
+				__pte_put(pte_page);
+			}
 			break;
 		}
 
-- 
2.20.1

From nobody Sun May 10 15:48:04 2026
From: Qi Zheng
Subject: [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper
Date: Fri, 29 Apr 2022 21:35:50 +0800
Message-Id: <20220429133552.33768-17-zhengqi.arch@bytedance.com>

track_pte_set() is used to track the setting of a PTE page table entry:
the percpu_ref of the PTE page table page is incremented when the entry
changes from pte_none() to !pte_none(). track_pte_clear() is used to
track the clearing of a PTE page table entry: the percpu_ref of the PTE
page table page is decremented when the entry changes from !pte_none()
to pte_none(). In this way, the usage of the PTE page table page can be
tracked by its percpu_ref.
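The tracking rule above, "count only none-to-present and present-to-none transitions", can be sketched with a minimal userspace analog (the `toy_` names are illustrative, not the kernel API; a zero entry stands in for pte_none()):

```c
#include <assert.h>

#define TOY_PTRS_PER_PTE 8

/* Toy model: the page's reference count mirrors the number of
 * !pte_none() entries in its PTE array. */
struct toy_pte_page {
	unsigned long pte[TOY_PTRS_PER_PTE];	/* 0 == pte_none() */
	int ref;				/* stands in for page->pte_ref */
};

/* Analog of track_pte_set(): bump the ref only when the entry goes
 * from none to present; overwriting a present entry changes nothing. */
static void toy_set_pte(struct toy_pte_page *p, int i, unsigned long val)
{
	if (p->pte[i] == 0 && val != 0)
		p->ref++;
	p->pte[i] = val;
}

/* Analog of track_pte_clear(): drop the ref only when a present
 * entry actually becomes none. */
static void toy_clear_pte(struct toy_pte_page *p, int i)
{
	if (p->pte[i] != 0)
		p->ref--;
	p->pte[i] = 0;
}
```

Because every set/clear path funnels through these two hooks, the count reaching zero (checked in atomic mode) means every entry is pte_none() and the page is reclaimable.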
Signed-off-by: Qi Zheng

---
 include/linux/pte_ref.h | 14 ++++++++++++++
 mm/pte_ref.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 379c3b45a6ab..6ab740e1b989 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -18,6 +18,10 @@ void __pte_put(pgtable_t page);
 void pte_put(pte_t *ptep);
 void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
			   bool switch_back);
+void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		   pte_t pte);
+void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte);
 
 #else /* !CONFIG_FREE_USER_PTE */
 
@@ -54,6 +58,16 @@ static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd,
 {
 }
 
+static inline void track_pte_set(struct mm_struct *mm, unsigned long addr,
+				 pte_t *ptep, pte_t pte)
+{
+}
+
+static inline void track_pte_clear(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif /* _LINUX_PTE_REF_H */
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index bf9629272c71..e92510deda0b 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -197,4 +197,34 @@ void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
 	}
 }
 
+void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		   pte_t pte)
+{
+	pgtable_t page;
+
+	if (&init_mm == mm || pte_huge(pte))
+		return;
+
+	page = pte_to_page(ptep);
+	BUG_ON(percpu_ref_is_zero(page->pte_ref));
+	if (pte_none(*ptep) && !pte_none(pte))
+		percpu_ref_get(page->pte_ref);
+}
+EXPORT_SYMBOL(track_pte_set);
+
+void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte)
+{
+	pgtable_t page;
+
+	if (&init_mm == mm || pte_huge(pte))
+		return;
+
+	page = pte_to_page(ptep);
+	BUG_ON(percpu_ref_is_zero(page->pte_ref));
+	if (!pte_none(pte))
+		percpu_ref_put(page->pte_ref);
+}
+EXPORT_SYMBOL(track_pte_clear);
+
 #endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1

From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref
Date: Fri, 29 Apr 2022 21:35:51 +0800
Message-Id: <20220429133552.33768-18-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

Add pte_ref hooks into the routines that modify user PTE page
tables, and select ARCH_SUPPORTS_FREE_USER_PTE, so that the pte_ref
code can be compiled and work on this architecture.

Signed-off-by: Qi Zheng

---
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b0142e01002e..c1046fc15882 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select ARCH_SUPPORTS_FREE_USER_PTE
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..08d0aa5ce8d4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -1010,6 +1011,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep, pte_t pte)
 {
 	page_table_check_pte_set(mm, addr, ptep, pte);
+	track_pte_set(mm, addr, ptep, pte);
 	set_pte(ptep, pte);
 }
 
@@ -1055,6 +1057,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
 	page_table_check_pte_clear(mm, addr, pte);
+	track_pte_clear(mm, addr, ptep, pte);
 	return pte;
 }
 
@@ -1071,6 +1074,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 	 */
		pte = native_local_ptep_get_and_clear(ptep);
		page_table_check_pte_clear(mm, addr, pte);
+		track_pte_clear(mm, addr, ptep, pte);
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep);
	}
@@ -1081,7 +1085,8 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep)
 {
-	if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK))
+	if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK)
+	    || IS_ENABLED(CONFIG_FREE_USER_PTE))
		ptep_get_and_clear(mm, addr, ptep);
	else
		pte_clear(mm, addr,
			  ptep);
-- 
2.20.1

From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 18/18] Documentation: add document for pte_ref
Date: Fri, 29 Apr 2022 21:35:52 +0800
Message-Id: <20220429133552.33768-19-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

This commit adds a document for pte_ref under `Documentation/vm/`.
Signed-off-by: Qi Zheng

---
 Documentation/vm/index.rst   |   1 +
 Documentation/vm/pte_ref.rst | 210 +++++++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)
 create mode 100644 Documentation/vm/pte_ref.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..ee71baccc2e7 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
   page_frags
   page_owner
   page_table_check
+  pte_ref
   remap_file_pages
   slub
   split_page_table_lock
diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..0ac1e5a408d7
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,210 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================================
+pte_ref: Tracking how many references to each user PTE page table page
+============================================================================
+
+Preface
+=======
+
+In order to pursue high performance, applications mostly use
+high-performance user-mode memory allocators such as jemalloc or tcmalloc.
+These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
+physical memory, for the following reasons::
+
+  First of all, we should hold as few write locks of mmap_lock as possible,
+  since the mmap_lock semaphore has long been a contention point in the
+  memory management subsystem.
+  The mmap()/munmap() paths hold the write lock, while the
+  madvise(MADV_DONTNEED or MADV_FREE) path holds the read lock, so using
+  madvise() instead of munmap() to release physical memory reduces
+  contention on the mmap_lock.
+
+  Secondly, after using madvise() to release physical memory, there is no
+  need to build a vma and allocate page tables again when the same virtual
+  address is accessed again, which also saves some time.
+
+The following is the largest amount of user PTE page table memory that can
+be allocated by a single user process on a 32-bit and a 64-bit system:
+
++---------------------------+--------+---------+
+|                           | 32-bit | 64-bit  |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB  | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB  | 1 GiB   |
++---------------------------+--------+---------+
+
+(For 32-bit, take a 3 GiB user address space and 4 KiB pages as an example;
+for 64-bit, take a 48-bit address width and 4 KiB pages as an example.)
+
+After using madvise(), everything looks good, but as can be seen from the
+table above, a single process can create a large number of PTE page tables
+on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page
+table memory. And before the process exits or calls munmap(), the kernel
+cannot reclaim these pages even if the PTE page tables map nothing.
+
+To fix this, we introduce a reference count for each user PTE page table
+page. We can then track whether a user PTE page table page is in use and
+reclaim user PTE page table pages that map nothing at the right time.
+
+Introduction
+============
+
+The ``pte_ref``, which is the reference count of a user PTE page table page,
+is of ``percpu_ref`` type.
+It is used to track the usage of each user PTE page table
+page.
+
+Who will hold the pte_ref?
+--------------------------
+
+The following will hold a pte_ref::
+
+  Any !pte_none() entry, such as a regular page table entry that maps a
+  physical page, a swap entry, a migration entry, etc.
+
+  Any visitor to the PTE page table entries, such as a page table walker.
+
+Any ``!pte_none()`` entry or visitor can be regarded as a user of the PTE
+page table page. When the pte_ref drops to 0, it means that no one is
+using the PTE page table page, and this free PTE page table page can then
+be reclaimed.
+
+About mode switching
+--------------------
+
+When a user PTE page table page is allocated, its ``pte_ref`` is initialized
+to percpu mode, which has essentially no performance overhead. When we want
+to reclaim the PTE page, it is switched to atomic mode. Then we can check
+whether the ``pte_ref`` is zero::
+
+  - If it is zero, we can safely reclaim the page immediately;
+  - If it is not zero but we expect the PTE page to be reclaimed
+    automatically when no one is using it, we can keep its ``pte_ref`` in
+    atomic mode (e.g. the MADV_FREE case);
+  - If it is not zero and we will try again at the next opportunity, we can
+    switch back to percpu mode (e.g. the MADV_DONTNEED case).
+
+Competitive relationship
+------------------------
+
+Currently, the user page tables are only released by ``free_pgtables()``
+when the process exits or ``unmap_region()`` is called (e.g. the ``munmap()``
+path), so other threads only need to ensure mutual exclusion with these
+paths to ensure that the page table is not released.
+For example::
+
+  thread A                 thread B
+  page table walker        munmap
+  =================        ======
+
+  mmap_read_lock()
+  if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+    pte_offset_map_lock()
+    *walk page table*
+    pte_unmap_unlock()
+  }
+  mmap_read_unlock()
+
+                           mmap_write_lock_killable()
+                           detach_vmas_to_be_unmapped()
+                           unmap_region()
+                           --> free_pgtables()
+
+But after we introduce the ``pte_ref`` for the user PTE page table page,
+this existing balance is broken: the page can be released at any time when
+its ``pte_ref`` drops to 0. Therefore, the following case may happen::
+
+  thread A             thread B                  thread C
+  page table walker    madvise(MADV_DONTNEED)    page fault
+  =================    ======================    ==========
+
+  mmap_read_lock()
+  if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+                       mmap_read_lock()
+                       unmap_page_range()
+                       --> zap_pte_range()
+                           /* the pte_ref is reduced to 0 */
+                       --> free PTE page table page
+
+                                                 mmap_read_lock()
+                                                 /* may allocate
+                                                  * a new huge
+                                                  * pmd or a new
+                                                  * PTE page
+                                                  */
+
+  /* broken!! */
+  pte_offset_map_lock()
+
+As we can see, threads A, B and C all hold the read lock of mmap_lock, so
+they can execute concurrently. When thread B releases the PTE page table
+page, the value in the corresponding pmd entry becomes unstable: it may be
+none, a huge pmd, or map a new PTE page table page again. This causes
+system chaos and even panic.
+
+So as described in the section "Who will hold the pte_ref?", a page table
+walker (visitor) also needs to take a ``pte_ref`` on the user PTE page table
+page before walking the page table (the helper ``pte_tryget_map{_lock}()``
+can do this for us); then the system becomes orderly again::
+
+  thread A                 thread B
+  page table walker        madvise(MADV_DONTNEED)
+  =================        ======================
+
+  mmap_read_lock()
+  if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+    pte_tryget()
+    --> percpu_ref_tryget
+    *if successful, then:*
+
+                           mmap_read_lock()
+                           unmap_page_range()
+                           --> zap_pte_range()
+                               /* the pte_refcount is reduced to 1 */
+
+    pte_offset_map_lock()
+    *walk page table*
+    pte_unmap_unlock()
+
+There is also a lock-less scenario (such as fast GUP). Fortunately, we do
+not need any additional operations to ensure that the system stays in
+order. Take fast GUP as an example::
+
+  thread A                       thread B
+  fast GUP                       madvise(MADV_DONTNEED)
+  ========                       ======================
+
+  get_user_pages_fast_only()
+  --> local_irq_save();
+                                 call_rcu(pte_free_rcu)
+      gup_pgd_range();
+      local_irq_restore();
+                                 /* do pte_free_rcu() */
+
+Helpers
+=======
+
++----------------------+------------------------------------------------+
+| pte_ref_init         | Initialize the pte_ref                         |
++----------------------+------------------------------------------------+
+| pte_ref_free         | Free the pte_ref                               |
++----------------------+------------------------------------------------+
+| pte_tryget           | Try to hold a pte_ref                          |
++----------------------+------------------------------------------------+
+| pte_put              | Decrement a pte_ref                            |
++----------------------+------------------------------------------------+
+| pte_tryget_map       | Do pte_tryget and pte_offset_map               |
++----------------------+------------------------------------------------+
+| pte_tryget_map_lock  | Do pte_tryget and pte_offset_map_lock          |
++----------------------+------------------------------------------------+
+| free_user_pte        | Free the user PTE page table page              |
++----------------------+------------------------------------------------+
+| try_to_free_user_pte | Try to free the user PTE page table page       |
++----------------------+------------------------------------------------+
+| track_pte_set        | Track the setting of user PTE page table page  |
++----------------------+------------------------------------------------+
+| track_pte_clear      | Track the clearing of user PTE page table page |
++----------------------+------------------------------------------------+
+
-- 
2.20.1
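The protocol described in the document above — a walker must successfully take a reference before touching the page, and the page is only freed once its count reaches zero — can be illustrated with a plain userspace refcount. This is a hedged single-threaded sketch of the ordering argument, not the kernel implementation; the `struct pte_page` fields and the simplified `pte_tryget`/`pte_put` bodies are hypothetical stand-ins for the real percpu_ref machinery:

```c
#include <assert.h>
#include <stdbool.h>

/* One simulated user PTE page table page. */
struct pte_page {
	int refcount;	/* starts at 1 while any !pte_none() entry exists */
	bool freed;
};

/* Walker side: models pte_tryget(); fails once the page has been freed. */
static bool pte_tryget(struct pte_page *p)
{
	if (p->freed)
		return false;
	p->refcount++;
	return true;
}

/* Models pte_put(): the last put frees the page (free_user_pte()). */
static void pte_put(struct pte_page *p)
{
	if (--p->refcount == 0)
		p->freed = true;
}
```

With this ordering, a concurrent zap that drops the entries' reference cannot free the page while a walker still holds its own reference, which is exactly the race the "Competitive relationship" section rules out.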