From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:34 +0000
Message-ID: <20230218002819.1486479-2-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE

It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
PageUptodate indicates that the page has been zeroed, and we don't want
to give a non-zeroed page to the user.

The reason this change is being made now is that UFFDIO_CONTINUEs on
subpages definitely shouldn't set this page flag on the head page.

Signed-off-by: James Houghton

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 07abcb6eb203..792cb2e67ce5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
	 * preceding stores to the page contents become visible before
	 * the set_pte_at() write.
	 */
-	__folio_mark_uptodate(folio);
+	if (!is_continue)
+		__folio_mark_uptodate(folio);
+	else if (!folio_test_uptodate(folio)) {
+		/*
+		 * This should never happen; HugeTLB pages are always Uptodate
+		 * as soon as they are allocated.
+		 */
+		ret = -EFAULT;
+		goto out_release_nounlock;
+	}

	/* Add shared, newly allocated pages to the page cache. */
	if (vm_shared && !is_continue) {
-- 
2.39.2.637.g21b0678d19-goog
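For reference, this is the userspace operation the patch hardens: resolving a userfaultfd minor fault on a HugeTLB mapping with UFFDIO_CONTINUE. A minimal sketch, assuming `uffd` already has the VMA registered with UFFDIO_REGISTER_MODE_MINOR; the helper name is invented and error handling is omitted:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Install the already-populated page cache page backing `addr` into the
 * faulting mapping. `len` must be a size the kernel accepts for this VMA
 * (the huge page size today; PAGE_SIZE once HGM is enabled later in this
 * series). */
static int continue_one_page(int uffd, unsigned long addr, unsigned long len)
{
	struct uffdio_continue cont = {
		.range = { .start = addr, .len = len },
		.mode  = 0,
	};

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}

With this patch, a page served this way is never spuriously marked Uptodate by the CONTINUE path itself; it must already have been Uptodate when it entered the page cache.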
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:35 +0000
Message-ID: <20230218002819.1486479-3-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 02/46] hugetlb: remove mk_huge_pte; it is unused

mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
function to call to create a HugeTLB PTE (see
Documentation/mm/arch_pgtable_helpers.rst). It is being removed now to
avoid complicating the implementation of HugeTLB high-granularity
mapping.

Acked-by: Peter Xu
Acked-by: Mina Almasry
Reviewed-by: Mike Kravetz
Signed-off-by: James Houghton

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..c34893719715 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,11 +77,6 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
	set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
 }

-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline int huge_pte_none(pte_t pte)
 {
	return pte_none(pte);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index d7f6335d3999..be2e763e956f 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -5,11 +5,6 @@
 #include
 #include

-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline unsigned long huge_pte_write(pte_t pte)
 {
	return pte_write(pte);
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index af59cc7bd307..fbbc53113473 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -925,7 +925,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
	 * as it was previously derived from a real kernel symbol.
	 */
	page = pfn_to_page(args->fixed_pmd_pfn);
-	pte = mk_huge_pte(page, args->page_prot);
+	pte = mk_pte(page, args->page_prot);

	WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 792cb2e67ce5..540cdf9570d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4899,11 +4899,10 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
	unsigned int shift = huge_page_shift(hstate_vma(vma));

	if (writable) {
-		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
-					 vma->vm_page_prot)));
+		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
+					 vma->vm_page_prot)));
	} else {
-		entry = huge_pte_wrprotect(mk_huge_pte(page,
-					   vma->vm_page_prot));
+		entry = huge_pte_wrprotect(mk_pte(page, vma->vm_page_prot));
	}
	entry = pte_mkyoung(entry);
	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
-- 
2.39.2.637.g21b0678d19-goog
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:36 +0000
Message-ID: <20230218002819.1486479-4-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 03/46] hugetlb: remove redundant pte_mkhuge in migration path

arch_make_huge_pte, which is called immediately following pte_mkhuge,
already makes the necessary changes to the PTE that pte_mkhuge would
have. The generic implementation of arch_make_huge_pte simply calls
pte_mkhuge.

Acked-by: Peter Xu
Acked-by: Mina Almasry
Reviewed-by: Mike Kravetz
Signed-off-by: James Houghton

diff --git a/mm/migrate.c b/mm/migrate.c
index 37865f85df6d..d3964c414010 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -249,7 +249,6 @@ static bool remove_migration_pte(struct folio *folio,
		if (folio_test_hugetlb(folio)) {
			unsigned int shift = huge_page_shift(hstate_vma(vma));

-			pte = pte_mkhuge(pte);
			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
			if (folio_test_anon(folio))
				hugepage_add_anon_rmap(new, vma, pvmw.address,
-- 
2.39.2.637.g21b0678d19-goog
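The redundancy is easiest to see from the generic fallback for arch_make_huge_pte. Roughly paraphrased from include/linux/hugetlb.h (shown for context, not part of this patch):

#ifndef arch_make_huge_pte
static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
				       vm_flags_t flags)
{
	/* The generic hook just repeats what pte_mkhuge() already did, so
	 * calling pte_mkhuge() immediately beforehand adds nothing. */
	return pte_mkhuge(entry);
}
#endif

This is the "generic implementation simply calls pte_mkhuge" referred to in the commit message above.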
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:37 +0000
Message-ID: <20230218002819.1486479-5-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing

Currently this check is overly aggressive. For some userfaultfd VMAs,
PMD sharing is disabled, yet we still widen the address range, which is
used for flushing TLBs and sending MMU notifiers. This is done now, as
HGM VMAs also have sharing disabled, yet would still have flush ranges
adjusted. Overaggressively flushing TLBs and triggering MMU notifiers
is particularly harmful with lots of high-granularity operations.

Acked-by: Peter Xu
Acked-by: Mina Almasry
Reviewed-by: Mike Kravetz
Signed-off-by: James Houghton

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 540cdf9570d3..08004371cfed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6999,22 +6999,31 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
	return saddr;
 }

-bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+static bool pmd_sharing_possible(struct vm_area_struct *vma)
 {
-	unsigned long start = addr & PUD_MASK;
-	unsigned long end = start + PUD_SIZE;
-
 #ifdef CONFIG_USERFAULTFD
	if (uffd_disable_huge_pmd_share(vma))
		return false;
 #endif
	/*
-	 * check on proper vm_flags and page table alignment
+	 * Only shared VMAs can share PMDs.
	 */
	if (!(vma->vm_flags & VM_MAYSHARE))
		return false;
	if (!vma->vm_private_data)	/* vma lock required for sharing */
		return false;
+	return true;
+}
+
+bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+	/*
+	 * check on proper vm_flags and page table alignment
+	 */
+	if (!pmd_sharing_possible(vma))
+		return false;
	if (!range_in_vma(vma, start, end))
		return false;
	return true;
@@ -7035,7 +7044,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
	 * vma needs to span at least one aligned PUD size, and the range
	 * must be at least partially within in.
	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) ||
+	if (!pmd_sharing_possible(vma) || !(v_end > v_start) ||
	    (*end <= v_start) || (*start >= v_end))
		return;

-- 
2.39.2.637.g21b0678d19-goog
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:38 +0000
Message-ID: <20230218002819.1486479-6-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap

This only applies to file-backed HugeTLB, and it should be a no-op until
high-granularity mapping is possible. Also update page_remove_rmap to
support the eventual case where !compound && folio_test_hugetlb().

HugeTLB doesn't use LRU or mlock, so we avoid those bits. This also
means we don't need to use subpage_mapcount; if we did, it would
overflow with only a few mappings.

There is still one caller of page_dup_file_rmap left: copy_present_pte,
and it is always called with compound=false in this case.

Signed-off-by: James Houghton

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 08004371cfed..6c008c9de80e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5077,7 +5077,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
		 * sleep during the process.
		 */
		if (!PageAnon(ptepage)) {
-			page_dup_file_rmap(ptepage, true);
+			page_add_file_rmap(ptepage, src_vma, true);
		} else if (page_try_dup_anon_rmap(ptepage, true,
						  src_vma)) {
			pte_t src_pte_old = entry;
@@ -5910,7 +5910,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
	if (anon_rmap)
		hugepage_add_new_anon_rmap(folio, vma, haddr);
	else
-		page_dup_file_rmap(&folio->page, true);
+		page_add_file_rmap(&folio->page, vma, true);
	new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
				&& (vma->vm_flags & VM_SHARED)));
	/*
@@ -6301,7 +6301,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
		goto out_release_unlock;

	if (folio_in_pagecache)
-		page_dup_file_rmap(&folio->page, true);
+		page_add_file_rmap(&folio->page, dst_vma, true);
	else
		hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);

diff --git a/mm/migrate.c b/mm/migrate.c
index d3964c414010..b0f87f19b536 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -254,7 +254,7 @@ static bool remove_migration_pte(struct folio *folio,
				hugepage_add_anon_rmap(new, vma, pvmw.address,
						       rmap_flags);
			else
-				page_dup_file_rmap(new, true);
+				page_add_file_rmap(new, vma, true);
			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
		} else
 #endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..c010d0af3a82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1318,21 +1318,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
	int nr = 0, nr_pmdmapped = 0;
	bool first;

-	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+	VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
+		       && !folio_test_hugetlb(folio), page);

	/* Is page being mapped by PTE? Is this its first map to be added? */
	if (likely(!compound)) {
		first = atomic_inc_and_test(&page->_mapcount);
		nr = first;
-		if (first && folio_test_large(folio)) {
+		if (first && folio_test_large(folio)
+		    && !folio_test_hugetlb(folio)) {
			nr = atomic_inc_return_relaxed(mapped);
			nr = (nr < COMPOUND_MAPPED);
		}
-	} else if (folio_test_pmd_mappable(folio)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
+	} else {
		first = atomic_inc_and_test(&folio->_entire_mapcount);
-		if (first) {
+		if (first && !folio_test_hugetlb(folio)) {
			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
			if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
				nr_pmdmapped = folio_nr_pages(folio);
@@ -1347,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
		}
	}

+	if (folio_test_hugetlb(folio))
+		return;
+
	if (nr_pmdmapped)
		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
@@ -1376,8 +1379,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
	VM_BUG_ON_PAGE(compound && !PageHead(page), page);

	/* Hugetlb pages are not counted in NR_*MAPPED */
-	if (unlikely(folio_test_hugetlb(folio))) {
-		/* hugetlb pages are always mapped with pmds */
+	if (unlikely(folio_test_hugetlb(folio)) && compound) {
		atomic_dec(&folio->_entire_mapcount);
		return;
	}
@@ -1386,15 +1388,14 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
	if (likely(!compound)) {
		last = atomic_add_negative(-1, &page->_mapcount);
		nr = last;
-		if (last && folio_test_large(folio)) {
+		if (last && folio_test_large(folio)
+		    && !folio_test_hugetlb(folio)) {
			nr = atomic_dec_return_relaxed(mapped);
			nr = (nr < COMPOUND_MAPPED);
		}
-	} else if (folio_test_pmd_mappable(folio)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
+	} else {
		last = atomic_add_negative(-1, &folio->_entire_mapcount);
-		if (last) {
+		if (last && !folio_test_hugetlb(folio)) {
			nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
			if (likely(nr < COMPOUND_MAPPED)) {
				nr_pmdmapped = folio_nr_pages(folio);
@@ -1409,6 +1410,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
		}
	}

+	if (folio_test_hugetlb(folio))
+		return;
+
	if (nr_pmdmapped) {
		if (folio_test_anon(folio))
			idx = NR_ANON_THPS;
-- 
2.39.2.637.g21b0678d19-goog
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:39 +0000
Message-ID: <20230218002819.1486479-7-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 06/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

This adds the Kconfig to enable or disable high-granularity mapping.
Each architecture must explicitly opt-in to it (via
ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING); once an architecture has
opted in, HGM can be enabled whenever HUGETLB_PAGE is enabled.

Signed-off-by: James Houghton

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..a072bbe3439a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -246,6 +246,18 @@ config HUGETLBFS
 config HUGETLB_PAGE
	def_bool HUGETLBFS

+config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool "HugeTLB high-granularity mapping support"
+	default n
+	depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+	help
+	  HugeTLB high-granularity mapping (HGM) allows userspace to issue
+	  UFFDIO_CONTINUE on HugeTLB mappings in PAGE_SIZE chunks.
+	  HGM is incompatible with the HugeTLB Vmemmap Optimization (HVO).
+
 #
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB Vmemmap Optimization (HVO).
@@ -257,6 +269,7 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
	def_bool HUGETLB_PAGE
	depends on ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
	depends on SPARSEMEM_VMEMMAP
+	depends on !HUGETLB_HIGH_GRANULARITY_MAPPING

 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
	bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
-- 
2.39.2.637.g21b0678d19-goog
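The opt-in described above is an ordinary Kconfig select from the architecture. As an illustration only (this fragment is not part of this patch, and the architecture symbol shown is hypothetical; later patches in the series wire up real architectures):

config MY_ARCH
	def_bool y
	select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING

With such a select in place, HUGETLB_HIGH_GRANULARITY_MAPPING becomes visible and can be enabled like any other bool option.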
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:40 +0000
Message-ID: <20230218002819.1486479-8-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag

VM_HUGETLB_HGM indicates that a HugeTLB VMA may contain
high-granularity mappings. Its VmFlags string is "hm".

Acked-by: Mike Kravetz
Signed-off-by: James Houghton

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..77b72f42556a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+		[ilog2(VM_HUGETLB_HGM)]	= "hm",
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
	};
	size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2992a2d55aee..9d3216b4284a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -383,6 +383,13 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_UFFD_MINOR		VM_NONE
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define VM_HUGETLB_HGM_BIT	38
+# define VM_HUGETLB_HGM		BIT(VM_HUGETLB_HGM_BIT)	/* HugeTLB high-granularity mapping */
+#else /* !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+# define VM_HUGETLB_HGM		VM_NONE
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 9db52bc4ce19..bceb960dbada 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -162,6 +162,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define IF_HAVE_HUGETLB_HGM(flag, name) {flag, name},
+#else
+# define IF_HAVE_HUGETLB_HGM(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
	{VM_READ,			"read"		},		\
	{VM_WRITE,			"write"		},		\
@@ -186,6 +192,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
	{VM_ACCOUNT,			"account"	},		\
	{VM_NORESERVE,			"noreserve"	},		\
	{VM_HUGETLB,			"hugetlb"	},		\
+IF_HAVE_HUGETLB_HGM(VM_HUGETLB_HGM,	"hugetlb_hgm"	)		\
	{VM_SYNC,			"sync"		},		\
	__VM_ARCH_SPECIFIC_1				,		\
	{VM_WIPEONFORK,			"wipeonfork"	},		\
-- 
2.39.2.637.g21b0678d19-goog
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:41 +0000
Message-ID: <20230218002819.1486479-9-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers

hugetlb_hgm_eligible indicates that a VMA is eligible to have HGM
explicitly enabled via MADV_SPLIT, and hugetlb_hgm_enabled indicates
that HGM has been enabled.

Reviewed-by: Mina Almasry
Signed-off-by: James Houghton

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7c977d234aba..efd2635a87f5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1211,6 +1211,20 @@ static inline void hugetlb_unregister_node(struct node *node)
 }
 #endif	/* CONFIG_HUGETLB_PAGE */

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
+#else
+static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return false;
+}
+static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
					struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c008c9de80e..0576dcc98044 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7004,6 +7004,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
 #ifdef CONFIG_USERFAULTFD
	if (uffd_disable_huge_pmd_share(vma))
		return false;
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (hugetlb_hgm_enabled(vma))
+		return false;
 #endif
	/*
	 * Only shared VMAs can share PMDs.
@@ -7267,6 +7271,18 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)

 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	/* All shared VMAs may have HGM. */
+	return vma && (vma->vm_flags & VM_MAYSHARE);
+}
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return vma && (vma->vm_flags & VM_HUGETLB_HGM);
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /*
  * These functions are overwritable if your architecture needs its own
  * behavior.
-- 
2.39.2.637.g21b0678d19-goog
From nobody Thu Sep 11 14:00:51 2025
Date: Sat, 18 Feb 2023 00:27:42 +0000
Message-ID: <20230218002819.1486479-10-jthoughton@google.com>
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
From: James Houghton
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra, Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton
Subject: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
applied to non-HugeTLB memory in the future, if such an application
arises.

MADV_SPLIT provides several API changes for some syscalls on HugeTLB
address ranges:
1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
   alignment.
2. read()ing a page fault event from a userfaultfd will yield a
   PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
   address (unless UFFD_FEATURE_EXACT_ADDRESS is used).

There is no way to disable the API changes that come with issuing
MADV_SPLIT. MADV_COLLAPSE can be used to collapse the high-granularity
page table mappings that result from the extended functionality that
comes with using MADV_SPLIT.

For post-copy live migration, the expected use-case is:
1. mmap(MAP_SHARED, some_fd) primary mapping
2. mmap(MAP_SHARED, some_fd) alias mapping
3. MADV_SPLIT the primary mapping
4. UFFDIO_REGISTER/etc. the primary mapping
5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
   corresponding PAGE_SIZE sections in the primary mapping.

More API changes may be added in the future.

Signed-off-by: James Houghton

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..7a26f3648b90 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..f8a74a3a0928 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..a6dc6a56c941 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
+
 #define MADV_HWPOISON	100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */

diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..f98a77c430a9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..996e8ded092f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/mm/madvise.c b/mm/madvise.c
index c2202f51e9dd..8c004c678262 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
	return error;
 }

+static int madvise_split(struct vm_area_struct *vma,
+			 unsigned long *new_flags)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
+		return -EINVAL;
+
+	/*
+	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
+	 * of a VMA, then we will split the VMA. Here, we're unsharing before
+	 * splitting because it's simpler, although we may be unsharing more
+	 * than we need.
+	 */
+	hugetlb_unshare_all_pmds(vma);
+
+	*new_flags |= VM_HUGETLB_HGM;
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
		break;
	case MADV_COLLAPSE:
		return madvise_collapse(vma, prev, start, end);
+	case MADV_SPLIT:
+		error = madvise_split(vma, &new_flags);
+		if (error)
+			goto out;
+		break;
	}

	anon_name = anon_vma_name(vma);
@@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
	case MADV_HUGEPAGE:
	case MADV_NOHUGEPAGE:
	case MADV_COLLAPSE:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	case MADV_SPLIT:
 #endif
	case MADV_DONTDUMP:
	case MADV_DODUMP:
@@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 *  transparent huge pages so the existing pages will not be
 *  coalesced into THP and new pages will not be allocated as THP.
 *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
+ *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
 *  MADV_DONTDUMP - the application wants to prevent pages in the given range
 *  from being included in its core dump.
 *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.39.2.637.g21b0678d19-goog
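A hedged userspace sketch of the post-copy flow described in the commit message above (not from the patch: the helper names are invented, MADV_SPLIT's value is taken from the uapi hunks above, and error handling is omitted). `primary` and `alias` are two MAP_SHARED mappings of the same hugetlbfs file, and `uffd` has the primary mapping registered in minor mode:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

#ifndef MADV_SPLIT
#define MADV_SPLIT 26	/* value proposed by this series on most architectures */
#endif

/* Step 3 of the flow: opt the primary mapping in to the high-granularity APIs. */
static int enable_hgm(void *primary, size_t len)
{
	return madvise(primary, len, MADV_SPLIT);
}

/* Step 5 of the flow: fill one PAGE_SIZE piece through the alias mapping,
 * then map just that piece into the primary mapping with UFFDIO_CONTINUE. */
static int serve_one_small_page(int uffd, char *primary, char *alias,
				const char *src, size_t off, size_t page_size)
{
	struct uffdio_continue cont;

	memcpy(alias + off, src, page_size);	/* populate the shared page cache */

	cont.range.start = (unsigned long)(primary + off);
	cont.range.len = page_size;
	cont.mode = 0;
	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}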
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is needed to handle PTL locking with high-granularity mapping. We won't always be using the PMD-level PTL even if we're using the 2M hugepage hstate. It's possible that we're dealing with 4K PTEs, in which case, we need to lock the PTL for the 4K PTE. Reviewed-by: Mina Almasry Acked-by: Mike Kravetz Signed-off-by: James Houghton diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index cb2dcdb18f8e..035a0df47af0 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *v= ma, =20 psize =3D hstate_get_psize(h); #ifdef CONFIG_DEBUG_VM - assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep)); + assert_spin_locked(huge_pte_lockptr(huge_page_shift(h), + vma->vm_mm, ptep)); #endif =20 #else diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index efd2635a87f5..a1ceb9417f01 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -958,12 +958,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hst= ate *h, gfp_t gfp_mask) return modified_mask; } =20 -static inline spinlock_t *huge_pte_lockptr(struct hstate *h, +static inline spinlock_t *huge_pte_lockptr(unsigned int shift, struct mm_struct *mm, pte_t *pte) { - if (huge_page_size(h) =3D=3D PMD_SIZE) + if (shift =3D=3D PMD_SHIFT) return pmd_lockptr(mm, (pmd_t *) pte); - VM_BUG_ON(huge_page_size(h) =3D=3D PAGE_SIZE); return &mm->page_table_lock; } =20 @@ -1173,7 +1172,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hst= ate *h, gfp_t gfp_mask) return 0; } =20 -static inline spinlock_t *huge_pte_lockptr(struct hstate *h, +static inline spinlock_t *huge_pte_lockptr(unsigned int shift, struct mm_struct *mm, pte_t *pte) { return &mm->page_table_lock; @@ -1230,7 +1229,7 @@ static inline spinlock_t *huge_pte_lock(struct hstate= *h, { spinlock_t *ptl; =20 - ptl =3D huge_pte_lockptr(h, mm, pte); + ptl =3D huge_pte_lockptr(huge_page_shift(h), mm, pte); spin_lock(ptl); return ptl; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0576dcc98044..5ca9eae0ac42 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5017,7 +5017,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, } =20 dst_ptl =3D huge_pte_lock(h, dst, dst_pte); - src_ptl =3D huge_pte_lockptr(h, src, src_pte); + src_ptl =3D huge_pte_lockptr(huge_page_shift(h), src, src_pte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); entry =3D huge_ptep_get(src_pte); again: @@ -5098,7 +5098,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, =20 /* Install the new hugetlb folio if src pte stable */ dst_ptl =3D huge_pte_lock(h, dst, dst_pte); - src_ptl =3D huge_pte_lockptr(h, src, src_pte); + src_ptl =3D huge_pte_lockptr(huge_page_shift(h), + src, src_pte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); entry =3D huge_ptep_get(src_pte); if (!pte_same(src_pte_old, entry)) { @@ -5152,7 +5153,7 @@ static void move_huge_pte(struct vm_area_struct *vma,= unsigned long old_addr, pte_t pte; =20 dst_ptl =3D huge_pte_lock(h, mm, dst_pte); - src_ptl =3D huge_pte_lockptr(h, mm, src_pte); + src_ptl =3D huge_pte_lockptr(huge_page_shift(h), mm, src_pte); =20 /* * We don't 
have to worry about the ordering of src and dst ptlocks diff --git a/mm/migrate.c b/mm/migrate.c index b0f87f19b536..9b4a7e75f6e6 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -363,7 +363,8 @@ void __migration_entry_wait_huge(struct vm_area_struct = *vma, =20 void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { - spinlock_t *ptl =3D huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte); + spinlock_t *ptl =3D huge_pte_lockptr(huge_page_shift(hstate_vma(vma)), + vma->vm_mm, pte); =20 __migration_entry_wait_huge(vma, pte, ptl); } --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2E1AC05027 for ; Sat, 18 Feb 2023 00:29:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230116AbjBRA3a (ORCPT ); Fri, 17 Feb 2023 19:29:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41804 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229951AbjBRA3Q (ORCPT ); Fri, 17 Feb 2023 19:29:16 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 18D716ABC9 for ; Fri, 17 Feb 2023 16:28:55 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id l206-20020a25ccd7000000b006fdc6aaec4fso2656724ybf.20 for ; Fri, 17 Feb 2023 16:28:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Pz2nYroaVRrGSVy94qGG4nZ9E30MSqR5I0rmAzuTvq0=; b=V1UHwis97dacsHy0l1b3lzUp4aEffElgkQqFRGAP4R9Hw6zzh9pL3pU8TQ6LeCkOOO LAnGR0E6fLjGxKpr6E0yb8yofQJ3Yplq8n0cLT/bshgkpsM/hMNaBPY6WVg4rCJqcfel PmhHCvaQjRLpj3PZA4l3c17dZhuSlGVknfSXIUsszWjHRSH2g8HHkboEveCGaLYpHpYI huKFnLPcITV45hT7FdZX9JtRZdP0XFlN1WbgnsZXqSux2N7F1a5KMCJ3UlJJdxC6K6rW 3mwyAccoa5IIH5iiNCO6YwhAkDP3JU0L27dI4gD0cC8hs7RdC32Vnk5NUwnl613BgcAj bg0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Pz2nYroaVRrGSVy94qGG4nZ9E30MSqR5I0rmAzuTvq0=; b=BtJwc+hq0KUweiaaK/OSQmOHE0bMaoAE+zp5RUzUHkGJ6wG049gVstSMxE0CrFL7sX LBwy5kaIXShajNOvYM5TNWqlh21yneKwgmdPGB0oNaqg9M7iCfFHtnb8ff3RSWhLCpoA d2WgR6Un3UnobYhlWBimcuvNt5/MExVniq9VYx3ARJGkGJvX8w0oEDf3KuBOApvMWRW2 8aXacgWb10H+qwzflludmfyR8GBCCLb5LiI36lGDT1f5HBP6MWPjXRY1Yhd1AwEpY2Mv sCiYmpiX4V99MmRz+uwAFi5pneqtszFc6u+0VLOsj31RsYpdWDgZvCEBxYfmPZahnIx6 8pcQ== X-Gm-Message-State: AO0yUKU2PaKJSL8/EfMtDX9UlcHi62l3kDx1ODq92P3qZhSPYxMNzli8 Xk7w44/o6p5ZO39LZs6wcgCEdE+IMIzCKWuV X-Google-Smtp-Source: AK7set+zcld/vdofwpUCcvIXOcKa++WCtmqUm8NMGDjM4RjhBKN3Re9Z99sY+8b3ruIVEFKeHpX84TBdzsYueQO8 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:10c:b0:997:c919:4484 with SMTP id o12-20020a056902010c00b00997c9194484mr28393ybh.6.1676680134689; Fri, 17 Feb 2023 16:28:54 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:44 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 
2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-12-jthoughton@google.com> Subject: [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" After high-granularity mapping, page table entries for HugeTLB pages can be of any size/type. (For example, we can have a 1G page mapped with a mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB PTE after we have done a page table walk. Without this, we'd have to pass around the "size" of the PTE everywhere. We effectively did this before; it could be fetched from the hstate, which we pass around pretty much everywhere. hugetlb_pte_present_leaf is included here as a helper function that will be used frequently later on. Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a1ceb9417f01..eeacadf3272b 100644 Acked-by: Mike Kravetz Reviewed-by: Mina Almasry --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -26,6 +26,25 @@ typedef struct { unsigned long pd; } hugepd_t; #define __hugepd(x) ((hugepd_t) { (x) }) #endif =20 +enum hugetlb_level { + HUGETLB_LEVEL_PTE =3D 1, + /* + * We always include PMD, PUD, and P4D in this enum definition so that, + * when logged as an integer, we can easily tell which level it is. + */ + HUGETLB_LEVEL_PMD, + HUGETLB_LEVEL_PUD, + HUGETLB_LEVEL_P4D, + HUGETLB_LEVEL_PGD, +}; + +struct hugetlb_pte { + pte_t *ptep; + unsigned int shift; + enum hugetlb_level level; + spinlock_t *ptl; +}; + #ifdef CONFIG_HUGETLB_PAGE =20 #include @@ -39,6 +58,20 @@ typedef struct { unsigned long pd; } hugepd_t; */ #define __NR_USED_SUBPAGE 3 =20 +static inline +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte) +{ + return 1UL << hpte->shift; +} + +static inline +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte) +{ + return ~(hugetlb_pte_size(hpte) - 1); +} + +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte); + struct hugepage_subpool { spinlock_t lock; long count; @@ -1234,6 +1267,45 @@ static inline spinlock_t *huge_pte_lock(struct hstat= e *h, return ptl; } =20 +static inline +spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte) +{ + return hpte->ptl; +} + +static inline +spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte) +{ + spinlock_t *ptl =3D hugetlb_pte_lockptr(hpte); + + spin_lock(ptl); + return ptl; +} + +static inline +void __hugetlb_pte_init(struct hugetlb_pte *hpte, pte_t *ptep, + unsigned int shift, enum hugetlb_level level, + spinlock_t *ptl) +{ + /* + * If 'shift' indicates that this PTE is contiguous, then @ptep must + * be the first pte of the contiguous bunch. 
+ */ + hpte->ptl =3D ptl; + hpte->ptep =3D ptep; + hpte->shift =3D shift; + hpte->level =3D level; +} + +static inline +void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte, + pte_t *ptep, unsigned int shift, + enum hugetlb_level level) +{ + __hugetlb_pte_init(hpte, ptep, shift, level, + huge_pte_lockptr(shift, mm, ptep)); +} + #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA) extern void __init hugetlb_cma_reserve(int order); #else diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5ca9eae0ac42..6c74adff43b6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1269,6 +1269,35 @@ static bool vma_has_reserves(struct vm_area_struct *= vma, long chg) return false; } =20 +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte) +{ + pgd_t pgd; + p4d_t p4d; + pud_t pud; + pmd_t pmd; + + switch (hpte->level) { + case HUGETLB_LEVEL_PGD: + pgd =3D __pgd(pte_val(pte)); + return pgd_present(pgd) && pgd_leaf(pgd); + case HUGETLB_LEVEL_P4D: + p4d =3D __p4d(pte_val(pte)); + return p4d_present(p4d) && p4d_leaf(p4d); + case HUGETLB_LEVEL_PUD: + pud =3D __pud(pte_val(pte)); + return pud_present(pud) && pud_leaf(pud); + case HUGETLB_LEVEL_PMD: + pmd =3D __pmd(pte_val(pte)); + return pmd_present(pmd) && pmd_leaf(pmd); + case HUGETLB_LEVEL_PTE: + return pte_present(pte); + default: + WARN_ON_ONCE(1); + return false; + } +} + + static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio) { int nid =3D folio_nid(folio); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C5A0C64ED6 for ; Sat, 18 Feb 2023 00:29:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230119AbjBRA3e (ORCPT ); Fri, 17 Feb 2023 19:29:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43048 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229694AbjBRA3R (ORCPT ); Fri, 17 Feb 2023 19:29:17 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 118F16ABF3 for ; Fri, 17 Feb 2023 16:28:56 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5365a2b9e4fso19053897b3.15 for ; Fri, 17 Feb 2023 16:28:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Coc3VejVpx2o70gw2F1FMLmVPB7+pJmCD4QBd1TWaEQ=; b=b9QeURwuMOHoR6oMuvaNhmrHn5fZGAuH6iVnLRa9UH1kSasBCwgdba6QJPd+K6jzHb xz6s/6q+eVA72+3bOi37v4IpLi+gt+dDSU4ZiqKrbXd+tRUGRvRJ1mGjZ02I2XMyurj4 SsSE1g1trZVfhuYlrhfBM4w2eQv5qoEpUPlr5CJMDFyaM52fzsGSeRGzxi0rFH4dVLC6 IFd4sxykmk7qNUAT62W1oRvfIgHfE5E1BizMsR8G5eNcCZgGLkwZhgNod0TNXPX+Z62j 6Y1IxEaN7fiBpSKCe960CuuSJGub6psv+J5ih1KSgDonLBFQ4O9QsMNnykmF5WR/o4UG 3CMg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Coc3VejVpx2o70gw2F1FMLmVPB7+pJmCD4QBd1TWaEQ=; b=klwZ9okWf7daMKyeDQ0qFM8L7FDsN5pOwsL7CRHRgZ3SM5JhPzz8QOzOydPwjoACUj 7VXFCi3kyGhJ/jRrOZWSw6ogAjQdaktugZf/lpSroB8MctgELZQvGyiL7jNbYp2emez5 
10KCnYsM+OEL+oVtyRjRjZnDc1YxPuHnkjQUgOhsZ13E+ppVw97ZCuFFqrtO9YltPny/ rtxU+99k/BnlmDBAaHpvY4wf7hmkeSRQ3GvPIl6fQWXFsSowC2dJH6x3p2An5L9yALLe N7RuUwpGn3UDSLYPU3UPdaivWXw/GeGnesY12IhtIN17lrJrQhTJf+6fSBr+80UmIlf3 ziPQ== X-Gm-Message-State: AO0yUKXOH+ifzjGYDFbF9mVdSLaY0CypMaP58GQQo72c/iJJTcrXjhom h4kxniod5CGCa2R7DnPUSoThV53CdGmIAxcB X-Google-Smtp-Source: AK7set9ALTOiMY3O/c14DOIxwwoyqi8AmUxl79K9DJLNuYtBmR+KN/t3Tjueou+iosR6XToAl+9j+ymfqMf1UvES X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a81:b705:0:b0:534:d71f:14e6 with SMTP id v5-20020a81b705000000b00534d71f14e6mr53501ywh.9.1676680135521; Fri, 17 Feb 2023 16:28:55 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:45 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-13-jthoughton@google.com> Subject: [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" These functions are used to allocate new PTEs below the hstate PTE. This will be used by hugetlb_walk_step, which implements stepping forwards in a HugeTLB high-granularity page table walk. The reasons that we don't use the standard pmd_alloc/pte_alloc* functions are: 1) This prevents us from accidentally overwriting swap entries or attempting to use swap entries as present non-leaf PTEs (see pmd_alloc(); we assume that !pte_none means pte_present and non-leaf). 2) Locking hugetlb PTEs can different than regular PTEs. (Although, as implemented right now, locking is the same.) 3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB HGM won't use HIGHPTE, but the kernel can still be built with it, and other mm code will use it. When GENERAL_HUGETLB supports P4D-based hugepages, we will need to implement hugetlb_pud_alloc to implement hugetlb_walk_step. Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index eeacadf3272b..9d839519c875 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -72,6 +72,11 @@ unsigned long hugetlb_pte_mask(const struct hugetlb_pte = *hpte) =20 bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte); =20 +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr); +pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr); + struct hugepage_subpool { spinlock_t lock; long count; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6c74adff43b6..bb424cdf79e4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -483,6 +483,120 @@ static bool has_same_uncharge_info(struct file_region= *rg, #endif } =20 +/* + * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte. + * + * This is meant to be used to implement hugetlb_walk_step when one must g= o to + * step down to a PMD. 
Different architectures may implement hugetlb_walk_= step + * differently, but hugetlb_alloc_pmd and hugetlb_alloc_pte are architectu= re- + * independent. + * + * Returns: + * On success: the pointer to the PMD. This should be placed into a + * hugetlb_pte. @hpte is not changed. + * ERR_PTR(-EINVAL): hpte is not PUD-level + * ERR_PTR(-EEXIST): there is a non-leaf and non-empty PUD in @hpte + * ERR_PTR(-ENOMEM): could not allocate the new PMD + */ +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr) +{ + spinlock_t *ptl =3D hugetlb_pte_lockptr(hpte); + pmd_t *new; + pud_t *pudp; + pud_t pud; + + if (hpte->level !=3D HUGETLB_LEVEL_PUD) + return ERR_PTR(-EINVAL); + + pudp =3D (pud_t *)hpte->ptep; +retry: + pud =3D READ_ONCE(*pudp); + if (likely(pud_present(pud))) + return unlikely(pud_leaf(pud)) + ? ERR_PTR(-EEXIST) + : pmd_offset(pudp, addr); + else if (!pud_none(pud)) + /* + * Not present and not none means that a swap entry lives here, + * and we can't get rid of it. + */ + return ERR_PTR(-EEXIST); + + new =3D pmd_alloc_one(mm, addr); + if (!new) + return ERR_PTR(-ENOMEM); + + spin_lock(ptl); + if (!pud_same(pud, *pudp)) { + spin_unlock(ptl); + pmd_free(mm, new); + goto retry; + } + + mm_inc_nr_pmds(mm); + smp_wmb(); /* See comment in pmd_install() */ + pud_populate(mm, pudp, new); + spin_unlock(ptl); + return pmd_offset(pudp, addr); +} + +/* + * hugetlb_alloc_pte -- Allocate a PTE beneath a pmd_none PMD-level hpte. + * + * See the comment above hugetlb_alloc_pmd. + */ +pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr) +{ + spinlock_t *ptl =3D hugetlb_pte_lockptr(hpte); + pgtable_t new; + pmd_t *pmdp; + pmd_t pmd; + + if (hpte->level !=3D HUGETLB_LEVEL_PMD) + return ERR_PTR(-EINVAL); + + pmdp =3D (pmd_t *)hpte->ptep; +retry: + pmd =3D READ_ONCE(*pmdp); + if (likely(pmd_present(pmd))) + return unlikely(pmd_leaf(pmd)) + ? ERR_PTR(-EEXIST) + : pte_offset_kernel(pmdp, addr); + else if (!pmd_none(pmd)) + /* + * Not present and not none means that a swap entry lives here, + * and we can't get rid of it. + */ + return ERR_PTR(-EEXIST); + + /* + * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result + * in page tables being allocated in high memory, needing a kmap to + * access. Instead, we call __pte_alloc_one directly with + * GFP_PGTABLE_USER to prevent these PTEs being allocated in high + * memory. 
+ */ + new =3D __pte_alloc_one(mm, GFP_PGTABLE_USER); + if (!new) + return ERR_PTR(-ENOMEM); + + spin_lock(ptl); + if (!pmd_same(pmd, *pmdp)) { + spin_unlock(ptl); + pgtable_pte_page_dtor(new); + __free_page(new); + goto retry; + } + + mm_inc_nr_ptes(mm); + smp_wmb(); /* See comment in pmd_install() */ + pmd_populate(mm, pmdp, new); + spin_unlock(ptl); + return pte_offset_kernel(pmdp, addr); +} + static void coalesce_file_region(struct resv_map *resv, struct file_region= *rg) { struct file_region *nrg, *prg; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AB316C636D6 for ; Sat, 18 Feb 2023 00:29:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230141AbjBRA3l (ORCPT ); Fri, 17 Feb 2023 19:29:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43004 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230009AbjBRA3S (ORCPT ); Fri, 17 Feb 2023 19:29:18 -0500 Received: from mail-ua1-x94a.google.com (mail-ua1-x94a.google.com [IPv6:2607:f8b0:4864:20::94a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0BA966ABE3 for ; Fri, 17 Feb 2023 16:28:57 -0800 (PST) Received: by mail-ua1-x94a.google.com with SMTP id j4-20020ab06004000000b0068b93413c63so812928ual.18 for ; Fri, 17 Feb 2023 16:28:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1676680136; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=uRKYla0TQmxQv6jLt3zA0HnNAsQhQcQxeVgRZIxAfD0=; b=aY3F+EdXtCUshFPXDCNcttdzNX+c4LbhQi1SlJKFNgfqHWfMJ9vDmVr9sTpieHk05X o4uR5QadzoXUGRF67oJyJbtK2h79UlcD3xZm62j78SP9Oa4ap24kfVkt0+OF4A/n+edW 7IAq9oxJqPxznoi+TtMM93uAoNkDnONQ+ts18nLlvOqmoYn3iH/qQKh7/ZFGrihwcuj9 1V4KX6t2Pwg0FKhmdUmpWWjWMENeKOVdjc9494RWzzvR00UMehx3hUstaQ0j8uMeOex4 5VuSVQoGMANGWUUusrSYOeAZy9IVSY0n/qOvAGec/mTz8lN3nGuzMNi0BmdrRxSyqNfw Sbbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1676680136; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=uRKYla0TQmxQv6jLt3zA0HnNAsQhQcQxeVgRZIxAfD0=; b=sPrLLK6vsihjMt6o+NwR2KqE6PQrEr644HtbFMBmmTBWpgWrDhW25qcndTV3emU9JK fTZgdnrr9wnFkytlrXY5NDmrELsG40qIXingdbA3uLTDCuIKZ2ROh2qo3ytj1OsQ0zKU KMYBrd49zPEPqXfGbNKOcTpAk9DLSqv82itM/orswCTERqs63h5HRyVpkRvtZgBk2Rxb Fly7ZLSqB48BpyMqE2ZwJzShyEehR33yiVOOPz5gy7FGxNF/eLdUaOz+UomMb5JD5GYv X2WuBJl9UkQGTAFcOMyPvTh3GbP1soH40SW7tF9yVHf0DvtCXjjypYQgDXDVdenrL5tc 7XRQ== X-Gm-Message-State: AO0yUKVPN32B/j2XVD4urBD9a5+99ggPeEZaZV2OD85tFL1GO/QO388P jrJ+fBmPdoEjJp0TlGM4J04NarKq3QHGkoTv X-Google-Smtp-Source: AK7set/GEorAyYMaCd0jNzqquEaNj7jwa1sYuJrs5tjMlDyYAqUGDdv0CU5O6YV9UsW4yKpSO/jg62VpebURY2U0 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6102:153:b0:417:159c:218b with SMTP id a19-20020a056102015300b00417159c218bmr652647vsr.13.1676680136399; Fri, 17 Feb 2023 16:28:56 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:46 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog 
Message-ID: <20230218002819.1486479-14-jthoughton@google.com> Subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" hugetlb_hgm_walk implements high-granularity page table walks for HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return immediately. hugetlb_walk_step implements how we step forwards in the walk. For architectures that don't use GENERAL_HUGETLB, they will need to provide their own implementation. The broader API that should be used is hugetlb_full_walk[,alloc|,continue]. Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9d839519c875..726d581158b1 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *map= ping, pgoff_t idx); pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pud_t *pud); =20 +int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma, + unsigned long addr); +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte, + struct vm_area_struct *vma, unsigned long addr); +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte, + struct vm_area_struct *vma, unsigned long addr, + unsigned long target_sz); + struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage); =20 extern int sysctl_hugetlb_shm_group; @@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_a= rea_struct *vma, pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); unsigned long hugetlb_mask_last_page(struct hstate *h); +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr, unsigned long sz); int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pte_t *ptep); void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, @@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node); void hugetlb_unregister_node(struct node *node); #endif =20 +enum hugetlb_level hpage_size_to_level(unsigned long sz); + #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; =20 @@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node= *node) static inline void hugetlb_unregister_node(struct node *node) { } + +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz) +{ + return HUGETLB_LEVEL_PTE; +} #endif /* CONFIG_HUGETLB_PAGE */ =20 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bb424cdf79e4..810c05feb41f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_ar= ea_struct *vma); static void hugetlb_unshare_pmds(struct vm_area_struct *vma, unsigned long start, unsigned long end); =20 +/* + * hpage_size_to_level() - convert @sz to the corresponding page table lev= el + * + * @sz must be less than or equal to a valid hugepage size. 
+ */ +enum hugetlb_level hpage_size_to_level(unsigned long sz) +{ + /* + * We order the conditionals from smallest to largest to pick the + * smallest level when multiple levels have the same size (i.e., + * when levels are folded). + */ + if (sz < PMD_SIZE) + return HUGETLB_LEVEL_PTE; + if (sz < PUD_SIZE) + return HUGETLB_LEVEL_PMD; + if (sz < P4D_SIZE) + return HUGETLB_LEVEL_PUD; + if (sz < PGDIR_SIZE) + return HUGETLB_LEVEL_P4D; + return HUGETLB_LEVEL_PGD; +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, uns= igned long addr) } #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */ =20 +/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to res= olve + * the page table entry for @addr. We might allocate new PTEs. + * + * @hpte must always be pointing at an hstate-level PTE or deeper. + * + * This function will never walk further if it encounters a PTE of a size + * less than or equal to @sz. + * + * @alloc determines what we do when we encounter an empty PTE. If false, + * we stop walking. If true and @sz is less than the current PTE's size, + * we make that PTE point to the next level down, going until @sz is the s= ame + * as our current PTE. + * + * If @alloc is false and @sz is PAGE_SIZE, this function will always + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz. + * + * Return: + * -ENOMEM if we couldn't allocate new PTEs. + * -EEXIST if the caller wanted to walk further than a migration PTE, + * poison PTE, or a PTE marker. The caller needs to manually deal + * with this scenario. + * -EINVAL if called with invalid arguments (@sz invalid, @hpte not + * initialized). + * 0 otherwise. + * + * Even if this function fails, @hpte is guaranteed to always remain + * valid. + */ +static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct = *vma, + struct hugetlb_pte *hpte, unsigned long addr, + unsigned long sz, bool alloc) +{ + int ret =3D 0; + pte_t pte; + + if (WARN_ON_ONCE(sz < PAGE_SIZE)) + return -EINVAL; + + if (WARN_ON_ONCE(!hpte->ptep)) + return -EINVAL; + + while (hugetlb_pte_size(hpte) > sz && !ret) { + pte =3D huge_ptep_get(hpte->ptep); + if (!pte_present(pte)) { + if (!alloc) + return 0; + if (unlikely(!huge_pte_none(pte))) + return -EEXIST; + } else if (hugetlb_pte_present_leaf(hpte, pte)) + return 0; + ret =3D hugetlb_walk_step(mm, hpte, addr, sz); + } + + return ret; +} + +/* + * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will + * initialize @hpte with hstate-level PTE pointer @ptep. + */ +static int hugetlb_hgm_walk(struct hugetlb_pte *hpte, + pte_t *ptep, + struct vm_area_struct *vma, + unsigned long addr, + unsigned long target_sz, + bool alloc) +{ + struct hstate *h =3D hstate_vma(vma); + + hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h), + hpage_size_to_level(huge_page_size(h))); + return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz, + alloc); +} + +/* + * hugetlb_full_walk_continue - continue a high-granularity page-table wal= k. + * + * If a user has a valid @hpte but knows that @hpte is not a leaf, they can + * attempt to continue walking by calling this function. + * + * This function will never fail, but @hpte might not change. + * + * If @hpte hasn't been initialized, then this function's behavior is + * undefined. 
+ */ +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte, + struct vm_area_struct *vma, + unsigned long addr) +{ + /* __hugetlb_hgm_walk will never fail with these arguments. */ + WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, + PAGE_SIZE, false)); +} + +/* + * hugetlb_full_walk - do a high-granularity page-table walk; never alloca= te. + * + * This function can only fail if we find that the hstate-level PTE is not + * allocated. Callers can take advantage of this fact to skip address regi= ons + * that cannot be mapped in that case. + * + * If this function succeeds, @hpte is guaranteed to be valid. + */ +int hugetlb_full_walk(struct hugetlb_pte *hpte, + struct vm_area_struct *vma, + unsigned long addr) +{ + struct hstate *h =3D hstate_vma(vma); + unsigned long sz =3D huge_page_size(h); + /* + * We must mask the address appropriately so that we pick up the first + * PTE in a contiguous group. + */ + pte_t *ptep =3D hugetlb_walk(vma, addr & huge_page_mask(h), sz); + + if (!ptep) + return -ENOMEM; + + /* hugetlb_hgm_walk will never fail with these arguments. */ + WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false)); + return 0; +} + +/* + * hugetlb_full_walk_alloc - do a high-granularity walk, potentially alloc= ate + * new PTEs. + */ +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte, + struct vm_area_struct *vma, + unsigned long addr, + unsigned long target_sz) +{ + struct hstate *h =3D hstate_vma(vma); + unsigned long sz =3D huge_page_size(h); + /* + * We must mask the address appropriately so that we pick up the first + * PTE in a contiguous group. + */ + pte_t *ptep =3D huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h), + sz); + + if (!ptep) + return -ENOMEM; + + return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true); +} + #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long sz) @@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm, return (pte_t *)pmd; } =20 +/* + * hugetlb_walk_step() - Walk the page table one step to resolve the page + * (hugepage or subpage) entry at address @addr. + * + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the + * lowest level PTE). + * + * @hpte will always remain valid, even if this function fails. + * + * Architectures that implement this function must ensure that if @hpte do= es + * not change levels, then its PTL must also stay the same. + */ +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte, + unsigned long addr, unsigned long sz) +{ + pte_t *ptep; + spinlock_t *ptl; + + switch (hpte->level) { + case HUGETLB_LEVEL_PUD: + ptep =3D (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr); + if (IS_ERR(ptep)) + return PTR_ERR(ptep); + hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT, + HUGETLB_LEVEL_PMD); + break; + case HUGETLB_LEVEL_PMD: + ptep =3D hugetlb_alloc_pte(mm, hpte, addr); + if (IS_ERR(ptep)) + return PTR_ERR(ptep); + ptl =3D pte_lockptr(mm, (pmd_t *)hpte->ptep); + __hugetlb_pte_init(hpte, ptep, PAGE_SHIFT, + HUGETLB_LEVEL_PTE, ptl); + break; + default: + WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n", + __func__, hpte->level, hpte->shift); + return -EINVAL; + } + return 0; +} + /* * Return a mask that can be used to update an address to the last huge * page in a page table page mapping size. 
Used to skip non-present --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7C63C636D6 for ; Sat, 18 Feb 2023 00:29:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230054AbjBRA3s (ORCPT ); Fri, 17 Feb 2023 19:29:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43020 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229903AbjBRA3T (ORCPT ); Fri, 17 Feb 2023 19:29:19 -0500 Received: from mail-ua1-x949.google.com (mail-ua1-x949.google.com [IPv6:2607:f8b0:4864:20::949]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38F3D6B303 for ; Fri, 17 Feb 2023 16:28:58 -0800 (PST) Received: by mail-ua1-x949.google.com with SMTP id f9-20020ab049c9000000b00419afefbe3eso547186uad.4 for ; Fri, 17 Feb 2023 16:28:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6oe0JKndOCFJAJQ6FnTw91/ULLeb3XwnLhf6Mjl8ksE=; b=ZXgI9bmT7U5wLokZ0C4oNJ6Gk9RcbKcaK8gcoGLiwH9auXPSDVM0Ol5I8tJdiHkQBm cqJXsvHhpPo95w78Rkbk3SLZVdtOTUOQ5nztMCmY5/Ud5mRj3hFPJhvSI6benmg8rsXE zUOmpa69V6q/M9cjYkffaZNIdg3JvQ6rKjZMp6Dj0DYTIaXY+Lg7BLYbHF6CKnajEaK+ wl2mnnELaLd/60dmFNDm2ghVntcF83+VBQ5Wv+xyP+zua7DC9xkWhlP06ETH3LjJnFLh bf40rMtbYVk0AxD73EzwULPcAUGoIp7bBDuosmnG4aaaqWZsAeWgTL7c1Q5nHavHsXWV bZFQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6oe0JKndOCFJAJQ6FnTw91/ULLeb3XwnLhf6Mjl8ksE=; b=xBdmixG4X0rO4cr/lK7de3N0pw0oRTCqyX1b98hzmfShRwMC5lWjNEO3xbkrtJ4vot FQdvFBd4WG77Sn23vFNkcbISjzmrIgZ1pYEmtz4EBNvVGXhqNi6aIMd07IinHm8a+T9z 1c7BBbkid2J1JHxFNm48VCN+ZJRHtrF8M9WWVoDrvu5DSBVcnCAJ3kj/yQQQBZDnCUDk BQs4eA7CtEThoH5MSbeXaBz7k6Ko24mvi9XG7BnDSktQfsHxJgDzMOATZZp3v35jFZE0 w/FDWFpAlECEKQocxVhKOKLiP7diNL8kKbScQxD9okSzuZ2zpUJ6Z4ARlDO71cEbLzgP Vhaw== X-Gm-Message-State: AO0yUKXgbmeLzP5jgwVKl1N7fkjfS7rOGjMyj+OH3tmnX1Lofcj4eHPi NLdukANXts9C5m3yRHqKp4mFl3HBa9GwXPFA X-Google-Smtp-Source: AK7set8EIO6lV9DKEYEv7Gocq2q4pnYUxsgcQQWqM5Ogw0BQTVEawiIrDCY/tG3VwGpwjeKgD8+npdD3lNuWqFdx X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a1f:9111:0:b0:409:92de:63bd with SMTP id t17-20020a1f9111000000b0040992de63bdmr110245vkd.12.1676680137159; Fri, 17 Feb 2023 16:28:57 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:47 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-15-jthoughton@google.com> Subject: [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Fix how UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT interact in these two ways: - UFFDIO_WRITEPROTECT no longer prevents a high-granularity UFFDIO_CONTINUE. - UFFD-WP PTE markers installed with UFFDIO_WRITEPROTECT will be properly propagated when high-granularily UFFDIO_CONTINUEs are performed. Note: UFFDIO_WRITEPROTECT is not yet permitted at PAGE_SIZE granularity. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 810c05feb41f..f74183acc521 100644 Acked-by: Mike Kravetz --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -506,6 +506,30 @@ static bool has_same_uncharge_info(struct file_region = *rg, #endif } =20 +static void hugetlb_install_markers_pmd(pmd_t *pmdp, pte_marker marker) +{ + int i; + + for (i =3D 0; i < PTRS_PER_PMD; ++i) + /* + * WRITE_ONCE not needed because the pud hasn't been + * installed yet. + */ + pmdp[i] =3D __pmd(pte_val(make_pte_marker(marker))); +} + +static void hugetlb_install_markers_pte(pte_t *ptep, pte_marker marker) +{ + int i; + + for (i =3D 0; i < PTRS_PER_PTE; ++i) + /* + * WRITE_ONCE not needed because the pmd hasn't been + * installed yet. + */ + ptep[i] =3D make_pte_marker(marker); +} + /* * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte. * @@ -528,23 +552,32 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct= hugetlb_pte *hpte, pmd_t *new; pud_t *pudp; pud_t pud; + bool is_marker; + pte_marker marker; =20 if (hpte->level !=3D HUGETLB_LEVEL_PUD) return ERR_PTR(-EINVAL); =20 pudp =3D (pud_t *)hpte->ptep; retry: + is_marker =3D false; pud =3D READ_ONCE(*pudp); if (likely(pud_present(pud))) return unlikely(pud_leaf(pud)) ? ERR_PTR(-EEXIST) : pmd_offset(pudp, addr); - else if (!pud_none(pud)) + else if (!pud_none(pud)) { /* - * Not present and not none means that a swap entry lives here, - * and we can't get rid of it. + * Not present and not none means that a swap entry lives here. + * If it's a PTE marker, we can deal with it. If it's another + * swap entry, we don't attempt to split it. */ - return ERR_PTR(-EEXIST); + is_marker =3D is_pte_marker(__pte(pud_val(pud))); + if (!is_marker) + return ERR_PTR(-EEXIST); + + marker =3D pte_marker_get(pte_to_swp_entry(__pte(pud_val(pud)))); + } =20 new =3D pmd_alloc_one(mm, addr); if (!new) @@ -557,6 +590,13 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct = hugetlb_pte *hpte, goto retry; } =20 + /* + * Install markers before PUD to avoid races with other + * page tables walks. + */ + if (is_marker) + hugetlb_install_markers_pmd(new, marker); + mm_inc_nr_pmds(mm); smp_wmb(); /* See comment in pmd_install() */ pud_populate(mm, pudp, new); @@ -576,23 +616,32 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct= hugetlb_pte *hpte, pgtable_t new; pmd_t *pmdp; pmd_t pmd; + bool is_marker; + pte_marker marker; =20 if (hpte->level !=3D HUGETLB_LEVEL_PMD) return ERR_PTR(-EINVAL); =20 pmdp =3D (pmd_t *)hpte->ptep; retry: + is_marker =3D false; pmd =3D READ_ONCE(*pmdp); if (likely(pmd_present(pmd))) return unlikely(pmd_leaf(pmd)) ? 
ERR_PTR(-EEXIST) : pte_offset_kernel(pmdp, addr); - else if (!pmd_none(pmd)) + else if (!pmd_none(pmd)) { /* - * Not present and not none means that a swap entry lives here, - * and we can't get rid of it. + * Not present and not none means that a swap entry lives here. + * If it's a PTE marker, we can deal with it. If it's another + * swap entry, we don't attempt to split it. */ - return ERR_PTR(-EEXIST); + is_marker =3D is_pte_marker(__pte(pmd_val(pmd))); + if (!is_marker) + return ERR_PTR(-EEXIST); + + marker =3D pte_marker_get(pte_to_swp_entry(__pte(pmd_val(pmd)))); + } =20 /* * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result @@ -613,6 +662,9 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct h= ugetlb_pte *hpte, goto retry; } =20 + if (is_marker) + hugetlb_install_markers_pte(page_address(new), marker); + mm_inc_nr_ptes(mm); smp_wmb(); /* See comment in pmd_install() */ pmd_populate(mm, pmdp, new); @@ -7384,7 +7436,12 @@ static int __hugetlb_hgm_walk(struct mm_struct *mm, = struct vm_area_struct *vma, if (!pte_present(pte)) { if (!alloc) return 0; - if (unlikely(!huge_pte_none(pte))) + /* + * In hugetlb_alloc_pmd and hugetlb_alloc_pte, + * we split PTE markers, so we can tolerate + * PTE markers here. + */ + if (unlikely(!huge_pte_none_mostly(pte))) return -EEXIST; } else if (hugetlb_pte_present_leaf(hpte, pte)) return 0; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 92EDDC636D6 for ; Sat, 18 Feb 2023 00:29:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230091AbjBRA3o (ORCPT ); Fri, 17 Feb 2023 19:29:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43120 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229947AbjBRA3T (ORCPT ); Fri, 17 Feb 2023 19:29:19 -0500 Received: from mail-vk1-xa4a.google.com (mail-vk1-xa4a.google.com [IPv6:2607:f8b0:4864:20::a4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 20AC96B311 for ; Fri, 17 Feb 2023 16:28:58 -0800 (PST) Received: by mail-vk1-xa4a.google.com with SMTP id n123-20020a1fbd81000000b00401684aa41aso609622vkf.17 for ; Fri, 17 Feb 2023 16:28:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=xG9DbE8VOder+S14FjxoH+IwzXAdpZM7a0a5kZcYva0=; b=a3A4PV5p5qh9WhQPMBGPxG18XU5syYapjVA8cJrIQKg9SlfzzX0meZ5KSEbjCS76/h KkgUk1GyrdDSRmNwgJs67ciqNHmVAv5DJmhTpojLF0Uyhx6Rxq0nYHhEXa500KqHdLSY 8OPHt1vFrno7tMNUTlD5OZynE2nE9LC7G2tsTG/pmDJw36wEBjXLJkf3kjpEqyBFrFHU onGHC9FIBifpRYHynXz9NhBphtMbFvKNAw8XziHocmnSczkwT0klgF5M5m56ALq64vUA bsFoi/I6FyoSWC8X++RYI9nsaIKHXsGED5l665XewLFlFoDtjmgAh+mOcZ6rWrr94rxB 1OVA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=xG9DbE8VOder+S14FjxoH+IwzXAdpZM7a0a5kZcYva0=; b=44XdOPU0tjiy/GdRBHvTZJQxlwEc8kTJGm83Bx1p+wkIszjxL9vD+cKhBEYtyqGJCq xpEqttxco6656KAfXOFz2cny1Rhd6fajnXNtzxY/MyzRFNrcL402xU8gJyHCE3rbprw9 R4dHWAYx44EuUwkEq3IaQtCQpaa3soxpIw9i+DEPGV1nItk6yKQN0IfxGJfrIRSoLfhn 
Q5MkW56KnoJo2EBcO8gpQ4oXp3Dpi9FesMD8lcT68Bnj2wm5DspYjhBtTmmL0Oyww6bz AYlenYlmukvtykVqqmmkYDcgxsOPesqcN9EoE1DSlPCuUHLuwhVOmBbMFvp7aj96IuIy UBug== X-Gm-Message-State: AO0yUKWWwpudVbK7JLBdHtRgIWe8Fll5ur2FiOKcKdarbDoLp6HGLizN PTeaC1y4j9ogAej6+4/rPGN+g2mk/yu2Bru+ X-Google-Smtp-Source: AK7set90mnEVzNzEwK6fPTTkXMb5MvuuAdnXyAlauMlVrzIA9ZFHAX7xiVkHYeaXZEfkvh8rxB+tXWDlCgeheDym X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:ab0:53d9:0:b0:68d:6360:77b with SMTP id l25-20020ab053d9000000b0068d6360077bmr26282uaa.1.1676680137978; Fri, 17 Feb 2023 16:28:57 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:48 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-16-jthoughton@google.com> Subject: [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This allows us to make huge PTEs at shifts other than the hstate shift, which will be necessary for high-granularity mappings. Acked-by: Mike Kravetz Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f74183acc521..ed1d806020de 100644 Reviewed-by: Mina Almasry --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5110,11 +5110,11 @@ const struct vm_operations_struct hugetlb_vm_ops = =3D { .pagesize =3D hugetlb_vm_op_pagesize, }; =20 -static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, - int writable) +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma, + struct page *page, int writable, + int shift) { pte_t entry; - unsigned int shift =3D huge_page_shift(hstate_vma(vma)); =20 if (writable) { entry =3D huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page, @@ -5128,6 +5128,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vm= a, struct page *page, return entry; } =20 +static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, + int writable) +{ + unsigned int shift =3D huge_page_shift(hstate_vma(vma)); + + return make_huge_pte_with_shift(vma, page, writable, shift); +} + static void set_huge_ptep_writable(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C5911C05027 for ; Sat, 18 Feb 2023 00:31:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230257AbjBRAbA (ORCPT ); Fri, 17 Feb 2023 19:31:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43520 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229773AbjBRA36 (ORCPT ); Fri, 17 Feb 2023 19:29:58 -0500 Received: from 
mail-ua1-x949.google.com (mail-ua1-x949.google.com [IPv6:2607:f8b0:4864:20::949]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D764F692BA for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) Received: by mail-ua1-x949.google.com with SMTP id f9-20020ab049c9000000b00419afefbe3eso547545uad.4 for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1676680159; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=iq/7jaT5OFDRS9NrMnKt72E+h6CEDJFDrueWaD5zQ0Y=; b=LJj15ou5aMEGPdFnqegHuN/91OeXfp4rRCRW3cHsebkq5WCcCf2NnXo+eHh1uMaO0x HGob/Qz7INsRX4HMjLzhzpa+KbL6w+JBpKkbb6I/Ysl4ceV4fqt+kMp8qMdpjJbgCgDd Tcz7EoAzke3MYE3bcoJtrb8uvPbhZOxumTn85X2FXJuWHnvKqPNwDU25UykrRCmdga/F knEIFRpT0+EamIE3eGJxptV9TuGnb5/oXNb+6p7X3bGPy9+qYpYd06XgumHdicpJSnfx vsXDE/Y+0UGaB7YP7fdTkJQcKpB/Er0uopshKy4OGw0b9a8rj98PAJY4bfzYQJ4aBoYZ rlXw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1676680159; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=iq/7jaT5OFDRS9NrMnKt72E+h6CEDJFDrueWaD5zQ0Y=; b=h52h9LsO2+kALS3Uzc3ZFJ/Q3YCMhrMgvqr3U24xXrk+RYkys4/CGnLKgP3g9DbT+s dWJPnkz976UMx+uM9z+1O7fCy3pT1LbHMxu7EwlHkClJDAPPW9DDy/UzCY1vyvzscpvn lG9r8n/6e7QFzPAnbspy2qlXEItj6BB0kwYo1qLq7fv+flPIa7m2c2keWjiMnqaR6nrb gJDbk7ZVd0QEJFmiN0ZkLxL/8laBFs4rRbFtVY8o5Iael+V7ZZdfUyWpCrPHbKEgTZ+8 InMMeB2urEAD8Llqpe87wzlDJzggexTvwcPUk+Cwpjqp1ciYSYAyaD9+HOsiHoeie4O4 BFbg== X-Gm-Message-State: AO0yUKVaeD27MiKgHlL7DA3SyU5UbqAWtkmf+feLYNu+5cORjikB/erA 3nSpRHu8rfoemmpzZ0HHU7kQIT2Q3rE0vgQ1 X-Google-Smtp-Source: AK7set8EhVlNCK9IccPiC1WcAxl/BD8tBcWSdj7osxR3aU5m9A2R/pd4W9j1Bya2QmGiWMHJAKk9Wk9ZzEwHF/MA X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a67:f1c6:0:b0:3ec:ab8:a571 with SMTP id v6-20020a67f1c6000000b003ec0ab8a571mr271401vsm.55.1676680139239; Fri, 17 Feb 2023 16:28:59 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:49 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-17-jthoughton@google.com> Subject: [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is a simple change: don't create a "huge" PTE if we are making a regular, PAGE_SIZE PTE. All architectures that want to implement HGM likely need to be changed in a similar way if they implement their own version of arch_make_huge_pte. 
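
To illustrate that point (not part of this patch), an architecture that provides its own arch_make_huge_pte would need an equivalent guard; the body below is a hypothetical sketch that only mirrors the generic change, since real architectures use their own PTE encodings:

/*
 * Hypothetical per-arch override sketch: only mark the entry as a huge
 * leaf when the mapping is actually larger than a base page.
 */
static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
				       vm_flags_t flags)
{
	if (shift == PAGE_SHIFT)
		return entry;		/* PAGE_SIZE mapping: regular PTE */
	return pte_mkhuge(entry);	/* PMD/PUD/... leaf entry */
}
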
Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 726d581158b1..b767b6889dea 100644 Reviewed-by: Mike Kravetz --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -899,7 +899,7 @@ static inline void arch_clear_hugepage_flags(struct pag= e *page) { } static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) { - return pte_mkhuge(entry); + return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry; } #endif =20 --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 88718C6379F for ; Sat, 18 Feb 2023 00:29:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229972AbjBRA35 (ORCPT ); Fri, 17 Feb 2023 19:29:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229773AbjBRA3X (ORCPT ); Fri, 17 Feb 2023 19:29:23 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D8F66C00F for ; Fri, 17 Feb 2023 16:29:01 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id k72-20020a25244b000000b0083fa6f15c2fso1931074ybk.16 for ; Fri, 17 Feb 2023 16:29:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=0wSMKMWzwfQ8fLAFHXt0NmwJTi1H1o52qxxMn528B3Y=; b=axPHyxy1GBdJ3j5NWk4sW5NYHawOKf9AoibB4HV5FfIZJydc97adJIODrPR+QPgxNK nr00e5FMgkdE8K907sG4ZDoHr5Lt8bsvRMbzhBZu9D/vDZvX+woxXErLOoWLQSU7lSB3 ColvlhjaPZZh6IxAnrEln+IITIxYCmy04Z2l1SbzE3B6POnZJUSMCAYWIRaT1Yw6pXLR Wq0cbmdstbtdCR3d2d+bJjVqqNZDwx8tE8yRYgFkpy+28U20Db+9VK0iOYSso+pD2sG8 OJeoIbKDbGGxUqdzljTj52m+1rUrxcY+G67jmbfYWqpHTo5y/BK5aqWA0UQIkWR55NRP GAqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=0wSMKMWzwfQ8fLAFHXt0NmwJTi1H1o52qxxMn528B3Y=; b=M9PdICCwevoqsB+YVTZsRvQA61+M95wXb0iiBN3kIbODe4IrbG7LRQmgVpBlmVrqaS 3PDMXLyKm5+6I0IXX1hKLyLZdou2vq95clAAR4aCJGS/qRCvtcAjtqz40z7Y5lMaK9KV l0E9+fyxGoU4kbf198vOow1giRLatW4yVE/bHiPLqaRhoWLyrec8sTlUMOxnQyG7V4Yl OJNMLEeAKhvBq3QytuB34asEcjSPnnVWKLhZTdWfg7zb7hpD1jpUn15fOct9SWRFBABH WcWWYLlpe5zA91qiP5lJn4J+vmB984/78vdtDnOS9Fc/0Q9vkYn9A9algB6N+kmL6hzW Aa4g== X-Gm-Message-State: AO0yUKUGGrCD7M5JpPpo+d/M2EdLTg+ulxaUvXg1m2BkERda9pyfJIUJ Uu30ZxWxXHgh0wnMwBQ7wgQAQC5hRf6PGl5/ X-Google-Smtp-Source: AK7set8P4IRNjQRV3GSaApBMgcJvUr/kZLWUTr2MSxmgTV4ZqD4cm7y5M6XYvFN0JDI5vO1MJ3hc9iawF3gnH6MW X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:690c:38d:b0:533:a15a:d33e with SMTP id bh13-20020a05690c038d00b00533a15ad33emr73114ywb.5.1676680140092; Fri, 17 Feb 2023 16:29:00 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:50 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: 
<20230218002819.1486479-18-jthoughton@google.com> Subject: [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Because it is safe to do so, do a full high-granularity page table walk to check if the page is mapped. Signed-off-by: James Houghton diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index cfd09f95551b..c0ee69f0418e 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -386,17 +386,24 @@ static void hugetlb_delete_from_page_cache(struct fol= io *folio) static bool hugetlb_vma_maps_page(struct vm_area_struct *vma, unsigned long addr, struct page *page) { - pte_t *ptep, pte; + pte_t pte; + struct hugetlb_pte hpte; =20 - ptep =3D hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma))); - if (!ptep) + if (hugetlb_full_walk(&hpte, vma, addr)) return false; =20 - pte =3D huge_ptep_get(ptep); + pte =3D huge_ptep_get(hpte.ptep); if (huge_pte_none(pte) || !pte_present(pte)) return false; =20 - if (pte_page(pte) =3D=3D page) + if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) + /* + * We raced with someone splitting us, and the only case + * where this is impossible is when the pte was none. + */ + return false; + + if (compound_head(pte_page(pte)) =3D=3D page) return true; =20 return false; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E4FFC05027 for ; Sat, 18 Feb 2023 00:30:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230160AbjBRAaA (ORCPT ); Fri, 17 Feb 2023 19:30:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41876 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230110AbjBRA33 (ORCPT ); Fri, 17 Feb 2023 19:29:29 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D9D56C012 for ; Fri, 17 Feb 2023 16:29:01 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5365a2b9e4fso19055427b3.15 for ; Fri, 17 Feb 2023 16:29:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Of81VsYOGaJiC6DFXM6sxGYcirLfRpMhUkY/yI+h7kc=; b=faTIHphcD3M2ErzVxaHvUUDOZt6TLoZA4Xzil1xu0/G1//a74JWzTWI0tyBHfpuNha JuiaDE7sP9JLYbJHclSCn1zLcoQow2nzU2lXYNGugjxvfih5hgYInPRhhLcCctGhNLo3 KwFJIDeU1tiOhE1xHmEihdXHq7cGiCAhH0q2zoPtNQjwpXYxrAP0w6dyKJSxmBEiRhDd 7V/h4dydjcI+nPCAOZLKD8SpPxrMq1uA1FOYjkSBWBh0f82wcANh+dEFe2hWIerkaZPV R7mY45T1lGt6PPZpw4zMX2iPUJ6tZ/5sKy1qlMP0mc3wUXq6LpfnZyl3pSM6f22otbp5 BCNA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Of81VsYOGaJiC6DFXM6sxGYcirLfRpMhUkY/yI+h7kc=; b=dpe7MO5yNPjJOD6pNxRuArZ7++05XNtRPP8xXitwwP3HTMVFevvgFpgg6YonnxYYqG eY0MOVA15pZ6I1g4FtanrJYSwOtDQgJ9P2rBPLde1sW+ey016G1Vo8fC16JM+nvNYulj +2tNx5LLD0hbO05cnV5+xbjsBOPvloZehtkxSUbLSXtj8mtTiMXxEyt1W9FFEOz4/d4t GZiiOagUHShfD9aFRAmE9qKM3So62cDuzW6SK1KWQ/dQQQVWqTuavJ70dGc6NgcAEYlK PKSG9nZbZADSlPA+2KnMm1EKG9NzKoztLrV5paXcdYSj0+sZbhxwTYHBFoThxm9wEzkF uE4g== X-Gm-Message-State: AO0yUKUCkRCKGExYhIeAB3biDJOYmrjFQPJSxlM6WIKLnQ4bZCt/kqIU qUPc9lYYqwEglHxqx4AEWiVjNf/ryU5gxqUm X-Google-Smtp-Source: AK7set9tTFKkYw3kZzJu84DBO6wAm5FU1Hz3uatyB+dq4su6im3hpPVARJ7+PJEJD7fhUdzldcN3emm0/I1SXTEb X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:889:0:b0:95b:7778:5158 with SMTP id e9-20020a5b0889000000b0095b77785158mr63089ybq.12.1676680140991; Fri, 17 Feb 2023 16:29:00 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:51 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-19-jthoughton@google.com> Subject: [PATCH v2 18/46] hugetlb: add HGM support to __unmap_hugepage_range From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Enlighten __unmap_hugepage_range to deal with high-granularity mappings. This doesn't change its API; it still must be called with hugepage alignment, but it will correctly unmap hugepages that have been mapped at high granularity. Eventually, functionality here can be expanded to allow users to call MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is not done here. Introduce hugetlb_remove_rmap to properly decrement mapcount for high-granularity-mapped HugeTLB pages. 
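To make the mapcount rule concrete, here is a minimal userspace sketch; it is
not kernel code and not part of this patch, and the 4K base-page / 1G hstate
sizes are illustrative assumptions. It only models the arithmetic performed by
hugetlb_remove_rmap: a leaf at the hstate level touches just the head page's
compound mapcount, while a high-granularity leaf touches the mapcount of every
base page it covers.

#include <stdio.h>

#define PAGE_SHIFT	12UL	/* 4K base pages (assumption) */
#define HSTATE_SHIFT	30UL	/* 1G hugepage (assumption) */

/*
 * Number of mapcount updates implied by unmapping one leaf entry of the
 * given shift, mirroring the two cases in hugetlb_remove_rmap().
 */
static unsigned long rmap_updates(unsigned long shift)
{
	if (shift == HSTATE_SHIFT)
		return 1;			/* compound mapcount only */
	return 1UL << (shift - PAGE_SHIFT);	/* one per covered base page */
}

int main(void)
{
	printf("1G leaf: %lu update(s)\n", rmap_updates(30));	/* 1 */
	printf("2M leaf: %lu update(s)\n", rmap_updates(21));	/* 512 */
	printf("4K leaf: %lu update(s)\n", rmap_updates(12));	/* 1 */
	return 0;
}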
Signed-off-by: James Houghton diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index b46617207c93..31267471760e 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -598,9 +598,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gathe= r *tlb, __tlb_remove_tlb_entry(tlb, ptep, address); \ } while (0) =20 -#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \ +#define tlb_remove_huge_tlb_entry(tlb, hpte, address) \ do { \ - unsigned long _sz =3D huge_page_size(h); \ + unsigned long _sz =3D hugetlb_pte_size(&hpte); \ if (_sz >=3D P4D_SIZE) \ tlb_flush_p4d_range(tlb, address, _sz); \ else if (_sz >=3D PUD_SIZE) \ @@ -609,7 +609,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gathe= r *tlb, tlb_flush_pmd_range(tlb, address, _sz); \ else \ tlb_flush_pte_range(tlb, address, _sz); \ - __tlb_remove_tlb_entry(tlb, ptep, address); \ + __tlb_remove_tlb_entry(tlb, hpte.ptep, address);\ } while (0) =20 /** diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b767b6889dea..1a1a71868dfd 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -160,6 +160,9 @@ struct hugepage_subpool *hugepage_new_subpool(struct hs= tate *h, long max_hpages, long min_hpages); void hugepage_put_subpool(struct hugepage_subpool *spool); =20 +void hugetlb_remove_rmap(struct page *subpage, unsigned long shift, + struct hstate *h, struct vm_area_struct *vma); + void hugetlb_dup_vma_private(struct vm_area_struct *vma); void clear_vma_resv_huge_pages(struct vm_area_struct *vma); int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff= _t *); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ed1d806020de..ecf1a28dbaaa 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -120,6 +120,28 @@ enum hugetlb_level hpage_size_to_level(unsigned long s= z) return HUGETLB_LEVEL_PGD; } =20 +void hugetlb_remove_rmap(struct page *subpage, unsigned long shift, + struct hstate *h, struct vm_area_struct *vma) +{ + struct page *hpage =3D compound_head(subpage); + + if (shift =3D=3D huge_page_shift(h)) { + VM_BUG_ON_PAGE(subpage !=3D hpage, subpage); + page_remove_rmap(hpage, vma, true); + } else { + unsigned long nr_subpages =3D 1UL << (shift - PAGE_SHIFT); + struct page *final_page =3D &subpage[nr_subpages]; + + VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage); + /* + * Decrement the mapcount on each page that is getting + * unmapped. + */ + for (; subpage < final_page; ++subpage) + page_remove_rmap(subpage, vma, false); + } +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -5466,10 +5488,10 @@ static void __unmap_hugepage_range(struct mmu_gathe= r *tlb, struct vm_area_struct { struct mm_struct *mm =3D vma->vm_mm; unsigned long address; - pte_t *ptep; + struct hugetlb_pte hpte; pte_t pte; spinlock_t *ptl; - struct page *page; + struct page *hpage, *subpage; struct hstate *h =3D hstate_vma(vma); unsigned long sz =3D huge_page_size(h); unsigned long last_addr_mask; @@ -5479,35 +5501,33 @@ static void __unmap_hugepage_range(struct mmu_gathe= r *tlb, struct vm_area_struct BUG_ON(start & ~huge_page_mask(h)); BUG_ON(end & ~huge_page_mask(h)); =20 - /* - * This is a hugetlb vma, all the pte entries should point - * to huge page. 
- */ - tlb_change_page_size(tlb, sz); tlb_start_vma(tlb, vma); =20 last_addr_mask =3D hugetlb_mask_last_page(h); address =3D start; - for (; address < end; address +=3D sz) { - ptep =3D hugetlb_walk(vma, address, sz); - if (!ptep) { - address |=3D last_addr_mask; + + while (address < end) { + if (hugetlb_full_walk(&hpte, vma, address)) { + address =3D (address | last_addr_mask) + sz; continue; } =20 - ptl =3D huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, vma, address, ptep)) { + ptl =3D hugetlb_pte_lock(&hpte); + if (hugetlb_pte_size(&hpte) =3D=3D sz && + huge_pmd_unshare(mm, vma, address, hpte.ptep)) { spin_unlock(ptl); tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE); force_flush =3D true; address |=3D last_addr_mask; + address +=3D sz; continue; } =20 - pte =3D huge_ptep_get(ptep); + pte =3D huge_ptep_get(hpte.ptep); + if (huge_pte_none(pte)) { spin_unlock(ptl); - continue; + goto next_hpte; } =20 /* @@ -5523,24 +5543,35 @@ static void __unmap_hugepage_range(struct mmu_gathe= r *tlb, struct vm_area_struct */ if (pte_swp_uffd_wp_any(pte) && !(zap_flags & ZAP_FLAG_DROP_MARKER)) - set_huge_pte_at(mm, address, ptep, + set_huge_pte_at(mm, address, hpte.ptep, make_pte_marker(PTE_MARKER_UFFD_WP)); else - huge_pte_clear(mm, address, ptep, sz); + huge_pte_clear(mm, address, hpte.ptep, + hugetlb_pte_size(&hpte)); + spin_unlock(ptl); + goto next_hpte; + } + + if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) { + /* + * We raced with someone splitting out from under us. + * Retry the walk. + */ spin_unlock(ptl); continue; } =20 - page =3D pte_page(pte); + subpage =3D pte_page(pte); + hpage =3D compound_head(subpage); /* * If a reference page is supplied, it is because a specific * page is being unmapped, not a range. Ensure the page we * are about to unmap is the actual page of interest. */ if (ref_page) { - if (page !=3D ref_page) { + if (hpage !=3D ref_page) { spin_unlock(ptl); - continue; + goto next_hpte; } /* * Mark the VMA as having unmapped its page so that @@ -5550,25 +5581,32 @@ static void __unmap_hugepage_range(struct mmu_gathe= r *tlb, struct vm_area_struct set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED); } =20 - pte =3D huge_ptep_get_and_clear(mm, address, ptep); - tlb_remove_huge_tlb_entry(h, tlb, ptep, address); + pte =3D huge_ptep_get_and_clear(mm, address, hpte.ptep); + tlb_change_page_size(tlb, hugetlb_pte_size(&hpte)); + tlb_remove_huge_tlb_entry(tlb, hpte, address); if (huge_pte_dirty(pte)) - set_page_dirty(page); + set_page_dirty(hpage); /* Leave a uffd-wp pte marker if needed */ if (huge_pte_uffd_wp(pte) && !(zap_flags & ZAP_FLAG_DROP_MARKER)) - set_huge_pte_at(mm, address, ptep, + set_huge_pte_at(mm, address, hpte.ptep, make_pte_marker(PTE_MARKER_UFFD_WP)); - hugetlb_count_sub(pages_per_huge_page(h), mm); - page_remove_rmap(page, vma, true); + hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm); + hugetlb_remove_rmap(subpage, hpte.shift, h, vma); =20 spin_unlock(ptl); - tlb_remove_page_size(tlb, page, huge_page_size(h)); /* - * Bail out after unmapping reference page if supplied + * Lower the reference count on the head page. + */ + tlb_remove_page_size(tlb, hpage, sz); + /* + * Bail out after unmapping reference page if supplied, + * and there's only one PTE mapping this page. 
*/ - if (ref_page) + if (ref_page && hugetlb_pte_size(&hpte) =3D=3D sz) break; +next_hpte: + address +=3D hugetlb_pte_size(&hpte); } tlb_end_vma(tlb, vma); =20 @@ -5846,7 +5884,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, st= ruct vm_area_struct *vma, /* Break COW or unshare */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); - page_remove_rmap(old_page, vma, true); + hugetlb_remove_rmap(old_page, huge_page_shift(h), h, vma); hugepage_add_new_anon_rmap(new_folio, vma, haddr); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, &new_folio->page, !unshare)); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 75C33C636D6 for ; Sat, 18 Feb 2023 00:30:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230166AbjBRAaE (ORCPT ); Fri, 17 Feb 2023 19:30:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41670 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230133AbjBRA3g (ORCPT ); Fri, 17 Feb 2023 19:29:36 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5D4426BDF7 for ; Fri, 17 Feb 2023 16:29:03 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id o14-20020a25810e000000b0095d2ada3d26so1816863ybk.5 for ; Fri, 17 Feb 2023 16:29:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=+tSPnCkiv+4IG9wAhN3nJiAPBMdRQ7SRBdMiKhfpNi8=; b=tIIvrAMRq+3HXFL5rW8Lh7/gCEy2v7mBrYIzZnmVImsr6rHLUrJkHvESEkKIKnP2HA ETKeSXB6gqBCteIJ68LmS1NgcW8NqFCp4N0/v9eSUb5HNjRwRpHF7O8+s4KVElrP4dI4 Zjvu8aQ6Opgdsvu/yAqPR0lJhv574KFnXRf4LH++FsPR7lEug2Wl9pAFrHWFu0GZq42N Vy0SevF4NJaR0vgAKFGoEdFDQKl01vAhAwbwFzmxhSRiOGi2HQEkv51SNB+xC5ZLD4Tm F1M7gD13G/x1ohwgsWJjFr2DhGgQpTZDe/17U0ZGQKm1J8xgGRPTne6Pgw+hiXkqL9v1 Pi9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+tSPnCkiv+4IG9wAhN3nJiAPBMdRQ7SRBdMiKhfpNi8=; b=2jPgZkFUViozVZ5jdoVD8eX8E9BfxcdbQBBu/MRq9Eqythmw03kh5CbbaoQqNVOB2I EQyOXwNu+dgDhgYOTsp/suhR8X0COTvOll1FKwOC0Ci4Wn/Q6RyHPyNs0rm7cDD8FUs8 DMgcjsBkbouiTbgTl/gl/HACx4thgm2UClCffL0uql1GE6e7KLLO9be3luNmB3Vx7nCW ACFcxS/rQAcVvramB9TxP2scXwTnyQR+kHVlbbcpX+A/Gffqc3kZ3LZwu6dlqx3mxE6h /veBKvXwfVgfOAtFepMixUKyOkE8fF0moZPpUyJvluONW+bnRLaRkfqRBCju715EYtgV oV+A== X-Gm-Message-State: AO0yUKXiGhVRyion8IEAQKlTv/bwfwq3WeQYtvMRgoPnkEnk5YJrfPKg fdeBX05Qqqo9IXtLmkuA80akt2EywcZOz1Kw X-Google-Smtp-Source: AK7set8J1uc8tp8/hEWZItT5UxOZqGWxRV48NY6+J5XvomWx6cLnGWfxbZYO/6vWe6Thx7C5x8jbEtY4jfNJnPC8 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a81:46d6:0:b0:52e:e3a8:d0b9 with SMTP id t205-20020a8146d6000000b0052ee3a8d0b9mr1575870ywa.509.1676680141961; Fri, 17 Feb 2023 16:29:01 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:52 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: 
<20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-20-jthoughton@google.com> Subject: [PATCH v2 19/46] hugetlb: add HGM support to hugetlb_change_protection From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The main change here is to do a high-granularity walk and pulling the shift from the walk (not from the hstate). Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ecf1a28dbaaa..7321c6602d6f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6900,15 +6900,15 @@ long hugetlb_change_protection(struct vm_area_struc= t *vma, { struct mm_struct *mm =3D vma->vm_mm; unsigned long start =3D address; - pte_t *ptep; pte_t pte; struct hstate *h =3D hstate_vma(vma); - long pages =3D 0, psize =3D huge_page_size(h); + long base_pages =3D 0, psize =3D huge_page_size(h); bool shared_pmd =3D false; struct mmu_notifier_range range; unsigned long last_addr_mask; bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; + struct hugetlb_pte hpte; =20 /* * In the case of shared PMDs, the area to flush could be beyond @@ -6926,39 +6926,43 @@ long hugetlb_change_protection(struct vm_area_struc= t *vma, hugetlb_vma_lock_write(vma); i_mmap_lock_write(vma->vm_file->f_mapping); last_addr_mask =3D hugetlb_mask_last_page(h); - for (; address < end; address +=3D psize) { + while (address < end) { spinlock_t *ptl; - ptep =3D hugetlb_walk(vma, address, psize); - if (!ptep) { + if (hugetlb_full_walk(&hpte, vma, address)) { if (!uffd_wp) { - address |=3D last_addr_mask; + address =3D (address | last_addr_mask) + psize; continue; } /* * Userfaultfd wr-protect requires pgtable * pre-allocations to install pte markers. + * + * Use hugetlb_full_walk_alloc to allocate + * the hstate-level PTE. */ - ptep =3D huge_pte_alloc(mm, vma, address, psize); - if (!ptep) { - pages =3D -ENOMEM; + if (hugetlb_full_walk_alloc(&hpte, vma, + address, psize)) { + base_pages =3D -ENOMEM; break; } } - ptl =3D huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, vma, address, ptep)) { + + ptl =3D hugetlb_pte_lock(&hpte); + if (hugetlb_pte_size(&hpte) =3D=3D psize && + huge_pmd_unshare(mm, vma, address, hpte.ptep)) { /* * When uffd-wp is enabled on the vma, unshare * shouldn't happen at all. Warn about it if it * happened due to some reason. */ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve); - pages++; + base_pages +=3D psize / PAGE_SIZE; spin_unlock(ptl); shared_pmd =3D true; - address |=3D last_addr_mask; + address =3D (address | last_addr_mask) + psize; continue; } - pte =3D huge_ptep_get(ptep); + pte =3D huge_ptep_get(hpte.ptep); if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) { /* Nothing to do. 
*/ } else if (unlikely(is_hugetlb_entry_migration(pte))) { @@ -6974,7 +6978,7 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, entry =3D make_readable_migration_entry( swp_offset(entry)); newpte =3D swp_entry_to_pte(entry); - pages++; + base_pages +=3D hugetlb_pte_size(&hpte) / PAGE_SIZE; } =20 if (uffd_wp) @@ -6982,34 +6986,49 @@ long hugetlb_change_protection(struct vm_area_struc= t *vma, else if (uffd_wp_resolve) newpte =3D pte_swp_clear_uffd_wp(newpte); if (!pte_same(pte, newpte)) - set_huge_pte_at(mm, address, ptep, newpte); + set_huge_pte_at(mm, address, hpte.ptep, newpte); } else if (unlikely(is_pte_marker(pte))) { /* No other markers apply for now. */ WARN_ON_ONCE(!pte_marker_uffd_wp(pte)); if (uffd_wp_resolve) /* Safe to modify directly (non-present->none). */ - huge_pte_clear(mm, address, ptep, psize); + huge_pte_clear(mm, address, hpte.ptep, + hugetlb_pte_size(&hpte)); } else if (!huge_pte_none(pte)) { pte_t old_pte; - unsigned int shift =3D huge_page_shift(hstate_vma(vma)); + unsigned int shift =3D hpte.shift; + + if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) { + /* + * Someone split the PTE from under us, so retry + * the walk, + */ + spin_unlock(ptl); + continue; + } =20 - old_pte =3D huge_ptep_modify_prot_start(vma, address, ptep); + old_pte =3D huge_ptep_modify_prot_start( + vma, address, hpte.ptep); pte =3D huge_pte_modify(old_pte, newprot); - pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); + pte =3D arch_make_huge_pte( + pte, shift, vma->vm_flags); if (uffd_wp) pte =3D huge_pte_mkuffd_wp(pte); else if (uffd_wp_resolve) pte =3D huge_pte_clear_uffd_wp(pte); - huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); - pages++; + huge_ptep_modify_prot_commit( + vma, address, hpte.ptep, + old_pte, pte); + base_pages +=3D hugetlb_pte_size(&hpte) / PAGE_SIZE; } else { /* None pte */ if (unlikely(uffd_wp)) /* Safe to modify directly (none->non-present). */ - set_huge_pte_at(mm, address, ptep, + set_huge_pte_at(mm, address, hpte.ptep, make_pte_marker(PTE_MARKER_UFFD_WP)); } spin_unlock(ptl); + address +=3D hugetlb_pte_size(&hpte); } /* * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare @@ -7032,7 +7051,7 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, hugetlb_vma_unlock_write(vma); mmu_notifier_invalidate_range_end(&range); =20 - return pages > 0 ? (pages << h->order) : pages; + return base_pages; } =20 /* Return true if reservation was successful, false otherwise. 
*/ --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D2EAEC636D6 for ; Sat, 18 Feb 2023 00:30:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230180AbjBRAaO (ORCPT ); Fri, 17 Feb 2023 19:30:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43096 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230144AbjBRA3l (ORCPT ); Fri, 17 Feb 2023 19:29:41 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6E7526BDCD for ; Fri, 17 Feb 2023 16:29:04 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53659b9818dso20129197b3.18 for ; Fri, 17 Feb 2023 16:29:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Wqs0fsEFEjqd+rJsTObxpTCfifnDhyGaJkzr9TFCSvk=; b=cJVTi2N6kM66ucMEEndfS5PbMZx/snZYIaX4pqh8wHLpb03XCb5lJGQNTsMPX9uGLM 3wrM65Jw9kjrzso9alDlQrPBuVrtt7sPBKbdMEoD3wjSNN9aVSq6XDimkW/z3s2yU53N 01HcXY5734HriS3KEBsLiLM4XpS0m8aYkKnF+jbQkjlqFDqIpmrep+3g8RZmFM7dnjig IH2rbnvxfAr1zdpMGTaaiYQFb5fCvbDD9c/lVJg17L/SIZ0Al4fOA+YIO6cW7+D3MDjz 3rGL2IRT+C4f2EXWISMVmkij1Ys+Tetb9JtllXr0s2awCerx9KT+6hSElNdAsCbda374 SHaQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Wqs0fsEFEjqd+rJsTObxpTCfifnDhyGaJkzr9TFCSvk=; b=u+N8/bXbdR9yqAGXnvdn+etPMWwFbvIO+c6RibUsWoIO6rc7KfHpHfO/VU8WQWXDYp T57BC6dztBoWemt3GDklyJfGlwPLgQxOVwrHl+ZtE4t+l8/jNr2Xnuax3FUqK/ISCiUH r9lIamgfWo9KTu+yGAEF/ODPz+YklVcgOe+NBlM5H78svZRFB0JNBquATIYDc3U0fWS3 8btNfVYkT3BnmpAoIrOemZCoKRlani+MqlPnScZ5GTwyfP9dsqv1X5BTTxNAnONh+cFo 3Dg59GJrbru8z+LMhlrRQOYKhBz/lAuZxot9/bdfZVykQ/vfQKRuk4LU3RgQLrvnEry1 PSng== X-Gm-Message-State: AO0yUKV88JXzHZabA/waC93Bno8iTssEFT+a9W22nmqKpXOXjteUtIEM dQzwIHERoSqVkjjXKBa3P58pozXvlAieTbvs X-Google-Smtp-Source: AK7set8Vb2mEtNk6xau6GwJBBjn11LS8z+AdmXnFLSNyWOrweHaxia9rgscf+J08KFziUFJlJkWb+B9Wav4nPE21 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:711:0:b0:97a:956d:6a4 with SMTP id g17-20020a5b0711000000b0097a956d06a4mr36513ybq.5.1676680143155; Fri, 17 Feb 2023 16:29:03 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:53 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-21-jthoughton@google.com> Subject: [PATCH v2 20/46] hugetlb: add HGM support to follow_hugetlb_page From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Enable high-granularity mapping support in GUP. In case it is confusing, pfn_offset is the offset (in PAGE_SIZE units) that vaddr points to within the subpage that hpte points to. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7321c6602d6f..c26b040f4fb5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6634,11 +6634,9 @@ static void record_subpages_vmas(struct page *page, = struct vm_area_struct *vma, } =20 static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma, - unsigned int flags, pte_t *pte, + unsigned int flags, pte_t pteval, bool *unshare) { - pte_t pteval =3D huge_ptep_get(pte); - *unshare =3D false; if (is_swap_pte(pteval)) return true; @@ -6713,11 +6711,13 @@ long follow_hugetlb_page(struct mm_struct *mm, stru= ct vm_area_struct *vma, int err =3D -EFAULT, refs; =20 while (vaddr < vma->vm_end && remainder) { - pte_t *pte; + pte_t *ptep, pte; spinlock_t *ptl =3D NULL; bool unshare =3D false; int absent; - struct page *page; + unsigned long pages_per_hpte; + struct page *page, *subpage; + struct hugetlb_pte hpte; =20 /* * If we have a pending SIGKILL, don't keep faulting pages and @@ -6734,13 +6734,19 @@ long follow_hugetlb_page(struct mm_struct *mm, stru= ct vm_area_struct *vma, * each hugepage. We have to make sure we get the * first, for the page indexing below to work. * - * Note that page table lock is not held when pte is null. + * hugetlb_full_walk will mask the address appropriately. + * + * Note that page table lock is not held when ptep is null. */ - pte =3D hugetlb_walk(vma, vaddr & huge_page_mask(h), - huge_page_size(h)); - if (pte) - ptl =3D huge_pte_lock(h, mm, pte); - absent =3D !pte || huge_pte_none(huge_ptep_get(pte)); + if (hugetlb_full_walk(&hpte, vma, vaddr)) { + ptep =3D NULL; + absent =3D true; + } else { + ptl =3D hugetlb_pte_lock(&hpte); + ptep =3D hpte.ptep; + pte =3D huge_ptep_get(ptep); + absent =3D huge_pte_none(pte); + } =20 /* * When coredumping, it suits get_dump_page if we just return @@ -6751,13 +6757,21 @@ long follow_hugetlb_page(struct mm_struct *mm, stru= ct vm_area_struct *vma, */ if (absent && (flags & FOLL_DUMP) && !hugetlbfs_pagecache_present(h, vma, vaddr)) { - if (pte) + if (ptep) spin_unlock(ptl); hugetlb_vma_unlock_read(vma); remainder =3D 0; break; } =20 + if (!absent && pte_present(pte) && + !hugetlb_pte_present_leaf(&hpte, pte)) { + /* We raced with someone splitting the PTE, so retry. 
*/ + spin_unlock(ptl); + hugetlb_vma_unlock_read(vma); + continue; + } + /* * We need call hugetlb_fault for both hugepages under migration * (in which case hugetlb_fault waits for the migration,) and @@ -6773,7 +6787,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct= vm_area_struct *vma, vm_fault_t ret; unsigned int fault_flags =3D 0; =20 - if (pte) + if (ptep) spin_unlock(ptl); hugetlb_vma_unlock_read(vma); =20 @@ -6822,8 +6836,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struc= t vm_area_struct *vma, continue; } =20 - pfn_offset =3D (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT; - page =3D pte_page(huge_ptep_get(pte)); + pfn_offset =3D (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT; + subpage =3D pte_page(pte); + pages_per_hpte =3D hugetlb_pte_size(&hpte) / PAGE_SIZE; + page =3D compound_head(subpage); =20 VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); @@ -6833,22 +6849,22 @@ long follow_hugetlb_page(struct mm_struct *mm, stru= ct vm_area_struct *vma, * and skip the same_page loop below. */ if (!pages && !vmas && !pfn_offset && - (vaddr + huge_page_size(h) < vma->vm_end) && - (remainder >=3D pages_per_huge_page(h))) { - vaddr +=3D huge_page_size(h); - remainder -=3D pages_per_huge_page(h); - i +=3D pages_per_huge_page(h); + (vaddr + hugetlb_pte_size(&hpte) < vma->vm_end) && + (remainder >=3D pages_per_hpte)) { + vaddr +=3D hugetlb_pte_size(&hpte); + remainder -=3D pages_per_hpte; + i +=3D pages_per_hpte; spin_unlock(ptl); hugetlb_vma_unlock_read(vma); continue; } =20 /* vaddr may not be aligned to PAGE_SIZE */ - refs =3D min3(pages_per_huge_page(h) - pfn_offset, remainder, + refs =3D min3(pages_per_hpte - pfn_offset, remainder, (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT); =20 if (pages || vmas) - record_subpages_vmas(nth_page(page, pfn_offset), + record_subpages_vmas(nth_page(subpage, pfn_offset), vma, refs, likely(pages) ? pages + i : NULL, vmas ? 
vmas + i : NULL); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E0F6C05027 for ; Sat, 18 Feb 2023 00:30:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230196AbjBRAaT (ORCPT ); Fri, 17 Feb 2023 19:30:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43108 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230009AbjBRA3l (ORCPT ); Fri, 17 Feb 2023 19:29:41 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E148360F8D for ; Fri, 17 Feb 2023 16:29:06 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-536566339d6so26929197b3.11 for ; Fri, 17 Feb 2023 16:29:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=xFEWCFEVqNzp10q8weB+17bzU0TWKtM8UIz/Xx7DMio=; b=jsCsMDMV78mjE8aA5hDj4ELvx3b3FYDfUCGtdB3AxH26EGW3FR4Wa7s8y12aCgjJAY xZHdXI0S9zLwyoyC2XXFQ/t878xZ76UJxTZdgaT1614gKQ8S86PPdMjKwYlq97dle3xZ X18JJ5MDR3KedDVpG1uX/f8VVj8pEdf241wOEt1/1ioMygZNzU7w6K5FfESBrPr6rcwL oyCOP/Tffhfxzt33/TtBRunbLaQrs+s+Ng2bav+9PXMaH8T+pd/4F//IKbtKNQ3pXtLn zm3LA/sIhVTcMgTue+SyM1JfkLU6x3PjoLgkHIa6SXkJODW1AhFZZ0jaKNWlzrRQQiMx uOCg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=xFEWCFEVqNzp10q8weB+17bzU0TWKtM8UIz/Xx7DMio=; b=HCe+oOi7FFFkPp3lIVuSIXEHLbjpBvrIAvnt9jZYus2jkriYkwRllplyGHUpSXFxiL GWUGb8D/3+tkjpktm2XkCHc1YEkKvYPMX/IO0xm7dftlu5zGjRwAxNB84uahGxvQZaqP KdMaDxYidZkpCPiw5EMJ2VJL/5wnJj29Bq2UAMgG7fJVCn40Dur5nntmSlw61aFwWIbY LQT8DGJAPYVdlWnzUDxmUc9qtfeewf8Ae3veTQKbat5dpGQutgqDBXHTSna2lmwX8cIE vQmnrDWl2k3KTkEt8x1GKdBxu+SXlvK/0tjaoNxRskeW7wmL64IzsX13omodT4mAKbzk fxCg== X-Gm-Message-State: AO0yUKWSFdTOlwxBJ9HUj+z/bXjJsr4DwRatM/wYlIOhaVBRDWe0QnWn /o8mgUNwSh0svXpEMpmHCTu7b5Cy29dkuxtN X-Google-Smtp-Source: AK7set+l/nyB7A0r6pHwCnr6M8cb+COlb7s5TL7EsXfsO+GgeuyPTV/etrosFrCflWYhmk4cAs1xco6548KS9NkT X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:10e:0:b0:94a:ebba:cba6 with SMTP id 14-20020a5b010e000000b0094aebbacba6mr249759ybx.9.1676680144192; Fri, 17 Feb 2023 16:29:04 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:54 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-22-jthoughton@google.com> Subject: [PATCH v2 21/46] hugetlb: add HGM support to hugetlb_follow_page_mask From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The change here is very simple: do a high-granularity walk. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c26b040f4fb5..693332b7e186 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6655,11 +6655,10 @@ struct page *hugetlb_follow_page_mask(struct vm_are= a_struct *vma, unsigned long address, unsigned int flags) { struct hstate *h =3D hstate_vma(vma); - struct mm_struct *mm =3D vma->vm_mm; - unsigned long haddr =3D address & huge_page_mask(h); struct page *page =3D NULL; spinlock_t *ptl; - pte_t *pte, entry; + pte_t entry; + struct hugetlb_pte hpte; =20 /* * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via @@ -6669,13 +6668,24 @@ struct page *hugetlb_follow_page_mask(struct vm_are= a_struct *vma, return NULL; =20 hugetlb_vma_lock_read(vma); - pte =3D hugetlb_walk(vma, haddr, huge_page_size(h)); - if (!pte) + + if (hugetlb_full_walk(&hpte, vma, address)) goto out_unlock; =20 - ptl =3D huge_pte_lock(h, mm, pte); - entry =3D huge_ptep_get(pte); +retry: + ptl =3D hugetlb_pte_lock(&hpte); + entry =3D huge_ptep_get(hpte.ptep); if (pte_present(entry)) { + if (unlikely(!hugetlb_pte_present_leaf(&hpte, entry))) { + /* + * We raced with someone splitting from under us. + * Keep walking to get to the real leaf. + */ + spin_unlock(ptl); + hugetlb_full_walk_continue(&hpte, vma, address); + goto retry; + } + page =3D pte_page(entry) + ((address & ~huge_page_mask(h)) >> PAGE_SHIFT); /* --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 183CFC05027 for ; Sat, 18 Feb 2023 00:30:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230159AbjBRAaR (ORCPT ); Fri, 17 Feb 2023 19:30:17 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229731AbjBRA3l (ORCPT ); Fri, 17 Feb 2023 19:29:41 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C3D4F6A077 for ; Fri, 17 Feb 2023 16:29:05 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-536582abb72so22515337b3.5 for ; Fri, 17 Feb 2023 16:29:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=JfAQD2GnLgelJPsgCkKvGQkDAwhBg880UG9Zo5xkj4o=; b=GhwC2T28C0lmYxfHkr1F39H6tq+748WK5+znOZC5ojr/ThSSPvOkW3JfFS+N39ixGB ghbNpsvcjrL12G/rlTpuIaTgDVDrM0g8p1X3gpqXE1+PzjgwrE4/bveclrEHpITrsH6/ vCh2zPSW0omBWie4q3qfP1M6ZsaWlNDa4GFKgD1OeOlMW8F6eFJFEyAiAd7ZbqdicqY8 v4DpXcrSMZw5s++QcfPtVYVTTZBANPOzW0cTGWCstj+659DoDrTPU4/tbwMfCr3764rU 0IShYiy8Ml8Z3PlRkk3ux+gdZUWzixjHuGIN16t1b0SrnlJSfYwfS4QIWpPPtcB+ore8 jTlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=JfAQD2GnLgelJPsgCkKvGQkDAwhBg880UG9Zo5xkj4o=; b=OVoCZRkCV2+XQYu6XMSUmSuSG0tCfAvK/JtVEsx9gcdd+PDenH8h84LkdwMnMOrERI MvSG5sXB6eLdG63IUB89t4y9DPKxGibVUsRh830Mv0qhyrZJ0FoXUklSiwxO4FD8u/4L xwzNGNcPHxMLgPQlpqWuKMnQUC+8f+IyQWAF8EM+f3zJPEFXhde4iBgFVPJat6Ck8fhq 0FKhjZBYGJruw+0KoE+oBMWu2xIeuPrKAQDElJwFRPS3PvMO0SW8/Wn+rpQzq/MGK2pW ohsyxOGSVGP+0Mq2/owJbRaDK2uSwTgqPUsREpTbXqGduEAMVIXbHmrgcnfxpO2GUEtG 1JyQ== X-Gm-Message-State: AO0yUKU1tSg8XwmEPzKtskkVGJnpmluShBvycBciUYWUPLJrCUhI/9fs 65lqhPx3QU0sXs5TcCtCxJLVBGoXjJlv5daq X-Google-Smtp-Source: AK7set+yokzqkL4u720B7KYDqcTj9sYjOsXNmiF1GrEsGqhb2zUZSxmLSlU9X3KMDaalRNX0T2p99Q9vNaJNFASA X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:28c:b0:997:bdfe:78c5 with SMTP id v12-20020a056902028c00b00997bdfe78c5mr59430ybh.6.1676680145067; Fri, 17 Feb 2023 16:29:05 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:55 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-23-jthoughton@google.com> Subject: [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This allows fork() to work with high-granularity mappings. The page table structure is copied such that partially mapped regions will remain partially mapped in the same way for the new process. A page's reference count is incremented for *each* portion of it that is mapped in the page table. For example, if you have a PMD-mapped 1G page, the reference count will be incremented by 512. mapcount is handled similar to THPs: if you're completely mapping a hugepage, then the compound_mapcount is incremented. If you're mapping a part of it, the subpages that are getting mapped will have their mapcounts incremented. 
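As a concrete illustration of the reference-count rule above, here is a minimal
userspace sketch; it is not kernel code and not part of this patch, and the
4K/2M/1G sizes are assumptions. It only computes how many head-page references
the child ends up taking when fork() copies a region that is mapped at a given
granularity, under the rule of one reference per copied leaf entry.

#include <stdio.h>

#define K(x)	((unsigned long)(x) << 10)
#define M(x)	((unsigned long)(x) << 20)
#define G(x)	((unsigned long)(x) << 30)

/*
 * One reference is taken per copied leaf entry, so the number of
 * references for a mapped region is its size divided by the size
 * that each leaf entry covers.
 */
static unsigned long refs_taken(unsigned long mapped_bytes,
				unsigned long leaf_size)
{
	return mapped_bytes / leaf_size;
}

int main(void)
{
	printf("1G page, one 1G leaf:  %lu ref\n",  refs_taken(G(1), G(1)));	/* 1 */
	printf("1G page, 2M leaves:    %lu refs\n", refs_taken(G(1), M(2)));	/* 512 */
	printf("first 4M in 4K leaves: %lu refs\n", refs_taken(M(4), K(4)));	/* 1024 */
	return 0;
}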
Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 1a1a71868dfd..2fe1eb6897d4 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -162,6 +162,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spoo= l); =20 void hugetlb_remove_rmap(struct page *subpage, unsigned long shift, struct hstate *h, struct vm_area_struct *vma); +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift, + struct hstate *h, struct vm_area_struct *vma); =20 void hugetlb_dup_vma_private(struct vm_area_struct *vma); void clear_vma_resv_huge_pages(struct vm_area_struct *vma); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 693332b7e186..210c6f2b16a5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -141,6 +141,37 @@ void hugetlb_remove_rmap(struct page *subpage, unsigne= d long shift, page_remove_rmap(subpage, vma, false); } } +/* + * hugetlb_add_file_rmap() - increment the mapcounts for file-backed huget= lb + * pages appropriately. + * + * For pages that are being mapped with their hstate-level PTE (e.g., a 1G= page + * being mapped with a 1G PUD), then we increment the compound_mapcount fo= r the + * head page. + * + * For pages that are being mapped with high-granularity, we increment the + * mapcounts for the individual subpages that are getting mapped. + */ +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift, + struct hstate *h, struct vm_area_struct *vma) +{ + struct page *hpage =3D compound_head(subpage); + + if (shift =3D=3D huge_page_shift(h)) { + VM_BUG_ON_PAGE(subpage !=3D hpage, subpage); + page_add_file_rmap(hpage, vma, true); + } else { + unsigned long nr_subpages =3D 1UL << (shift - PAGE_SHIFT); + struct page *final_page =3D &subpage[nr_subpages]; + + VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage); + /* + * Increment the mapcount on each page that is getting mapped. + */ + for (; subpage < final_page; ++subpage) + page_add_file_rmap(subpage, vma, false); + } +} =20 static inline bool subpool_is_free(struct hugepage_subpool *spool) { @@ -5210,7 +5241,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, struct vm_area_struct *src_vma) { pte_t *src_pte, *dst_pte, entry; - struct page *ptepage; + struct hugetlb_pte src_hpte, dst_hpte; + struct page *ptepage, *hpage; unsigned long addr; bool cow =3D is_cow_mapping(src_vma->vm_flags); struct hstate *h =3D hstate_vma(src_vma); @@ -5238,18 +5270,24 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, } =20 last_addr_mask =3D hugetlb_mask_last_page(h); - for (addr =3D src_vma->vm_start; addr < src_vma->vm_end; addr +=3D sz) { + addr =3D src_vma->vm_start; + while (addr < src_vma->vm_end) { spinlock_t *src_ptl, *dst_ptl; - src_pte =3D hugetlb_walk(src_vma, addr, sz); - if (!src_pte) { - addr |=3D last_addr_mask; + unsigned long hpte_sz; + + if (hugetlb_full_walk(&src_hpte, src_vma, addr)) { + addr =3D (addr | last_addr_mask) + sz; continue; } - dst_pte =3D huge_pte_alloc(dst, dst_vma, addr, sz); - if (!dst_pte) { - ret =3D -ENOMEM; + ret =3D hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr, + hugetlb_pte_size(&src_hpte)); + if (ret) break; - } + + src_pte =3D src_hpte.ptep; + dst_pte =3D dst_hpte.ptep; + + hpte_sz =3D hugetlb_pte_size(&src_hpte); =20 /* * If the pagetables are shared don't copy or take references. @@ -5259,13 +5297,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, * another vma. 
So page_count of ptep page is checked instead * to reliably determine whether pte is shared. */ - if (page_count(virt_to_page(dst_pte)) > 1) { - addr |=3D last_addr_mask; + if (hugetlb_pte_size(&dst_hpte) =3D=3D sz && + page_count(virt_to_page(dst_pte)) > 1) { + addr =3D (addr | last_addr_mask) + sz; continue; } =20 - dst_ptl =3D huge_pte_lock(h, dst, dst_pte); - src_ptl =3D huge_pte_lockptr(huge_page_shift(h), src, src_pte); + dst_ptl =3D hugetlb_pte_lock(&dst_hpte); + src_ptl =3D hugetlb_pte_lockptr(&src_hpte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); entry =3D huge_ptep_get(src_pte); again: @@ -5309,10 +5348,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, */ if (userfaultfd_wp(dst_vma)) set_huge_pte_at(dst, addr, dst_pte, entry); + } else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) { + /* Retry the walk. */ + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + continue; } else { - entry =3D huge_ptep_get(src_pte); ptepage =3D pte_page(entry); - get_page(ptepage); + hpage =3D compound_head(ptepage); + get_page(hpage); =20 /* * Failing to duplicate the anon rmap is a rare case @@ -5324,13 +5368,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, * need to be without the pgtable locks since we could * sleep during the process. */ - if (!PageAnon(ptepage)) { - page_add_file_rmap(ptepage, src_vma, true); - } else if (page_try_dup_anon_rmap(ptepage, true, + if (!PageAnon(hpage)) { + hugetlb_add_file_rmap(ptepage, + src_hpte.shift, h, src_vma); + } + /* + * It is currently impossible to get anonymous HugeTLB + * high-granularity mappings, so we use 'hpage' here. + * + * This will need to be changed when HGM support for + * anon mappings is added. + */ + else if (page_try_dup_anon_rmap(hpage, true, src_vma)) { pte_t src_pte_old =3D entry; struct folio *new_folio; =20 + /* + * If we are mapped at high granularity, we + * may end up allocating lots and lots of + * hugepages when we only need one. Bail out + * now. 
+ */ + if (hugetlb_pte_size(&src_hpte) !=3D sz) { + put_page(hpage); + ret =3D -EINVAL; + break; + } + spin_unlock(src_ptl); spin_unlock(dst_ptl); /* Do not use reserve as it's private owned */ @@ -5342,7 +5407,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, } copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma, npages); - put_page(ptepage); + put_page(hpage); =20 /* Install the new hugetlb folio if src pte stable */ dst_ptl =3D huge_pte_lock(h, dst, dst_pte); @@ -5360,6 +5425,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio); spin_unlock(src_ptl); spin_unlock(dst_ptl); + addr +=3D hugetlb_pte_size(&src_hpte); continue; } =20 @@ -5376,10 +5442,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, } =20 set_huge_pte_at(dst, addr, dst_pte, entry); - hugetlb_count_add(npages, dst); + hugetlb_count_add( + hugetlb_pte_size(&dst_hpte) / PAGE_SIZE, + dst); } spin_unlock(src_ptl); spin_unlock(dst_ptl); + addr +=3D hugetlb_pte_size(&src_hpte); } =20 if (cow) { --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AEDA2C05027 for ; Sat, 18 Feb 2023 00:30:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230200AbjBRAaX (ORCPT ); Fri, 17 Feb 2023 19:30:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43122 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230086AbjBRA3m (ORCPT ); Fri, 17 Feb 2023 19:29:42 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0F2DC6CA0B for ; Fri, 17 Feb 2023 16:29:08 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53659bb998bso19841337b3.9 for ; Fri, 17 Feb 2023 16:29:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=LnPCXYu0bp3jbi4b+RX4dTJj+ODlc15FNHREFY5xs1M=; b=eyB1wewXCAx1HDERPMCO0zwZAas4YhYFgV+T58Xo/niWsDRG5UC3Rs6XOZmH1VN6an BDeTgeZmAtV2U/RJBmPsaw2Pzs+502TrDV5JW7uSQjIUphjyqQ6JBLVLZiSywnbCwmN7 glxseLDjZd3DHe25r9gB54clLLLOql2zDYooJkeUUmIQE39+XJaDt5UKqvRy0vPMNP/Y dlrZCv9EsYeOE222dzKUD7vcvUi3Z842TugU3j8cUBo6f42mvUEeN6sqIIjBwYUd3WND 0vcWoKQXj/p0vdSibSMZSUoLzsYHkjTdgECKTuMGuOdbRhA9Y6lS7gs45B0ytQxsj3iT jZcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=LnPCXYu0bp3jbi4b+RX4dTJj+ODlc15FNHREFY5xs1M=; b=OLyhVAiIqtK+yl2Irpuzxp2eYUip01qlGu5vPCmxhn7NQg9GV/vsg63ElMTvbiYNkR ZwgHTxNRBvrHj9CRhrVC7RodZ5i6mCjagpNixUv5UhHPteN5THoCM+xGr0cgtPMAaHiE 04A3mtDU44qCsU+NUwU8hsLjJjRCMT823oVljq0TghxOBtE4Z6CnGCHzVb43iYsQdVDI pPPjTv5QnyOEONh2j3iR7Q5KEYy9lr550xZfUJjAKbJHXCqkIh6gaK47NtZrF8+zRYe/ rtQPir8F6JXYVDQn2FyJjSPE1fL8NM+r40PKNDU32G6GxsRYM3fp8y3aXRHveUHAAqfL QRuA== X-Gm-Message-State: AO0yUKXlACgaFw7dzuLliG3E6cMeqyqLyryWYx3OcZJ9otOD9J17vH72 2zKOHYZdl73bH6ygctbbR8OFvjBIdGzULVQo X-Google-Smtp-Source: 
AK7set8Hmr9k0Rf2FDN/phGApjkx1AsdUQUwGd2TSqsgrixKWblzVEB77ByAXj7RTzk+3xBvLZoYclWlE/0WLKho X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:ec7:0:b0:965:bac9:d458 with SMTP id a7-20020a5b0ec7000000b00965bac9d458mr8139ybs.11.1676680146246; Fri, 17 Feb 2023 16:29:06 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:56 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-24-jthoughton@google.com> Subject: [PATCH v2 23/46] hugetlb: add HGM support to move_hugetlb_page_tables From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is very similar to the support that was added to copy_hugetlb_page_range. We simply do a high-granularity walk now, and most of the rest of the code stays the same. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 210c6f2b16a5..6c4678b7a07d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5461,16 +5461,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, return ret; } =20 -static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_ad= dr, - unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte) +static void move_hugetlb_pte(struct vm_area_struct *vma, unsigned long old= _addr, + unsigned long new_addr, struct hugetlb_pte *src_hpte, + struct hugetlb_pte *dst_hpte) { - struct hstate *h =3D hstate_vma(vma); struct mm_struct *mm =3D vma->vm_mm; spinlock_t *src_ptl, *dst_ptl; pte_t pte; =20 - dst_ptl =3D huge_pte_lock(h, mm, dst_pte); - src_ptl =3D huge_pte_lockptr(huge_page_shift(h), mm, src_pte); + dst_ptl =3D hugetlb_pte_lock(dst_hpte); + src_ptl =3D hugetlb_pte_lockptr(src_hpte); =20 /* * We don't have to worry about the ordering of src and dst ptlocks @@ -5479,8 +5479,8 @@ static void move_huge_pte(struct vm_area_struct *vma,= unsigned long old_addr, if (src_ptl !=3D dst_ptl) spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); =20 - pte =3D huge_ptep_get_and_clear(mm, old_addr, src_pte); - set_huge_pte_at(mm, new_addr, dst_pte, pte); + pte =3D huge_ptep_get_and_clear(mm, old_addr, src_hpte->ptep); + set_huge_pte_at(mm, new_addr, dst_hpte->ptep, pte); =20 if (src_ptl !=3D dst_ptl) spin_unlock(src_ptl); @@ -5498,9 +5498,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *v= ma, struct mm_struct *mm =3D vma->vm_mm; unsigned long old_end =3D old_addr + len; unsigned long last_addr_mask; - pte_t *src_pte, *dst_pte; struct mmu_notifier_range range; bool shared_pmd =3D false; + struct hugetlb_pte src_hpte, dst_hpte; =20 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr, old_end); @@ -5516,28 +5516,35 @@ int move_hugetlb_page_tables(struct vm_area_struct = *vma, /* Prevent race with file truncation */ hugetlb_vma_lock_write(vma); i_mmap_lock_write(mapping); - for (; old_addr < old_end; old_addr +=3D sz, new_addr +=3D sz) { - src_pte 
=3D hugetlb_walk(vma, old_addr, sz); - if (!src_pte) { - old_addr |=3D last_addr_mask; - new_addr |=3D last_addr_mask; + while (old_addr < old_end) { + if (hugetlb_full_walk(&src_hpte, vma, old_addr)) { + /* The hstate-level PTE wasn't allocated. */ + old_addr =3D (old_addr | last_addr_mask) + sz; + new_addr =3D (new_addr | last_addr_mask) + sz; continue; } - if (huge_pte_none(huge_ptep_get(src_pte))) + + if (huge_pte_none(huge_ptep_get(src_hpte.ptep))) { + old_addr +=3D hugetlb_pte_size(&src_hpte); + new_addr +=3D hugetlb_pte_size(&src_hpte); continue; + } =20 - if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) { + if (hugetlb_pte_size(&src_hpte) =3D=3D sz && + huge_pmd_unshare(mm, vma, old_addr, src_hpte.ptep)) { shared_pmd =3D true; - old_addr |=3D last_addr_mask; - new_addr |=3D last_addr_mask; + old_addr =3D (old_addr | last_addr_mask) + sz; + new_addr =3D (new_addr | last_addr_mask) + sz; continue; } =20 - dst_pte =3D huge_pte_alloc(mm, new_vma, new_addr, sz); - if (!dst_pte) + if (hugetlb_full_walk_alloc(&dst_hpte, new_vma, new_addr, + hugetlb_pte_size(&src_hpte))) break; =20 - move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte); + move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte); + old_addr +=3D hugetlb_pte_size(&src_hpte); + new_addr +=3D hugetlb_pte_size(&src_hpte); } =20 if (shared_pmd) --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3526C05027 for ; Sat, 18 Feb 2023 00:30:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230206AbjBRAa2 (ORCPT ); Fri, 17 Feb 2023 19:30:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42034 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229540AbjBRA3q (ORCPT ); Fri, 17 Feb 2023 19:29:46 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7E8446A042 for ; Fri, 17 Feb 2023 16:29:08 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id g63-20020a25db42000000b00889c54916f2so1740410ybf.14 for ; Fri, 17 Feb 2023 16:29:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=WTnKSuVZCt+rhuzvxEm8L73SzZuvMukxFScaxZ9D8Co=; b=L30EdOY7ZdwK//mLU+7411LW4r7AXP+dL/wo9+jXO/WcJ04CoJuWdXFBiiCSNGb46t sBaVLxi21hhAwKqlQ42PTpoktSV3w8ZCG6GtaV6zNLoCZsWyIozI7LT3SGds8vUSGtvL 8bn+t4XSOU3//vae5pDb5V2Wt0nqx+ASlUpTQxUcwOKp6auK0QYbvPxXLS//Whkg8Xep plK+8niTC21BXzXzAR2uTAn/GBxuv7yY3ZRhMgy6+GQbwBq7yNFIB3yLNd8/V4O+5B+l K1VKQkZL1YOcNbEpaCrjieJj1keaKiBGGAuUTdjO/zQEfWTY5Cm/wW0Vm3cp6E52Cyi6 zD3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=WTnKSuVZCt+rhuzvxEm8L73SzZuvMukxFScaxZ9D8Co=; b=0EQQyy4WyF59saCcw0ncJxjXFGXYuH36e6jym9jikNLUgOcyGZC9wj+pGWiP0m2+4z efmGDNmuoVvC1j/k6cnoiU+pvNGHfmaBu9hGE7KTh5JlSdrMo0jVDbobn3Yd/4oGukWy m8qqCriHUfZivi7GG5+YNxijINapKefmFHASrjjHvK8rpnOnbjPnMmJtkOcNkOf2WfdR v9TJcuQBKtTxIrlvmOtcWSY4/NEf3RUphasbvvfQsbC5Gm0LpUxkT+wJ0OUs2YGRe6HM 
o9VEbxNb+lowBW3JW9d9yHkPIApioohy2wSvlJeHsfbQBAvLSXAj9VgZBg6Va0edWU2t VkbA== X-Gm-Message-State: AO0yUKWOY7y+VEIZd7bwNB67p59ohrAZ3sbEylwvKbHPkKK3a1wGYTJX LlRXCkchX1XaOBHrQNUJlWhEA30QLmSsZKwO X-Google-Smtp-Source: AK7set+JZbRX7CDnj+Gizzbk/DH4kxZkKGMSGcEb4+M9JWKbZSU4JEyz12TmXNRwFx+ySFC7KGlj2zgmA6wzw7Gh X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:154b:b0:97a:ebd:a594 with SMTP id r11-20020a056902154b00b0097a0ebda594mr79653ybu.3.1676680147254; Fri, 17 Feb 2023 16:29:07 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:57 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-25-jthoughton@google.com> Subject: [PATCH v2 24/46] hugetlb: add HGM support to hugetlb_fault and hugetlb_no_page From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update the page fault handler to support high-granularity page faults. While handling a page fault on a partially-mapped HugeTLB page, if the PTE we find with hugetlb_pte_walk is none, then we will replace it with a leaf-level PTE to map the page. To give some examples: 1. For a completely unmapped 1G page, it will be mapped with a 1G PUD. 2. For a 1G page that has its first 512M mapped, any faults on the unmapped sections will result in 2M PMDs mapping each unmapped 2M section. 3. For a 1G page that has only its first 4K mapped, a page fault on its second 4K section will get a 4K PTE to map it. Unless high-granularity mappings are created via UFFDIO_CONTINUE, it is impossible for hugetlb_fault to create high-granularity mappings. This commit does not handle hugetlb_wp right now, and it doesn't handle HugeTLB page migration and swap entries. The BUG_ON in huge_pte_alloc is removed, as it is not longer valid when HGM is possible. HGM can be disabled if the VMA lock cannot be allocated after a VMA is split, yet high-granularity mappings may still exist. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6c4678b7a07d..86cd51beb02c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -173,6 +173,18 @@ void hugetlb_add_file_rmap(struct page *subpage, unsig= ned long shift, } } =20 +/* + * Find the subpage that corresponds to `addr` in `folio`. + */ +static struct page *hugetlb_find_subpage(struct hstate *h, struct folio *f= olio, + unsigned long addr) +{ + size_t idx =3D (addr & ~huge_page_mask(h))/PAGE_SIZE; + + BUG_ON(idx >=3D pages_per_huge_page(h)); + return folio_page(folio, idx); +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -6072,14 +6084,14 @@ static inline vm_fault_t hugetlb_handle_userfault(s= truct vm_area_struct *vma, * Recheck pte with pgtable lock. Returns true if pte didn't change, or * false if pte changed or is changing. 
*/ -static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, - pte_t *ptep, pte_t old_pte) +static bool hugetlb_pte_stable(struct hstate *h, struct hugetlb_pte *hpte, + pte_t old_pte) { spinlock_t *ptl; bool same; =20 - ptl =3D huge_pte_lock(h, mm, ptep); - same =3D pte_same(huge_ptep_get(ptep), old_pte); + ptl =3D hugetlb_pte_lock(hpte); + same =3D pte_same(huge_ptep_get(hpte->ptep), old_pte); spin_unlock(ptl); =20 return same; @@ -6088,7 +6100,7 @@ static bool hugetlb_pte_stable(struct hstate *h, stru= ct mm_struct *mm, static vm_fault_t hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping, pgoff_t idx, - unsigned long address, pte_t *ptep, + unsigned long address, struct hugetlb_pte *hpte, pte_t old_pte, unsigned int flags) { struct hstate *h =3D hstate_vma(vma); @@ -6096,10 +6108,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct = *mm, int anon_rmap =3D 0; unsigned long size; struct folio *folio; + struct page *subpage; pte_t new_pte; spinlock_t *ptl; unsigned long haddr =3D address & huge_page_mask(h); bool new_folio, new_pagecache_folio =3D false; + unsigned long haddr_hgm =3D address & hugetlb_pte_mask(hpte); u32 hash =3D hugetlb_fault_mutex_hash(mapping, idx); =20 /* @@ -6143,7 +6157,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *m= m, * never happen on the page after UFFDIO_COPY has * correctly installed the page and returned. */ - if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) { + if (!hugetlb_pte_stable(h, hpte, old_pte)) { ret =3D 0; goto out; } @@ -6167,7 +6181,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *m= m, * here. Before returning error, get ptl and make * sure there really is no pte entry. */ - if (hugetlb_pte_stable(h, mm, ptep, old_pte)) + if (hugetlb_pte_stable(h, hpte, old_pte)) ret =3D vmf_error(PTR_ERR(folio)); else ret =3D 0; @@ -6217,7 +6231,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *m= m, folio_unlock(folio); folio_put(folio); /* See comment in userfaultfd_missing() block above */ - if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) { + if (!hugetlb_pte_stable(h, hpte, old_pte)) { ret =3D 0; goto out; } @@ -6242,30 +6256,46 @@ static vm_fault_t hugetlb_no_page(struct mm_struct = *mm, vma_end_reservation(h, vma, haddr); } =20 - ptl =3D huge_pte_lock(h, mm, ptep); + ptl =3D hugetlb_pte_lock(hpte); ret =3D 0; - /* If pte changed from under us, retry */ - if (!pte_same(huge_ptep_get(ptep), old_pte)) + /* + * If pte changed from under us, retry. + * + * When dealing with high-granularity-mapped PTEs, it's possible that + * a non-contiguous PTE within our contiguous PTE group gets populated, + * in which case, we need to retry here. This is NOT caught here, and + * will need to be addressed when HGM is supported for architectures + * that support contiguous PTEs. + */ + if (!pte_same(huge_ptep_get(hpte->ptep), old_pte)) goto backout; =20 - if (anon_rmap) + subpage =3D hugetlb_find_subpage(h, folio, haddr_hgm); + + if (anon_rmap) { + VM_BUG_ON(&folio->page !=3D subpage); hugepage_add_new_anon_rmap(folio, vma, haddr); + } else - page_add_file_rmap(&folio->page, vma, true); - new_pte =3D make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE) - && (vma->vm_flags & VM_SHARED))); + hugetlb_add_file_rmap(subpage, hpte->shift, h, vma); + + new_pte =3D make_huge_pte_with_shift(vma, subpage, + ((vma->vm_flags & VM_WRITE) + && (vma->vm_flags & VM_SHARED)), + hpte->shift); /* * If this pte was previously wr-protected, keep it wr-protected even * if populated. 
*/ if (unlikely(pte_marker_uffd_wp(old_pte))) new_pte =3D huge_pte_mkuffd_wp(new_pte); - set_huge_pte_at(mm, haddr, ptep, new_pte); + set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte); =20 - hugetlb_count_add(pages_per_huge_page(h), mm); + hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm); if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { + WARN_ON_ONCE(hugetlb_pte_size(hpte) !=3D huge_page_size(h)); /* Optimization, do the COW without a second fault */ - ret =3D hugetlb_wp(mm, vma, address, ptep, flags, folio, ptl); + ret =3D hugetlb_wp(mm, vma, address, hpte->ptep, flags, folio, ptl); } =20 spin_unlock(ptl); @@ -6322,17 +6352,19 @@ u32 hugetlb_fault_mutex_hash(struct address_space *= mapping, pgoff_t idx) vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags) { - pte_t *ptep, entry; + pte_t entry; spinlock_t *ptl; vm_fault_t ret; u32 hash; pgoff_t idx; - struct page *page =3D NULL; - struct folio *pagecache_folio =3D NULL; + struct page *subpage =3D NULL; + struct folio *pagecache_folio =3D NULL, *folio =3D NULL; struct hstate *h =3D hstate_vma(vma); struct address_space *mapping; int need_wait_lock =3D 0; unsigned long haddr =3D address & huge_page_mask(h); + unsigned long haddr_hgm; + struct hugetlb_pte hpte; =20 /* * Serialize hugepage allocation and instantiation, so that we don't @@ -6346,26 +6378,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, stru= ct vm_area_struct *vma, =20 /* * Acquire vma lock before calling huge_pte_alloc and hold - * until finished with ptep. This prevents huge_pmd_unshare from - * being called elsewhere and making the ptep no longer valid. + * until finished with hpte. This prevents huge_pmd_unshare from + * being called elsewhere and making the hpte no longer valid. */ hugetlb_vma_lock_read(vma); - ptep =3D huge_pte_alloc(mm, vma, haddr, huge_page_size(h)); - if (!ptep) { + if (hugetlb_full_walk_alloc(&hpte, vma, address, 0)) { hugetlb_vma_unlock_read(vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); return VM_FAULT_OOM; } =20 - entry =3D huge_ptep_get(ptep); + entry =3D huge_ptep_get(hpte.ptep); /* PTE markers should be handled the same way as none pte */ - if (huge_pte_none_mostly(entry)) + if (huge_pte_none_mostly(entry)) { /* * hugetlb_no_page will drop vma lock and hugetlb fault * mutex internally, which make us return immediately. */ - return hugetlb_no_page(mm, vma, mapping, idx, address, ptep, + return hugetlb_no_page(mm, vma, mapping, idx, address, &hpte, entry, flags); + } =20 ret =3D 0; =20 @@ -6386,7 +6418,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, * be released there. */ mutex_unlock(&hugetlb_fault_mutex_table[hash]); - migration_entry_wait_huge(vma, ptep); + migration_entry_wait_huge(vma, hpte.ptep); return 0; } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) ret =3D VM_FAULT_HWPOISON_LARGE | @@ -6394,6 +6426,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struc= t vm_area_struct *vma, goto out_mutex; } =20 + if (!hugetlb_pte_present_leaf(&hpte, entry)) + /* We raced with someone splitting the entry. */ + goto out_mutex; + /* * If we are going to COW/unshare the mapping later, we examine the * pending reservations for this page now. 
This will ensure that any @@ -6413,14 +6449,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, stru= ct vm_area_struct *vma, pagecache_folio =3D filemap_lock_folio(mapping, idx); } =20 - ptl =3D huge_pte_lock(h, mm, ptep); + ptl =3D hugetlb_pte_lock(&hpte); =20 /* Check for a racing update before calling hugetlb_wp() */ - if (unlikely(!pte_same(entry, huge_ptep_get(ptep)))) + if (unlikely(!pte_same(entry, huge_ptep_get(hpte.ptep)))) goto out_ptl; =20 + /* haddr_hgm is the base address of the region that hpte maps. */ + haddr_hgm =3D address & hugetlb_pte_mask(&hpte); + /* Handle userfault-wp first, before trying to lock more pages */ - if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) && + if (userfaultfd_wp(vma) && huge_pte_uffd_wp(entry) && (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) { struct vm_fault vmf =3D { .vma =3D vma, @@ -6444,18 +6483,21 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, stru= ct vm_area_struct *vma, * pagecache_folio, so here we need take the former one * when page !=3D pagecache_folio or !pagecache_folio. */ - page =3D pte_page(entry); - if (page_folio(page) !=3D pagecache_folio) - if (!trylock_page(page)) { + subpage =3D pte_page(entry); + folio =3D page_folio(subpage); + if (folio !=3D pagecache_folio) + if (!trylock_page(&folio->page)) { need_wait_lock =3D 1; goto out_ptl; } =20 - get_page(page); + folio_get(folio); =20 if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { if (!huge_pte_write(entry)) { - ret =3D hugetlb_wp(mm, vma, address, ptep, flags, + WARN_ON_ONCE(hugetlb_pte_size(&hpte) !=3D + huge_page_size(h)); + ret =3D hugetlb_wp(mm, vma, address, hpte.ptep, flags, pagecache_folio, ptl); goto out_put_page; } else if (likely(flags & FAULT_FLAG_WRITE)) { @@ -6463,13 +6505,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, stru= ct vm_area_struct *vma, } } entry =3D pte_mkyoung(entry); - if (huge_ptep_set_access_flags(vma, haddr, ptep, entry, + if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry, flags & FAULT_FLAG_WRITE)) - update_mmu_cache(vma, haddr, ptep); + update_mmu_cache(vma, haddr_hgm, hpte.ptep); out_put_page: - if (page_folio(page) !=3D pagecache_folio) - unlock_page(page); - put_page(page); + if (folio !=3D pagecache_folio) + folio_unlock(folio); + folio_put(folio); out_ptl: spin_unlock(ptl); =20 @@ -6488,7 +6530,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, * here without taking refcount. */ if (need_wait_lock) - wait_on_page_locked(page); + wait_on_page_locked(&folio->page); return ret; } =20 @@ -7689,6 +7731,9 @@ int hugetlb_full_walk(struct hugetlb_pte *hpte, /* * hugetlb_full_walk_alloc - do a high-granularity walk, potentially alloc= ate * new PTEs. + * + * If @target_sz is 0, then only attempt to allocate the hstate-level PTE = and + * walk as far as we can go. 
*/ int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte, struct vm_area_struct *vma, @@ -7707,6 +7752,12 @@ int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte, if (!ptep) return -ENOMEM; =20 + if (!target_sz) { + WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, + PAGE_SIZE, false)); + return 0; + } + return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true); } =20 @@ -7735,7 +7786,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm= _area_struct *vma, pte =3D (pte_t *)pmd_alloc(mm, pud, addr); } } - BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte)); =20 return pte; } --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D83A4C636D6 for ; Sat, 18 Feb 2023 00:30:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229868AbjBRAao (ORCPT ); Fri, 17 Feb 2023 19:30:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43046 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230096AbjBRA3t (ORCPT ); Fri, 17 Feb 2023 19:29:49 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B5B356C015 for ; Fri, 17 Feb 2023 16:29:12 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id o137-20020a25418f000000b009419f64f6afso2165044yba.2 for ; Fri, 17 Feb 2023 16:29:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=UjVXJ8hdtbHTf+lUAZ4f6x6F3SynSj2CG1xD1+Yj8f8=; b=pv9sgXbRSK+ulujLGfdPPPAff8mO5bDv7HpAt6sVLZ6JoZzQXqgTKRKandAavm2wSz GaPYHHRdcJpckwms1PVrFc6vNiVTcxAlG6oCUMRdsxDqYTYXzVurmJEml8+ql0/TDF3v UCJ95R9PBOaSmpPE4Ic9Jp4SeoUjQgy3yUVhXKTk40u73yY7EVS1bRLPiHZZ1nL8e3sb q6IAgfazCkla4E5CBbTWQRtlKAAvX2g0orVfxrreagJHjk3BLCZ3ldDp7zxJzy4vym4x NWi0886Mcej8s6rWUU/kKQxFgDeHHLlKOLYF3V4C/36HC2UNIB85Ve9+c4SQOQU3rEkv hcBQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=UjVXJ8hdtbHTf+lUAZ4f6x6F3SynSj2CG1xD1+Yj8f8=; b=G/JMY2yUjZjyzJX5icYnwqF2rruwEoV0qq11xPtOKbjcWNvNFT28PQg6oQwi6QWA6u DYdZC6bwU2QVtz/RNMix0Z9E6ge0NVV7K/or/bEfttDDjzU+JagJSDWYnVbc7A56jf0I PBMSZoKNJxaR5ZQqGqlfKZe4uoRABy4brJA5iAW29V3/ini/7rYBQlXPEFngKk7mX4s9 gnD6NXP/mo995EZFDCbcQjk8RTzDX65p/mqiE+CZbDRtTutchPzALlqNK0iaGBgUNypT 7B1uSbia/rVkFwWuCQqLZTzU5QR2wItsX1eTUmS8zPEnzQ88WRTZCpCIaDaUsYU8KI8F Qh5Q== X-Gm-Message-State: AO0yUKVnlpcJ9F7IO0PPIqbZfrB+1t1P3EeoQiF6e3xghQAn5qwR6r3y vWZrFDi4jZF0uJHBV0IUffcRygvF7Rux++7/ X-Google-Smtp-Source: AK7set9HSTLEVb4H/dZ4jXxsGK/LepFZp0hDfPakpFLzNJu6RpLhAhLCkF7ri6HXdFjyEvBAOB+uhx84ME7Ia1t3 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:c4b:0:b0:8c3:7bc8:7f0e with SMTP id d11-20020a5b0c4b000000b008c37bc87f0emr1152747ybr.588.1676680148502; Fri, 17 Feb 2023 16:29:08 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:58 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> 
X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-26-jthoughton@google.com> Subject: [PATCH v2 25/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The main change in this commit is to walk_hugetlb_range to support walking HGM mappings, but all walk_hugetlb_range callers must be updated to use the new API and take the correct action. Listing all the changes to the callers: For s390 changes, we simply BUILD_BUG_ON if HGM is enabled. For smaps, shared_hugetlb (and private_hugetlb, although private mappings don't support HGM) may now not be divisible by the hugepage size. The appropriate changes have been made to support analyzing HGM PTEs. For pagemap, we ignore non-leaf PTEs by treating that as if they were none PTEs. We can only end up with non-leaf PTEs if they had just been updated from a none PTE. For show_numa_map, the challenge is that, if any of a hugepage is mapped, we have to count that entire page exactly once, as the results are given in units of hugepages. To support HGM mappings, we keep track of the last page that we looked it. If the hugepage we are currently looking at is the same as the last one, then we must be looking at an HGM-mapped page that has been mapped at high-granularity, and we've already accounted for it. For DAMON, we treat non-leaf PTEs as if they were blank, for the same reason as pagemap. For hwpoison, we proactively update the logic to support the case when hpte is pointing to a subpage within the poisoned hugepage. For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for now. For mincore, we ignore non-leaf PTEs for the same reason as pagemap. For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a non-leaf PTE. Signed-off-by: James Houghton diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 5a716bdcba05..e1d41caa8504 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -2629,14 +2629,20 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsig= ned long addr, return 0; } =20 -static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr, - unsigned long hmask, unsigned long next, +static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { - pmd_t *pmd =3D (pmd_t *)pte; + pmd_t *pmd =3D (pmd_t *)hpte->ptep; unsigned long start, end; struct page *page =3D pmd_page(*pmd); =20 + /* + * We don't support high-granularity mappings yet. If we did, the + * pmd_page() call above would be unsafe. + */ + BUILD_BUG_ON(IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)); + /* * The write check makes sure we do not set a key on shared * memory. 
This is needed as the walker does not differentiate diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 77b72f42556a..2f293b5dabc0 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -731,27 +731,39 @@ static void show_smap_vma_flags(struct seq_file *m, s= truct vm_area_struct *vma) } =20 #ifdef CONFIG_HUGETLB_PAGE -static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, - struct mm_walk *walk) +static int smaps_hugetlb_range(struct hugetlb_pte *hpte, + unsigned long addr, + struct mm_walk *walk) { struct mem_size_stats *mss =3D walk->private; struct vm_area_struct *vma =3D walk->vma; struct page *page =3D NULL; + pte_t pte =3D huge_ptep_get(hpte->ptep); =20 - if (pte_present(*pte)) { - page =3D vm_normal_page(vma, addr, *pte); - } else if (is_swap_pte(*pte)) { - swp_entry_t swpent =3D pte_to_swp_entry(*pte); + if (pte_present(pte)) { + /* We only care about leaf-level PTEs. */ + if (!hugetlb_pte_present_leaf(hpte, pte)) + /* + * The only case where hpte is not a leaf is that + * it was originally none, but it was split from + * under us. It was originally none, so exclude it. + */ + return 0; + + page =3D vm_normal_page(vma, addr, pte); + } else if (is_swap_pte(pte)) { + swp_entry_t swpent =3D pte_to_swp_entry(pte); =20 if (is_pfn_swap_entry(swpent)) page =3D pfn_swap_entry_to_page(swpent); } if (page) { - if (page_mapcount(page) >=3D 2 || hugetlb_pmd_shared(pte)) - mss->shared_hugetlb +=3D huge_page_size(hstate_vma(vma)); + unsigned long sz =3D hugetlb_pte_size(hpte); + + if (page_mapcount(page) >=3D 2 || hugetlb_pmd_shared(hpte->ptep)) + mss->shared_hugetlb +=3D sz; else - mss->private_hugetlb +=3D huge_page_size(hstate_vma(vma)); + mss->private_hugetlb +=3D sz; } return 0; } @@ -1569,22 +1581,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned = long addr, unsigned long end, =20 #ifdef CONFIG_HUGETLB_PAGE /* This function walks within one hugetlb entry in the single call */ -static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, - unsigned long addr, unsigned long end, +static int pagemap_hugetlb_range(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { struct pagemapread *pm =3D walk->private; struct vm_area_struct *vma =3D walk->vma; u64 flags =3D 0, frame =3D 0; int err =3D 0; - pte_t pte; + unsigned long hmask =3D hugetlb_pte_mask(hpte); + unsigned long end =3D addr + hugetlb_pte_size(hpte); + pte_t pte =3D huge_ptep_get(hpte->ptep); + struct page *page; =20 if (vma->vm_flags & VM_SOFTDIRTY) flags |=3D PM_SOFT_DIRTY; =20 - pte =3D huge_ptep_get(ptep); if (pte_present(pte)) { - struct page *page =3D pte_page(pte); + /* + * We raced with this PTE being split, which can only happen if + * it was blank before. Treat it is as if it were blank. 
+ */ + if (!hugetlb_pte_present_leaf(hpte, pte)) + return 0; + + page =3D pte_page(pte); =20 if (!PageAnon(page)) flags |=3D PM_FILE; @@ -1865,10 +1886,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t= pmd, } #endif =20 +struct show_numa_map_private { + struct numa_maps *md; + struct page *last_page; +}; + static int gather_pte_stats(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct numa_maps *md =3D walk->private; + struct show_numa_map_private *priv =3D walk->private; + struct numa_maps *md =3D priv->md; struct vm_area_struct *vma =3D walk->vma; spinlock_t *ptl; pte_t *orig_pte; @@ -1880,6 +1907,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long= addr, struct page *page; =20 page =3D can_gather_numa_stats_pmd(*pmd, vma, addr); + priv->last_page =3D page; if (page) gather_stats(page, md, pmd_dirty(*pmd), HPAGE_PMD_SIZE/PAGE_SIZE); @@ -1893,6 +1921,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long= addr, orig_pte =3D pte =3D pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { struct page *page =3D can_gather_numa_stats(*pte, vma, addr); + priv->last_page =3D page; if (!page) continue; gather_stats(page, md, pte_dirty(*pte), 1); @@ -1903,19 +1932,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned lo= ng addr, return 0; } #ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, struct mm_walk *walk) +static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long ad= dr, + struct mm_walk *walk) { - pte_t huge_pte =3D huge_ptep_get(pte); + struct show_numa_map_private *priv =3D walk->private; + pte_t huge_pte =3D huge_ptep_get(hpte->ptep); struct numa_maps *md; struct page *page; =20 - if (!pte_present(huge_pte)) + if (!hugetlb_pte_present_leaf(hpte, huge_pte)) + return 0; + + page =3D compound_head(pte_page(huge_pte)); + if (priv->last_page =3D=3D page) + /* we've already accounted for this page */ return 0; =20 - page =3D pte_page(huge_pte); + priv->last_page =3D page; =20 - md =3D walk->private; + md =3D priv->md; gather_stats(page, md, pte_dirty(huge_pte), 1); return 0; } @@ -1945,9 +1980,15 @@ static int show_numa_map(struct seq_file *m, void *v) struct file *file =3D vma->vm_file; struct mm_struct *mm =3D vma->vm_mm; struct mempolicy *pol; + char buffer[64]; int nid; =20 + struct show_numa_map_private numa_map_private; + + numa_map_private.md =3D md; + numa_map_private.last_page =3D NULL; + if (!mm) return 0; =20 @@ -1977,7 +2018,7 @@ static int show_numa_map(struct seq_file *m, void *v) seq_puts(m, " huge"); =20 /* mmap_lock is held by m_start */ - walk_page_vma(vma, &show_numa_ops, md); + walk_page_vma(vma, &show_numa_ops, &numa_map_private); =20 if (!md->pages) goto out; diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 27a6df448ee5..f4bddad615c2 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -3,6 +3,7 @@ #define _LINUX_PAGEWALK_H =20 #include +#include =20 struct mm_walk; =20 @@ -31,6 +32,10 @@ struct mm_walk; * ptl after dropping the vma lock, or else revalidate * those items after re-acquiring the vma lock and before * accessing them. + * In the presence of high-granularity hugetlb entries, + * @hugetlb_entry is called only for leaf-level entries + * (hstate-level entries are ignored if they are not + * leaves). * @test_walk: caller specific callback function to determine whether * we walk over the current vma or not. 
Returning 0 means * "do page table walk over the current vma", returning @@ -58,9 +63,8 @@ struct mm_walk_ops { unsigned long next, struct mm_walk *walk); int (*pte_hole)(unsigned long addr, unsigned long next, int depth, struct mm_walk *walk); - int (*hugetlb_entry)(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long next, - struct mm_walk *walk); + int (*hugetlb_entry)(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk); int (*test_walk)(unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pre_vma)(unsigned long start, unsigned long end, diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 1fec16d7263e..0f001950498a 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -330,11 +330,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned= long addr, } =20 #ifdef CONFIG_HUGETLB_PAGE -static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm, +static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry, + struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr) { bool referenced =3D false; - pte_t entry =3D huge_ptep_get(pte); struct folio *folio =3D pfn_folio(pte_pfn(entry)); =20 folio_get(folio); @@ -342,12 +342,12 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm= _struct *mm, if (pte_young(entry)) { referenced =3D true; entry =3D pte_mkold(entry); - set_huge_pte_at(mm, addr, pte, entry); + set_huge_pte_at(mm, addr, hpte->ptep, entry); } =20 #ifdef CONFIG_MMU_NOTIFIER if (mmu_notifier_clear_young(mm, addr, - addr + huge_page_size(hstate_vma(vma)))) + addr + hugetlb_pte_size(hpte))) referenced =3D true; #endif /* CONFIG_MMU_NOTIFIER */ =20 @@ -358,20 +358,26 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm= _struct *mm, folio_put(folio); } =20 -static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, +static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { - struct hstate *h =3D hstate_vma(walk->vma); spinlock_t *ptl; pte_t entry; =20 - ptl =3D huge_pte_lock(h, walk->mm, pte); - entry =3D huge_ptep_get(pte); + ptl =3D hugetlb_pte_lock(hpte); + entry =3D huge_ptep_get(hpte->ptep); if (!pte_present(entry)) goto out; =20 - damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr); + if (!hugetlb_pte_present_leaf(hpte, entry)) + /* + * We raced with someone splitting a blank PTE. Treat this PTE + * as if it were blank. + */ + goto out; + + damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr); =20 out: spin_unlock(ptl); @@ -483,8 +489,8 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned l= ong addr, } =20 #ifdef CONFIG_HUGETLB_PAGE -static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, +static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { struct damon_young_walk_private *priv =3D walk->private; @@ -493,11 +499,18 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsi= gned long hmask, spinlock_t *ptl; pte_t entry; =20 - ptl =3D huge_pte_lock(h, walk->mm, pte); - entry =3D huge_ptep_get(pte); + ptl =3D hugetlb_pte_lock(hpte); + entry =3D huge_ptep_get(hpte->ptep); if (!pte_present(entry)) goto out; =20 + if (!hugetlb_pte_present_leaf(hpte, entry)) + /* + * We raced with someone splitting a blank PTE. Treat this PTE + * as if it were blank. 
+ */ + goto out; + folio =3D pfn_folio(pte_pfn(entry)); folio_get(folio); =20 diff --git a/mm/hmm.c b/mm/hmm.c index 6a151c09de5e..d3e40cfdd4cb 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -468,8 +468,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long = start, unsigned long end, #endif =20 #ifdef CONFIG_HUGETLB_PAGE -static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, - unsigned long start, unsigned long end, +static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte, + unsigned long start, struct mm_walk *walk) { unsigned long addr =3D start, i, pfn; @@ -479,16 +479,24 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, uns= igned long hmask, unsigned int required_fault; unsigned long pfn_req_flags; unsigned long cpu_flags; + unsigned long hmask =3D hugetlb_pte_mask(hpte); + unsigned int order =3D hpte->shift - PAGE_SHIFT; + unsigned long end =3D start + hugetlb_pte_size(hpte); spinlock_t *ptl; pte_t entry; =20 - ptl =3D huge_pte_lock(hstate_vma(vma), walk->mm, pte); - entry =3D huge_ptep_get(pte); + ptl =3D hugetlb_pte_lock(hpte); + entry =3D huge_ptep_get(hpte->ptep); + + if (!hugetlb_pte_present_leaf(hpte, entry)) { + spin_unlock(ptl); + return -EAGAIN; + } =20 i =3D (start - range->start) >> PAGE_SHIFT; pfn_req_flags =3D range->hmm_pfns[i]; cpu_flags =3D pte_to_hmm_pfn_flags(range, entry) | - hmm_pfn_flags_order(huge_page_order(hstate_vma(vma))); + hmm_pfn_flags_order(order); required_fault =3D hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags); if (required_fault) { @@ -605,7 +613,7 @@ int hmm_range_fault(struct hmm_range *range) * in pfns. All entries < last in the pfn array are set to their * output, and all >=3D are still at their input values. */ - } while (ret =3D=3D -EBUSY); + } while (ret =3D=3D -EBUSY || ret =3D=3D -EAGAIN); return ret; } EXPORT_SYMBOL(hmm_range_fault); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index a1ede7bdce95..0b37cbc6e8ae 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -676,6 +676,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned l= ong addr, short shift, unsigned long poisoned_pfn, struct to_kill *tk) { unsigned long pfn =3D 0; + unsigned long base_pages_poisoned =3D (1UL << shift) / PAGE_SIZE; =20 if (pte_present(pte)) { pfn =3D pte_pfn(pte); @@ -686,7 +687,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned l= ong addr, short shift, pfn =3D swp_offset_pfn(swp); } =20 - if (!pfn || pfn !=3D poisoned_pfn) + if (!pfn || pfn < poisoned_pfn || + pfn >=3D poisoned_pfn + base_pages_poisoned) return 0; =20 set_to_kill(tk, addr, shift); @@ -752,16 +754,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned l= ong addr, } =20 #ifdef CONFIG_HUGETLB_PAGE -static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask, - unsigned long addr, unsigned long end, - struct mm_walk *walk) +static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte, + unsigned long addr, + struct mm_walk *walk) { struct hwp_walk *hwp =3D walk->private; - pte_t pte =3D huge_ptep_get(ptep); - struct hstate *h =3D hstate_vma(walk->vma); + pte_t pte =3D huge_ptep_get(hpte->ptep); =20 - return check_hwpoisoned_entry(pte, addr, huge_page_shift(h), - hwp->pfn, &hwp->tk); + return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte), + hpte->shift, hwp->pfn, &hwp->tk); } #else #define hwpoison_hugetlb_range NULL diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a256a241fd1d..0f91be88392b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -558,8 +558,8 @@ static int queue_folios_pte_range(pmd_t *pmd, 
unsigned = long addr, return addr !=3D end ? -EIO : 0; } =20 -static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, +static int queue_folios_hugetlb(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { int ret =3D 0; @@ -570,8 +570,12 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned l= ong hmask, spinlock_t *ptl; pte_t entry; =20 - ptl =3D huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); - entry =3D huge_ptep_get(pte); + /* We don't migrate high-granularity HugeTLB mappings for now. */ + if (hugetlb_hgm_enabled(walk->vma)) + return -EINVAL; + + ptl =3D hugetlb_pte_lock(hpte); + entry =3D huge_ptep_get(hpte->ptep); if (!pte_present(entry)) goto unlock; folio =3D pfn_folio(pte_pfn(entry)); @@ -608,7 +612,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned lo= ng hmask, */ if (flags & (MPOL_MF_MOVE_ALL) || (flags & MPOL_MF_MOVE && folio_estimated_sharers(folio) =3D=3D 1 && - !hugetlb_pmd_shared(pte))) { + !hugetlb_pmd_shared(hpte->ptep))) { if (!isolate_hugetlb(folio, qp->pagelist) && (flags & MPOL_MF_STRICT)) /* diff --git a/mm/mincore.c b/mm/mincore.c index a085a2aeabd8..0894965b3944 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -22,18 +22,29 @@ #include #include "swap.h" =20 -static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long = addr, - unsigned long end, struct mm_walk *walk) +static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr, + struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE unsigned char present; + unsigned long end =3D addr + hugetlb_pte_size(hpte); unsigned char *vec =3D walk->private; + pte_t pte =3D huge_ptep_get(hpte->ptep); =20 /* * Hugepages under user process are always in RAM and never * swapped out, but theoretically it needs to be checked. */ - present =3D pte && !huge_pte_none(huge_ptep_get(pte)); + present =3D !huge_pte_none(pte); + + /* + * If the pte is present but not a leaf, we raced with someone + * splitting it. For someone to have split it, it must have been + * huge_pte_none before, so treat it as such. + */ + if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte)) + present =3D false; + for (; addr !=3D end; vec++, addr +=3D PAGE_SIZE) *vec =3D present; walk->private =3D vec; diff --git a/mm/mprotect.c b/mm/mprotect.c index 1d4843c97c2a..61263ce9d925 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -564,12 +564,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned l= ong addr, 0 : -EACCES; } =20 -static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long next, +static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte, + unsigned long addr, struct mm_walk *walk) { - return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ? - 0 : -EACCES; + pte_t pte =3D huge_ptep_get(hpte->ptep); + + if (!hugetlb_pte_present_leaf(hpte, pte)) + return -EAGAIN; + return pfn_modify_allowed(pte_pfn(pte), + *(pgprot_t *)(walk->private)) ? 
0 : -EACCES; } =20 static int prot_none_test(unsigned long addr, unsigned long next, @@ -612,8 +616,10 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_ga= ther *tlb, (newflags & VM_ACCESS_FLAGS) =3D=3D 0) { pgprot_t new_pgprot =3D vm_get_page_prot(newflags); =20 - error =3D walk_page_range(current->mm, start, end, - &prot_none_walk_ops, &new_pgprot); + do { + error =3D walk_page_range(current->mm, start, end, + &prot_none_walk_ops, &new_pgprot); + } while (error =3D=3D -EAGAIN); if (error) return error; } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index cb23f8a15c13..05ce242f8b7e 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,6 +3,7 @@ #include #include #include +#include =20 /* * We want to know the real level where a entry is located ignoring any @@ -296,20 +297,21 @@ static int walk_hugetlb_range(unsigned long addr, uns= igned long end, struct vm_area_struct *vma =3D walk->vma; struct hstate *h =3D hstate_vma(vma); unsigned long next; - unsigned long hmask =3D huge_page_mask(h); - unsigned long sz =3D huge_page_size(h); - pte_t *pte; const struct mm_walk_ops *ops =3D walk->ops; int err =3D 0; + struct hugetlb_pte hpte; =20 hugetlb_vma_lock_read(vma); do { - next =3D hugetlb_entry_end(h, addr, end); - pte =3D hugetlb_walk(vma, addr & hmask, sz); - if (pte) - err =3D ops->hugetlb_entry(pte, hmask, addr, next, walk); - else if (ops->pte_hole) - err =3D ops->pte_hole(addr, next, -1, walk); + if (hugetlb_full_walk(&hpte, vma, addr)) { + next =3D hugetlb_entry_end(h, addr, end); + if (ops->pte_hole) + err =3D ops->pte_hole(addr, next, -1, walk); + } else { + err =3D ops->hugetlb_entry( + &hpte, addr, walk); + next =3D min(addr + hugetlb_pte_size(&hpte), end); + } if (err) break; } while (addr =3D next, addr !=3D end); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E6D6DC636D6 for ; Sat, 18 Feb 2023 00:30:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230243AbjBRAay (ORCPT ); Fri, 17 Feb 2023 19:30:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43080 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229694AbjBRA35 (ORCPT ); Fri, 17 Feb 2023 19:29:57 -0500 Received: from mail-vk1-xa4a.google.com (mail-vk1-xa4a.google.com [IPv6:2607:f8b0:4864:20::a4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6C86368E63 for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) Received: by mail-vk1-xa4a.google.com with SMTP id g1-20020ac5c5c1000000b00401b81d313bso828558vkl.6 for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1676680149; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=R2na1A3NSq7U/MTCBc/o5RGQOMMtDEqSAD5VSD0flvE=; b=qVyA3qUqF5DHCviCI1nzfYS5Svk9ua2LpWd4a99ORevkgEpjI4dqg4lKQMb+bwNcNw 61c9ND7a9iH3goStVFiSljoFRjNY2j1AhhPNcUU/a1rv5fdHxnLbmQMVtti88Q+JYZyU bhe8TGck/nYHhFokZ4sDeNAoyxCO64fiBBLXbHmrXkJokroObrlP3/h9q8bTGXc3eo/F vBbNamFNBm7arbW972qTZ0yke+aKbVBelBbPkEYjM+gx8b1eyLRItOwCX8k0eelqQ1oo E6X1WzYSjvE5m5sQqtV1xVs+q8GAKHv2jxgseAPVQAnj0ZSKEohG90BS+LNhWTC4WS5w 8pnQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1676680149; 
h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=R2na1A3NSq7U/MTCBc/o5RGQOMMtDEqSAD5VSD0flvE=; b=zNstsPHLth5kt06+gbX294bYdljsYHrbEn1ZyZmjZbRpk2DAWWwaRf0Y2h7MT65kN2 kNHatYxMZgbFyFlyTrBfedM0rkt7Ek734QZ+fcjmqCnsqVHAECF2WPksdITnosqjV4h3 zxE0cOyzh6XWAEIPHqrWZ105hzsUv89g4kFKd77JZ+skW0U6979Vpw24DdaZH+1zRXyu 83kuwnz2ay800WWZtwgs/Sn5agzv+89C9sNyHfXHnVSDksHWMzk6CGPxLDxpt1unRdku 2CgflRKwPuq0AEt08x48c2zXhL9AwJLb61oivTmHqEjaI7bpWpQiVa2vpc7L6jftEHTo d6Rw== X-Gm-Message-State: AO0yUKVSlJQiq9oKLyadBYwHXnE9rcSiNF9o2xRaVP2oyg+MPgj7Tvj5 xFJame8ugmcpiZehtfoSuUEGMTcZlhHKAopU X-Google-Smtp-Source: AK7set+z+B4A4WuUUUfTfMJ3HPbCsVBPAnASJYuSG63wOfEVQp9exRkXKeR5pjRe9t+Ysvo+wmN1lJQLga4KIPTk X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a67:c485:0:b0:412:4cf3:d0ed with SMTP id d5-20020a67c485000000b004124cf3d0edmr38832vsk.32.1676680149589; Fri, 17 Feb 2023 16:29:09 -0800 (PST) Date: Sat, 18 Feb 2023 00:27:59 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-27-jthoughton@google.com> Subject: [PATCH v2 26/46] mm: rmap: provide pte_order in page_vma_mapped_walk From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" page_vma_mapped_walk callers will need this information to know how HugeTLB pages are mapped. pte_order only applies if pte is not NULL. 
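For example, a page_vma_mapped_walk() caller could derive the number of
base pages mapped by the current entry from pte_order. This is only an
illustrative sketch; the helper below is hypothetical and not part of
this series:

	/* Illustrative only: base pages mapped at the current walk position. */
	static unsigned long pvmw_nr_pages(const struct page_vma_mapped_walk *pvmw)
	{
		/*
		 * PMD-mapped THPs are reported with pte == NULL; handle
		 * those through pvmw->pmd instead.
		 */
		if (!pvmw->pte)
			return 0;
		/*
		 * 1 for a normal PTE; for HugeTLB, the order of the PTE that
		 * actually maps the page (the full hstate order here, possibly
		 * smaller once high-granularity mappings are in play).
		 */
		return 1UL << pvmw->pte_order;
	}

The current entry then covers pvmw->address up to pvmw->address plus
that many base pages.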
Signed-off-by: James Houghton diff --git a/include/linux/rmap.h b/include/linux/rmap.h index a4570da03e58..87a2c7f422bf 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -387,6 +387,7 @@ struct page_vma_mapped_walk { pmd_t *pmd; pte_t *pte; spinlock_t *ptl; + unsigned int pte_order; unsigned int flags; }; =20 diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 4e448cfbc6ef..08295b122ad6 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -16,6 +16,7 @@ static inline bool not_found(struct page_vma_mapped_walk = *pvmw) static bool map_pte(struct page_vma_mapped_walk *pvmw) { pvmw->pte =3D pte_offset_map(pvmw->pmd, pvmw->address); + pvmw->pte_order =3D 0; if (!(pvmw->flags & PVMW_SYNC)) { if (pvmw->flags & PVMW_MIGRATION) { if (!is_swap_pte(*pvmw->pte)) @@ -177,6 +178,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *= pvmw) if (!pvmw->pte) return false; =20 + pvmw->pte_order =3D huge_page_order(hstate); pvmw->ptl =3D huge_pte_lock(hstate, mm, pvmw->pte); if (!check_pte(pvmw)) return not_found(pvmw); @@ -272,6 +274,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *= pvmw) } pte_unmap(pvmw->pte); pvmw->pte =3D NULL; + pvmw->pte_order =3D 0; goto restart; } pvmw->pte++; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 53CD8C636D6 for ; Sat, 18 Feb 2023 00:30:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230248AbjBRAa5 (ORCPT ); Fri, 17 Feb 2023 19:30:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43524 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230124AbjBRA36 (ORCPT ); Fri, 17 Feb 2023 19:29:58 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A7D6268E64 for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 84-20020a251457000000b0091231592671so2246930ybu.1 for ; Fri, 17 Feb 2023 16:29:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=C4SpvvUSe/+H51r29czUV8oyanUNWvkyfpCFhzIRnLk=; b=URp4YrNRsXrEMcz5vF68rwrPTWL0r/1PsOn0iwRwzUv8fGWjA2Zj3WsCDtOHnnEOXI 9xQcdgSYjiC0L8FMkrysuWd8Z4rTcFfESiyb98ZDWB3Rpv8/7b3sLIWRQtcVzhoZSCmY 1q2oSqG9SQQKP9k9222Pkp5CKnK0J5shuyy4KwIs9snSVsRw3F9uu39yYj2MkyXE/1KV b/75TnZWsolWJ9o2Dg5OQzts7dH32JqEhtpz0u50LcpTAXPOuZwDBizDHafxqGuZbm79 w0zdlsUS9tlTAwK9A8se/3bTurakeHRO+WC3ta4JwV799mBHwIfC4VY+Lrp2ra7exXyR sdeQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=C4SpvvUSe/+H51r29czUV8oyanUNWvkyfpCFhzIRnLk=; b=l4nyNGIA82EOLkVfVbGmyZ9oYoEjpM40IK7WRIpknJNFsF/FP9w7aE+RyFdITyxf5H IDuSHPscFo9kfhk9KotlF3KUwEJp2tASZiuxALhCMli0w3MHqpNRrOQyD+ZWkCrQl+32 Bz80GmRAutjWPpyOYFXC/EuWGbWcs4LuEXF8pWq8d/H5XJa1YaOG3KlC0sGZ9ZGXk3ZV qV6OftL5xkUykX+OOCnRryY9EAHXo8eO1IfTf2Crnse9JG1RuhLTCOxOZ71LXjd74sSf ruPZl3aFAOKpDNimYzkPfDkMkZImrgekpJdCx5H9xwcSghzT3qLQKdMtT0bDAWz1+7Oc 5ziA== X-Gm-Message-State: 
AO0yUKWVmX4X6bypioKQk/qtPd2LB6a34+Q3PaW1GdTUBP56Se3qLf1w 1MR8nUk3Yel4xrM5wNoGvU7rfWnLQcPfv5gn X-Google-Smtp-Source: AK7set/928+LwO6Lgp5pRPZtsNiWTzN/B8EG4aLn4PvCGnxGZvLIKoC9wr4DtxldmQEi9jOl1xJbJJDcIzKsLS/r X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a81:ac66:0:b0:52e:ed7f:6e82 with SMTP id z38-20020a81ac66000000b0052eed7f6e82mr257319ywj.9.1676680150801; Fri, 17 Feb 2023 16:29:10 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:00 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-28-jthoughton@google.com> Subject: [PATCH v2 27/46] mm: rmap: update try_to_{migrate,unmap} to handle mapcount for HGM From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Make use of the new pvmw->pte_order field to determine the size of the PTE we're unmapping/migrating. Signed-off-by: James Houghton diff --git a/mm/migrate.c b/mm/migrate.c index 9b4a7e75f6e6..616afcc40fdc 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -247,7 +247,7 @@ static bool remove_migration_pte(struct folio *folio, =20 #ifdef CONFIG_HUGETLB_PAGE if (folio_test_hugetlb(folio)) { - unsigned int shift =3D huge_page_shift(hstate_vma(vma)); + unsigned int shift =3D pvmw.pte_order + PAGE_SHIFT; =20 pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); if (folio_test_anon(folio)) diff --git a/mm/rmap.c b/mm/rmap.c index c010d0af3a82..0a019ae32f04 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1609,7 +1609,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) { pteval =3D swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { - hugetlb_count_sub(folio_nr_pages(folio), mm); + hugetlb_count_sub(1UL << pvmw.pte_order, mm); set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { dec_mm_counter(mm, mm_counter(&folio->page)); @@ -1757,7 +1757,13 @@ static bool try_to_unmap_one(struct folio *folio, st= ruct vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); + if (folio_test_hugetlb(folio)) + hugetlb_remove_rmap(subpage, + pvmw.pte_order + PAGE_SHIFT, + hstate_vma(vma), vma); + else + page_remove_rmap(subpage, vma, false); + if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); folio_put(folio); @@ -2020,7 +2026,7 @@ static bool try_to_migrate_one(struct folio *folio, s= truct vm_area_struct *vma, } else if (PageHWPoison(subpage)) { pteval =3D swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { - hugetlb_count_sub(folio_nr_pages(folio), mm); + hugetlb_count_sub(1L << pvmw.pte_order, mm); set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { dec_mm_counter(mm, mm_counter(&folio->page)); @@ -2112,7 +2118,12 @@ static bool try_to_migrate_one(struct folio *folio, = struct 
vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); + if (folio_test_hugetlb(folio)) + hugetlb_remove_rmap(subpage, + pvmw.pte_order + PAGE_SHIFT, + hstate_vma(vma), vma); + else + page_remove_rmap(subpage, vma, false); if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); folio_put(folio); @@ -2196,6 +2207,8 @@ static bool page_make_device_exclusive_one(struct fol= io *folio, args->owner); mmu_notifier_invalidate_range_start(&range); =20 + VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio); + while (page_vma_mapped_walk(&pvmw)) { /* Unexpected PMD-mapped THP? */ VM_BUG_ON_FOLIO(!pvmw.pte, folio); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A6786C05027 for ; Sat, 18 Feb 2023 00:31:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230264AbjBRAbE (ORCPT ); Fri, 17 Feb 2023 19:31:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43536 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230131AbjBRAaG (ORCPT ); Fri, 17 Feb 2023 19:30:06 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 003196929B for ; Fri, 17 Feb 2023 16:29:20 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id o14-20020a25810e000000b0095d2ada3d26so1817181ybk.5 for ; Fri, 17 Feb 2023 16:29:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=1rCBHumdufOlTcJkqiypODYa1r7LDyl4S1gdc35TorQ=; b=YN0PytyiHYPIu8m+M2E6v1NUQCnHXY6ytZcGKXyuh7z/uqk1m7+gCvJsZq1js1+Nyk aWy2chcaY0oSjM7oplmnZTRau4yrS7eatEn1k9ey5wfeHKB+fHXyH3hpyjtsmTvEZjCC 943Dvm4iECKIUtQU6bNEzs0C/hZeHT7vfxnkuu/ImNucvgQ6NZzZ5BLuTjeDWoY/lNEN mppTbWX+d43SPhJRbQNNU06lgt5y4Yc251XxwreuYdKf65ASB8UzsacjLW/LKF5eAHUL AdfT5Xxv2SVRLajtE01G0LVCPjbrRrRShVjKlJ9kluNRT6aaGpxY/8iNFGTbZhsxLO7k +hCw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1rCBHumdufOlTcJkqiypODYa1r7LDyl4S1gdc35TorQ=; b=tXI3GDKuAYV5c8vxt6CnXuPv00N1WYMT8XD3DkCRZN3GOwnMBNb4gaujSAVI221c3C LrigrvZL4+pNNPEKw+64zYUUzY0Wd2wIJUMk6C6/rKosX+iVm9BRjPKbuWJtezui38Jj vZ7N0sWQREXyag4g2ueTXWkJR05fWdVGFGdDoGeb3i9Vjd8K4kEeMH24tLekgLfWNzCs HtFEvPg2jWYIGKoSk9LSBUnXjesrbDxlE5XhLLks1CAvlVPpXvhUL9qpqGqNHLBKPbRx xEyyxpAdpfy5Ky6QecdhWXPzP+U/Bp4yFnAl2ulCfpHGwIXJxy8+GuetQbZWdRswH1uz ejPQ== X-Gm-Message-State: AO0yUKU32frAzHvZmGcJDye2BFA7PPVKEJOjfu/bksooCdVU71E7ucOl P1RdK95QkAC2fy9sdIity/VYNSWV/kA4TPh/ X-Google-Smtp-Source: AK7set+LO+JnPnboVPieFkXX7WpathP3qjh8yyUr3ixLzVkV8bA9em/CG8b88d9+skAi+Y+zl4Jc6Pv1vnp/FebD X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:910:0:b0:927:a3c1:b2de with SMTP id a16-20020a5b0910000000b00927a3c1b2demr200123ybq.7.1676680151721; Fri, 17 Feb 2023 16:29:11 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:01 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> 
Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-29-jthoughton@google.com> Subject: [PATCH v2 28/46] mm: rmap: in try_to_{migrate,unmap}, check head page for hugetlb page flags From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The main complication here is that HugeTLB pages have their poison status stored in the head page as the HWPoison page flag. Because HugeTLB high-granularity mapping can create PTEs that point to subpages instead of always the head of a hugepage, we need to check the compound_head for page flags. Signed-off-by: James Houghton diff --git a/mm/rmap.c b/mm/rmap.c index 0a019ae32f04..4908ede83173 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1456,10 +1456,11 @@ static bool try_to_unmap_one(struct folio *folio, s= truct vm_area_struct *vma, struct mm_struct *mm =3D vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; - struct page *subpage; + struct page *subpage, *page_flags_page; bool anon_exclusive, ret =3D true; struct mmu_notifier_range range; enum ttu_flags flags =3D (enum ttu_flags)(long)arg; + bool page_poisoned; =20 /* * When racing against e.g. zap_pte_range() on another cpu, @@ -1512,9 +1513,17 @@ static bool try_to_unmap_one(struct folio *folio, st= ruct vm_area_struct *vma, =20 subpage =3D folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio)); + /* + * We check the page flags of HugeTLB pages by checking the + * head page. + */ + page_flags_page =3D folio_test_hugetlb(folio) + ? &folio->page + : subpage; + page_poisoned =3D PageHWPoison(page_flags_page); address =3D pvmw.address; anon_exclusive =3D folio_test_anon(folio) && - PageAnonExclusive(subpage); + PageAnonExclusive(page_flags_page); =20 if (folio_test_hugetlb(folio)) { bool anon =3D folio_test_anon(folio); @@ -1523,7 +1532,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, * The try_to_unmap() is only passed a hugetlb page * in the case where the hugetlb page is poisoned. */ - VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage); + VM_BUG_ON_FOLIO(!page_poisoned, folio); /* * huge_pmd_unshare may unmap an entire PMD page. 
* There is no way of knowing exactly which PMDs may @@ -1606,7 +1615,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, /* Update high watermark before we lower rss */ update_hiwater_rss(mm); =20 - if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) { + if (page_poisoned && !(flags & TTU_IGNORE_HWPOISON)) { pteval =3D swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { hugetlb_count_sub(1UL << pvmw.pte_order, mm); @@ -1632,7 +1641,9 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); } else if (folio_test_anon(folio)) { - swp_entry_t entry =3D { .val =3D page_private(subpage) }; + swp_entry_t entry =3D { + .val =3D page_private(page_flags_page) + }; pte_t swp_pte; /* * Store the swap location in the pte. @@ -1822,7 +1833,7 @@ static bool try_to_migrate_one(struct folio *folio, s= truct vm_area_struct *vma, struct mm_struct *mm =3D vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; - struct page *subpage; + struct page *subpage, *page_flags_page; bool anon_exclusive, ret =3D true; struct mmu_notifier_range range; enum ttu_flags flags =3D (enum ttu_flags)(long)arg; @@ -1902,9 +1913,16 @@ static bool try_to_migrate_one(struct folio *folio, = struct vm_area_struct *vma, subpage =3D folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio)); } + /* + * We check the page flags of HugeTLB pages by checking the + * head page. + */ + page_flags_page =3D folio_test_hugetlb(folio) + ? &folio->page + : subpage; address =3D pvmw.address; anon_exclusive =3D folio_test_anon(folio) && - PageAnonExclusive(subpage); + PageAnonExclusive(page_flags_page); =20 if (folio_test_hugetlb(folio)) { bool anon =3D folio_test_anon(folio); @@ -2023,7 +2041,7 @@ static bool try_to_migrate_one(struct folio *folio, s= truct vm_area_struct *vma, * No need to invalidate here it will synchronize on * against the special swap migration pte. 
*/ - } else if (PageHWPoison(subpage)) { + } else if (PageHWPoison(page_flags_page)) { pteval =3D swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { hugetlb_count_sub(1L << pvmw.pte_order, mm); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 38D80C636D6 for ; Sat, 18 Feb 2023 00:30:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230229AbjBRAas (ORCPT ); Fri, 17 Feb 2023 19:30:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42054 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229647AbjBRA3y (ORCPT ); Fri, 17 Feb 2023 19:29:54 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0224F6BDF1 for ; Fri, 17 Feb 2023 16:29:12 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id l206-20020a25ccd7000000b006fdc6aaec4fso2657262ybf.20 for ; Fri, 17 Feb 2023 16:29:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=mRwFktkeJYIdKMusf4qC81y0TR6AbHEOFSy4VjO6bDw=; b=Zm6/7/QaEOKxvrDuMfeB7vwmF4lLu6UTGBUmswTJcUAlfDSMlIReG5bhg0wTj24gmC pyzKX/4iSeu0TJCnoaa5IeZeAwpPJP4t7J15FYcDft0GCDUyIekB7QiiA5d6VpYF6r1H nOxlRQfswvR4QTg+iFofMR17/iZTfwCKWd+Emo/OWhpb51k0Epcgf3/x2g21X+2fJvir E6KpK03FxgLoBdBKPFmKqjWGZ7c7Mw5BSnv8qT+zAW5LhNVyj/X7uCXj4k6ZrqZKQzWs jRtMwhX56IYmGhi9pxW4bL/oSI6T41SeQ+p8o0DKCkHwnC/wa7X1kwvdzUJ3ryjcOdKA 7yIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=mRwFktkeJYIdKMusf4qC81y0TR6AbHEOFSy4VjO6bDw=; b=6zuD6+cTjnAo9ZOucMhX7KQzXA0lrc+Pnzkkx1zo3zjGhW1v7vu1oOwz8luUMzo0Bb cguKoK10ttoPykd2z8k9mCb3jN359dzrDSreopHDt75jpgXqnzvBodDmNOWTRzFnaFr+ UctmflXE5c+ImJdEpe87yKhlYAouXYQ9ZqYT2kNqORuLOZLD3woHdoIdYuOWaO6gDOVn QaT7R9IpdoZoFn8aKKEOg4vCifyMEhweAAtSA5I+Ai+yT4tbmk5oQ7UPCg7Q9Ub1Yd7W 1PVa62HePTkLW/EmQ/n+94m2ZIIwDyBpB2asqkJRbjR20LRvgLHvTLjdI54BeRRgyh9u s5mQ== X-Gm-Message-State: AO0yUKUGwLrspjBtDwxF0DEKMKyEUUMMZx13z6ehxrJ1SMLn+OE66lYv jEHshnsE11LUSQ+iJjUP3LJp1D1N1Hg7Tctx X-Google-Smtp-Source: AK7set8dwJ3KQ5pRHgFDkybx18bdI4wSFLevnVCzXA1Or4MzQVqGPdFvHJZtBGks9A5Y7+5HxTvdx/c0MRnYUx9S X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:1103:b0:8ed:3426:8a69 with SMTP id o3-20020a056902110300b008ed34268a69mr91121ybu.1.1676680152644; Fri, 17 Feb 2023 16:29:12 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:02 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-30-jthoughton@google.com> Subject: [PATCH v2 29/46] hugetlb: update page_vma_mapped to do high-granularity walks From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish 
Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update the HugeTLB logic to look a lot more like the PTE-mapped THP logic. When a user calls us in a loop, we will update pvmw->address to walk to each page table entry that could possibly map the hugepage containing pvmw->pfn. Make use of the new pte_order so callers know what size PTE they're getting. The !pte failure case is changed to call not_found() instead of just returning false. This should be a no-op, but if somehow the hstate-level PTE were deallocated between iterations, not_found() should be called to drop locks. Signed-off-by: James Houghton diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 08295b122ad6..03e8a4987272 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -133,7 +133,8 @@ static void step_forward(struct page_vma_mapped_walk *p= vmw, unsigned long size) * * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pt= e point * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is - * adjusted if needed (for PTE-mapped THPs). + * adjusted if needed (for PTE-mapped THPs and high-granularity-mapped Hug= eTLB + * pages). * * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped p= age * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk(= ) in @@ -165,23 +166,47 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk= *pvmw) =20 if (unlikely(is_vm_hugetlb_page(vma))) { struct hstate *hstate =3D hstate_vma(vma); - unsigned long size =3D huge_page_size(hstate); - /* The only possible mapping was handled on last iteration */ - if (pvmw->pte) - return not_found(pvmw); - /* - * All callers that get here will already hold the - * i_mmap_rwsem. Therefore, no additional locks need to be - * taken before calling hugetlb_walk(). - */ - pvmw->pte =3D hugetlb_walk(vma, pvmw->address, size); - if (!pvmw->pte) - return false; + struct hugetlb_pte hpte; + pte_t pteval; + + end =3D (pvmw->address & huge_page_mask(hstate)) + + huge_page_size(hstate); + + do { + if (pvmw->pte) { + if (pvmw->ptl) + spin_unlock(pvmw->ptl); + pvmw->ptl =3D NULL; + pvmw->address +=3D PAGE_SIZE << pvmw->pte_order; + if (pvmw->address >=3D end) + return not_found(pvmw); + } =20 - pvmw->pte_order =3D huge_page_order(hstate); - pvmw->ptl =3D huge_pte_lock(hstate, mm, pvmw->pte); - if (!check_pte(pvmw)) - return not_found(pvmw); + /* + * All callers that get here will already hold the + * i_mmap_rwsem. Therefore, no additional locks need to + * be taken before calling hugetlb_walk(). + */ + if (hugetlb_full_walk(&hpte, vma, pvmw->address)) + return not_found(pvmw); + +retry: + pvmw->pte =3D hpte.ptep; + pvmw->pte_order =3D hpte.shift - PAGE_SHIFT; + pvmw->ptl =3D hugetlb_pte_lock(&hpte); + pteval =3D huge_ptep_get(hpte.ptep); + if (pte_present(pteval) && !hugetlb_pte_present_leaf( + &hpte, pteval)) { + /* + * Someone split from under us, so keep + * walking. 
+ */ + spin_unlock(pvmw->ptl); + hugetlb_full_walk_continue(&hpte, vma, + pvmw->address); + goto retry; + } + } while (!check_pte(pvmw)); return true; } =20 --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E74D0C636D6 for ; Sat, 18 Feb 2023 00:31:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230283AbjBRAbN (ORCPT ); Fri, 17 Feb 2023 19:31:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43648 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229563AbjBRAaJ (ORCPT ); Fri, 17 Feb 2023 19:30:09 -0500 Received: from mail-ua1-x949.google.com (mail-ua1-x949.google.com [IPv6:2607:f8b0:4864:20::949]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D0CED6ABE5 for ; Fri, 17 Feb 2023 16:29:23 -0800 (PST) Received: by mail-ua1-x949.google.com with SMTP id v19-20020ab02013000000b0068b9f3e0a2dso625893uak.6 for ; Fri, 17 Feb 2023 16:29:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1676680154; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6R+bM0KtFeTkcEJ8SgP6306LOQusudAcElSasV+tkuA=; b=dX7/40KBqq/CpQhiOpkAQIzPNluCgT+pqoGZRKlN9aa8QluovsQruDCu/wh6+Xi7HL 6kLXZgl+5XKDk7FsMYGYtr2yyFU9erZiSo7V2dyAwIFPJO5DAOooAQa2F7w+5zW0Hqdu QubYMhLcbqNzMIVdTubD/VkEI82Q/CwfufnycIJ0sLbdRiI6AathfyY49TSn1Ly/A6+9 +4K1wirIYyv3shJT02BYKt93mVx2ZH0jSU1+ZgamxSym8xZo1mO/BLj8U+ZTd86LEOwb MxBJulRTBsOmTowXosHxX9g5jLuhbpjY1XM7qMG0aeLiOSSYz9qPT0kULRTK/B6ula9l hMbw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1676680154; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6R+bM0KtFeTkcEJ8SgP6306LOQusudAcElSasV+tkuA=; b=cJrD3+hq5WY0ANbp4ZCwIe9zpPqlv5jbfkEpLrdx0T1gsqDYCwc0hhmruYOFTLI7Pm A59UPSHYbSEHIv7tGF0oFVITc28Fsrd53io/Tm7WKyCr7bDbvaXcXIDrmtDMjKmsX3vg OpDcGRnAOxxMhFcj8tQgPQXKvKjcpjWJ8BIfZ5vyJMEx6bid1kePz93SHiraPDG7pHYl 6cv1Pwf/NeKNIROZBqwb/2ZI993aXIoLFzsE3/n8Bkt9MN9mQVcVu3/TG9VzntpH+Ycp SQgOrwDn4pWCDGUcBOXtzN7hvZJB0iCm8kLAyn7xuBWFdUEYiZFPTQv98eFhPMw+9OSq N7rg== X-Gm-Message-State: AO0yUKVKb5T3f9fYJiri3zAP/TCYorE8uaLUhO9ayGkdMrPR9JOFnp0+ bvKebIu8p5+axqhIfwhjspGyINs5HM9GZ8pS X-Google-Smtp-Source: AK7set8ydXZ+Dciv1+JvuwKHd4m9xN+HXoJNqOMVwMnv6hYgnClrworMO3PvsXyUer/h8BcS4rjrDLd27tp3RFfQ X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a1f:a041:0:b0:401:7fe9:ff7f with SMTP id j62-20020a1fa041000000b004017fe9ff7fmr213533vke.5.1676680153948; Fri, 17 Feb 2023 16:29:13 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:03 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-31-jthoughton@google.com> Subject: [PATCH v2 30/46] hugetlb: add high-granularity migration support From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" To prevent queueing a hugepage for migration multiple times, we use last_folio to keep track of the last page we saw in queue_pages_hugetlb, and if the page we're looking at is last_folio, then we skip it. For the non-hugetlb cases, last_folio, although unused, is still updated so that it has a consistent meaning with the hugetlb case. Signed-off-by: James Houghton diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 3a451b7afcb3..6ef80763e629 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -68,6 +68,8 @@ =20 static inline bool is_pfn_swap_entry(swp_entry_t entry); =20 +struct hugetlb_pte; + /* Clear all flags but only keep swp_entry_t related information */ static inline pte_t pte_swp_clear_flags(pte_t pte) { @@ -339,7 +341,8 @@ extern void migration_entry_wait(struct mm_struct *mm, = pmd_t *pmd, #ifdef CONFIG_HUGETLB_PAGE extern void __migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep, spinlock_t *ptl); -extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *p= te); +extern void migration_entry_wait_huge(struct vm_area_struct *vma, + struct hugetlb_pte *hpte); #endif /* CONFIG_HUGETLB_PAGE */ #else /* CONFIG_MIGRATION */ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) @@ -369,7 +372,8 @@ static inline void migration_entry_wait(struct mm_struc= t *mm, pmd_t *pmd, #ifdef CONFIG_HUGETLB_PAGE static inline void __migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep, spinlock_t *ptl) { } -static inline void migration_entry_wait_huge(struct vm_area_struct *vma, p= te_t *pte) { } +static inline void migration_entry_wait_huge(struct vm_area_struct *vma, + struct hugetlb_pte *hpte) { } #endif /* CONFIG_HUGETLB_PAGE */ static inline int is_writable_migration_entry(swp_entry_t entry) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 86cd51beb02c..39f541b4a0a8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6418,7 +6418,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, * be released there. */ mutex_unlock(&hugetlb_fault_mutex_table[hash]); - migration_entry_wait_huge(vma, hpte.ptep); + migration_entry_wait_huge(vma, &hpte); return 0; } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) ret =3D VM_FAULT_HWPOISON_LARGE | diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0f91be88392b..43e210181cce 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -424,6 +424,7 @@ struct queue_pages { unsigned long start; unsigned long end; struct vm_area_struct *first; + struct folio *last_folio; }; =20 /* @@ -475,6 +476,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl= , unsigned long addr, flags =3D qp->flags; /* go to folio migration */ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { + qp->last_folio =3D folio; if (!vma_migratable(walk->vma) || migrate_folio_add(folio, qp->pagelist, flags)) { ret =3D 1; @@ -539,6 +541,8 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned = long addr, break; } =20 + qp->last_folio =3D folio; + /* * Do not abort immediately since there may be * temporary off LRU pages in the range. 
Still @@ -570,15 +574,22 @@ static int queue_folios_hugetlb(struct hugetlb_pte *h= pte, spinlock_t *ptl; pte_t entry; =20 - /* We don't migrate high-granularity HugeTLB mappings for now. */ - if (hugetlb_hgm_enabled(walk->vma)) - return -EINVAL; - ptl =3D hugetlb_pte_lock(hpte); entry =3D huge_ptep_get(hpte->ptep); if (!pte_present(entry)) goto unlock; - folio =3D pfn_folio(pte_pfn(entry)); + + if (!hugetlb_pte_present_leaf(hpte, entry)) { + ret =3D -EAGAIN; + goto unlock; + } + + folio =3D page_folio(pte_page(entry)); + + /* We already queued this page with another high-granularity PTE. */ + if (folio =3D=3D qp->last_folio) + goto unlock; + if (!queue_folio_required(folio, qp)) goto unlock; =20 @@ -747,6 +758,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long s= tart, unsigned long end, .start =3D start, .end =3D end, .first =3D NULL, + .last_folio =3D NULL, }; =20 err =3D walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp); diff --git a/mm/migrate.c b/mm/migrate.c index 616afcc40fdc..b26169990532 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -196,6 +196,9 @@ static bool remove_migration_pte(struct folio *folio, /* pgoff is invalid for ksm pages, but they are never large */ if (folio_test_large(folio) && !folio_test_hugetlb(folio)) idx =3D linear_page_index(vma, pvmw.address) - pvmw.pgoff; + else if (folio_test_hugetlb(folio)) + idx =3D (pvmw.address & ~huge_page_mask(hstate_vma(vma)))/ + PAGE_SIZE; new =3D folio_page(folio, idx); =20 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION @@ -247,14 +250,16 @@ static bool remove_migration_pte(struct folio *folio, =20 #ifdef CONFIG_HUGETLB_PAGE if (folio_test_hugetlb(folio)) { + struct page *hpage =3D folio_page(folio, 0); unsigned int shift =3D pvmw.pte_order + PAGE_SHIFT; =20 pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); if (folio_test_anon(folio)) - hugepage_add_anon_rmap(new, vma, pvmw.address, + hugepage_add_anon_rmap(hpage, vma, pvmw.address, rmap_flags); else - page_add_file_rmap(new, vma, true); + hugetlb_add_file_rmap(new, shift, + hstate_vma(vma), vma); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } else #endif @@ -270,7 +275,7 @@ static bool remove_migration_pte(struct folio *folio, mlock_drain_local(); =20 trace_remove_migration_pte(pvmw.address, pte_val(pte), - compound_order(new)); + pvmw.pte_order); =20 /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); @@ -361,12 +366,10 @@ void __migration_entry_wait_huge(struct vm_area_struc= t *vma, } } =20 -void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) +void migration_entry_wait_huge(struct vm_area_struct *vma, + struct hugetlb_pte *hpte) { - spinlock_t *ptl =3D huge_pte_lockptr(huge_page_shift(hstate_vma(vma)), - vma->vm_mm, pte); - - __migration_entry_wait_huge(vma, pte, ptl); + __migration_entry_wait_huge(vma, hpte->ptep, hpte->ptl); } #endif =20 --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 410E1C05027 for ; Sat, 18 Feb 2023 00:31:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230136AbjBRAbQ (ORCPT ); Fri, 17 Feb 2023 19:31:16 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43086 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S229577AbjBRAaJ (ORCPT ); Fri, 17 Feb 2023 19:30:09 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E47786ABE8 for ; Fri, 17 Feb 2023 16:29:23 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id o137-20020a25418f000000b009419f64f6afso2165216yba.2 for ; Fri, 17 Feb 2023 16:29:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=h5QeZmWStm8MBBmvLyhf4o+hksgZUS71zRGtmoeJ5N4=; b=iqrumGyh+1DXgn+jsoYD5tlyxEnhTKa0RiDswgYlulz2R+ePcUkS4hk7pJr1l5TN4Z t4OGyf3Ee88VCdy7Q/P2cmEh2dAWdOVlOvBP1o2vJ8WyqBhGxkkoh3k8/Itb4ZdbmqKm cc2rGHBwWYUHY+CQ/nhxnlepr/hLoBQcN2mefhb5d+Skxpoc3mQ6YYqwqicwAf3fMJi9 IbApGiC57N5YDkCsKcP9870idXAalXkC8/897RuNBgRUzMFJC0wrZjC6bZRbVALKQPA4 KUFkbVYqdOLVPYutCh7Ev7uHgUEdcvwEGi0C/ONTXZeJkJHpEuUFB85nzCIxbQ0alG9k tAvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=h5QeZmWStm8MBBmvLyhf4o+hksgZUS71zRGtmoeJ5N4=; b=HqcE7V/mOynNTi4Zfix8nkvSx5PtsBEOrGTWEu4TQXqYEzEp8U4xkROeLkztMjHf1p yIjMDZc5pzw3j06TNkFDRU7UWKEFImW8MEUDn4rf4KftV26Wc6DRBusyxRHmzVOn0jKg oYDxGhBbhLkXcQaJZTtKufEnYdFhOTbM6E2Y0OEZ0GwekjXcimyMYNx7P2ZbRYaEVG8V 8QlC6SbpWmu1LJ2HYo7Vqo2k6mtypnPqy30gV4k6iIlgHvsYs16uWyaMh3/O/nh75Hnw aGk/F60D7gKRBtImVR+HfkpKYokBaqxpq74/9j4+qLA9VkC2jFSnQQknJvVekYCdsSHD 3LHg== X-Gm-Message-State: AO0yUKV7EV/bGmkUlOhWyLNUIbi2MqZPsiU0cLnY1o5BJUkgQrY2EtZ3 6VujJs9X1Hx71/sFrt3okEuSjqBggfrUlxV1 X-Google-Smtp-Source: AK7set8cs/bMSPkcig+yHBN2nUuLXJMfaVz/DRZ9ytGQCyW0wyjGaQbg0OHjwkIcX38+L9v/Io1eBcNX9GOn9imw X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a25:e211:0:b0:872:465e:2cbf with SMTP id h17-20020a25e211000000b00872465e2cbfmr1298716ybe.264.1676680154885; Fri, 17 Feb 2023 16:29:14 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:04 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-32-jthoughton@google.com> Subject: [PATCH v2 31/46] hugetlb: sort hstates in hugetlb_init_hstates From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When using HugeTLB high-granularity mapping, we need to go through the supported hugepage sizes in decreasing order so that we pick the largest size that works. Consider the case where we're faulting in a 1G hugepage for the first time: we want hugetlb_fault/hugetlb_no_page to map it with a PUD. By going through the sizes in decreasing order, we will find that PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too. 
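
[Editor's illustration: the selection order described above, as a standalone userspace-style C sketch. This is not the kernel code; the size table and helper are hypothetical stand-ins for the sorted hstates, showing why walking from largest to smallest and taking the first size that is aligned and fits gives the PUD mapping on a fresh 1G fault.]

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static const uint64_t sizes_desc[] = {
	1ULL << 30,	/* 1G (PUD) */
	1ULL << 21,	/* 2M (PMD) */
	1ULL << 12,	/* 4K (PTE) */
};

/* Pick the largest size that is aligned at @start and still ends by @end. */
static uint64_t largest_fitting_size(uint64_t start, uint64_t end)
{
	size_t i;

	for (i = 0; i < sizeof(sizes_desc) / sizeof(sizes_desc[0]); i++) {
		uint64_t sz = sizes_desc[i];

		if (!(start & (sz - 1)) && start + sz <= end)
			return sz;
	}
	return 0;	/* nothing fits */
}

int main(void)
{
	/* Faulting at the start of an empty 1G region: the PUD size wins. */
	printf("%#llx\n", (unsigned long long)
	       largest_fitting_size(0x40000000ULL, 0x80000000ULL));
	/* A 4K-aligned address with only 4K of room: fall back to PAGE_SIZE. */
	printf("%#llx\n", (unsigned long long)
	       largest_fitting_size(0x40001000ULL, 0x40002000ULL));
	return 0;
}
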
This commit also changes bootmem hugepages from storing hstate pointers directly to storing the hstate sizes. The hstate pointers used for boot-time-allocated hugepages become invalid after we sort the hstates. `gather_bootmem_prealloc`, called after the hstates have been sorted, now converts the size to the correct hstate. Signed-off-by: James Houghton diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 2fe1eb6897d4..a344f9d9eba1 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -766,7 +766,7 @@ struct hstate { =20 struct huge_bootmem_page { struct list_head list; - struct hstate *hstate; + unsigned long hstate_sz; }; =20 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *lis= t); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 39f541b4a0a8..e20df8f6216e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -34,6 +34,7 @@ #include #include #include +#include =20 #include #include @@ -49,6 +50,10 @@ =20 int hugetlb_max_hstate __read_mostly; unsigned int default_hstate_idx; +/* + * After hugetlb_init_hstates is called, hstates will be sorted from large= st + * to smallest. + */ struct hstate hstates[HUGE_MAX_HSTATE]; =20 #ifdef CONFIG_CMA @@ -3464,7 +3469,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int n= id) /* Put them into a private list first because mem_map is not up yet */ INIT_LIST_HEAD(&m->list); list_add(&m->list, &huge_boot_pages); - m->hstate =3D h; + m->hstate_sz =3D huge_page_size(h); return 1; } =20 @@ -3479,7 +3484,7 @@ static void __init gather_bootmem_prealloc(void) list_for_each_entry(m, &huge_boot_pages, list) { struct page *page =3D virt_to_page(m); struct folio *folio =3D page_folio(page); - struct hstate *h =3D m->hstate; + struct hstate *h =3D size_to_hstate(m->hstate_sz); =20 VM_BUG_ON(!hstate_is_gigantic(h)); WARN_ON(folio_ref_count(folio) !=3D 1); @@ -3595,9 +3600,38 @@ static void __init hugetlb_hstate_alloc_pages(struct= hstate *h) kfree(node_alloc_noretry); } =20 +static int compare_hstates_decreasing(const void *a, const void *b) +{ + unsigned long sz_a =3D huge_page_size((const struct hstate *)a); + unsigned long sz_b =3D huge_page_size((const struct hstate *)b); + + if (sz_a < sz_b) + return 1; + if (sz_a > sz_b) + return -1; + return 0; +} + +static void sort_hstates(void) +{ + unsigned long default_hstate_sz =3D huge_page_size(&default_hstate); + + /* Sort from largest to smallest. */ + sort(hstates, hugetlb_max_hstate, sizeof(*hstates), + compare_hstates_decreasing, NULL); + + /* + * We may have changed the location of the default hstate, so we need to + * update it. 
+ */ + default_hstate_idx =3D hstate_index(size_to_hstate(default_hstate_sz)); +} + static void __init hugetlb_init_hstates(void) { - struct hstate *h, *h2; + struct hstate *h; + + sort_hstates(); =20 for_each_hstate(h) { /* oversize hugepages were init'ed in early boot */ @@ -3616,13 +3650,8 @@ static void __init hugetlb_init_hstates(void) continue; if (hugetlb_cma_size && h->order <=3D HUGETLB_PAGE_ORDER) continue; - for_each_hstate(h2) { - if (h2 =3D=3D h) - continue; - if (h2->order < h->order && - h2->order > h->demote_order) - h->demote_order =3D h2->order; - } + if (h - 1 >=3D &hstates[0]) + h->demote_order =3D huge_page_order(h - 1); } } =20 --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 11FE6C05027 for ; Sat, 18 Feb 2023 00:31:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230292AbjBRAbR (ORCPT ); Fri, 17 Feb 2023 19:31:17 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43652 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230169AbjBRAaJ (ORCPT ); Fri, 17 Feb 2023 19:30:09 -0500 Received: from mail-ua1-x94a.google.com (mail-ua1-x94a.google.com [IPv6:2607:f8b0:4864:20::94a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5F606B30E for ; Fri, 17 Feb 2023 16:29:24 -0800 (PST) Received: by mail-ua1-x94a.google.com with SMTP id x2-20020ab03802000000b0060d5bfd73b5so939645uav.16 for ; Fri, 17 Feb 2023 16:29:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Hf03eS1BobxnoyUk+0ZYrZ8zRMgrel2ztMkHa50OYSI=; b=n5krxTWva4Sf4Vsvk57VrejZS7qh83I29bbOukhXhjxnbyiLVwsIU1jYSB+TyjblCL MilLcyhHjSUyY+1YLmOg+OnBETPCIOw/L4b8Fi1wC+MKnO4VLbaVIbLLSIpqvEF5fKek S7PVzlegxhgRs1TdpdgnswFYQr0DAofh6bzKR/4aYfTbY2USEybj9EjCijALtRxrpFmY oRt8kPX/U/3EtMyNyG9s3xqsF3Jx65+txOMviZr70m7fuLxiVCeDc004dBLqVT+P2gLb aZ2Lhu7IFeKdLx2Du68oGZvPZtxzFiuZ9OH+JAdOUn2J1T26W3LFbmXb5sUZtm1q1FgI +6FA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Hf03eS1BobxnoyUk+0ZYrZ8zRMgrel2ztMkHa50OYSI=; b=SnaMBtiKGN2ANxXr/9z/eR0zlhMrtyddzCaFLYP8QdJrALxCJbhIhmYN67y2Rfmmhh ogRgLVT1PGQ52OkZnWU1ww/C2d1e2wLI7GJeltWpG3pK1tLJXQ8NDASKebT2x1kyykYt ankOLa4gPsfrr8nRVJ2qNg3xwW9TbqnhaO520tAESkV/fpfiEW68BBDzMsZFjx1Ffuwh v0LmWstW6vYAY6J30lhvyrIUHk41btfHNx+agnC+Fr2qge4eaas9+ARCYvmHEUqiYd/f covEkFUUWmsb3suqCrt5+FBw7u/73bJJnuYO2tBMUaMVQVszU0NOC1DFd81z+Qe8bEZh iRVQ== X-Gm-Message-State: AO0yUKXzI1IYAhT4szge27jnIfl0HYwNw4jEN2Qy1pYyuGApg42S5zO8 zU7uG42k4UqgOjRlDX81FasMKc9qNR0MzE5G X-Google-Smtp-Source: AK7set/xeVdMk4/mlhl6jkXC1OYI04ENP7/RQfinMSut9jH5SSiIHmeo9doCVF6zrjX21weL0A1G0FvBpBxmEp3t X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6102:356b:b0:415:48ce:8597 with SMTP id bh11-20020a056102356b00b0041548ce8597mr942711vsb.8.1676680155862; Fri, 17 Feb 2023 16:29:15 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:05 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 
References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-33-jthoughton@google.com> Subject: [PATCH v2 32/46] hugetlb: add for_each_hgm_shift From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is a helper macro to loop through all the usable page sizes for a high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will loop, in descending order, through the page sizes that HugeTLB supports for this architecture. It always includes PAGE_SIZE. This is done by looping through the hstates; however, there is no hstate for PAGE_SIZE. To handle this case, the loop intentionally goes out of bounds, and the out-of-bounds pointer is mapped to PAGE_SIZE. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e20df8f6216e..667e82b7a0ff 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7941,6 +7941,24 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma) { return vma && (vma->vm_flags & VM_HUGETLB_HGM); } +/* Should only be used by the for_each_hgm_shift macro. */ +static unsigned int __shift_for_hstate(struct hstate *h) +{ + /* If h is out of bounds, we have reached the end, so give PAGE_SIZE */ + if (h >=3D &hstates[hugetlb_max_hstate]) + return PAGE_SHIFT; + return huge_page_shift(h); +} + +/* + * Intentionally go out of bounds. An out-of-bounds hstate will be convert= ed to + * PAGE_SIZE. 
+ */ +#define for_each_hgm_shift(hstate, tmp_h, shift) \ + for ((tmp_h) =3D hstate; (shift) =3D __shift_for_hstate(tmp_h), \ + (tmp_h) <=3D &hstates[hugetlb_max_hstate]; \ + (tmp_h)++) + #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ =20 /* --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1B2BBC05027 for ; Sat, 18 Feb 2023 00:30:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230236AbjBRAaw (ORCPT ); Fri, 17 Feb 2023 19:30:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43052 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229902AbjBRA3z (ORCPT ); Fri, 17 Feb 2023 19:29:55 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 35B7F68ADE for ; Fri, 17 Feb 2023 16:29:17 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5365a2b9e4fso19060707b3.15 for ; Fri, 17 Feb 2023 16:29:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=wrKlKZ6ZakLkTbIJ1oX0i57iaJAjsrq7KWEmG7MzQmY=; b=mCbQikAQ3LLbb2MaLcCfnhff3PiBABnQwSQf9LWl30IUncwsLCnlc7QcDea1N/QLsC vfw0K2950rSrhIHrRGtEn7vPNXstlIlDJme+x9X1Tl4rCeL5TE1nmFxc78IZG7kUcuGZ gWnC/qKD/qWIT+WrgA2/XtwBBmqj4uwswcowjU0CsCy/gmEKNQpD5PEKpeYu26k0gax4 wFF1AM3BcKLwDBazsiY337gPGCphNPaNpVDV9c2Psll+/miIIyYrv4fm4pZwi/PDcjtw 80annwRMmYnG6AnPnD+aevt1DhBv/jq/mj8pS2S8t5MLt/H13U85E5EeFDuqgEKiranZ zR4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=wrKlKZ6ZakLkTbIJ1oX0i57iaJAjsrq7KWEmG7MzQmY=; b=BQqYQjQYsrYR78XhUJM+9WcpXXlCwoHZ50K+ougp4+iO7ha8WvWPNVv5w4zDwTI4GX f2k6+AXF9xNvVYrp1QEINkQruQB7RB9ZrVw1hCbOSO0fdKQGGuq6w1TM8rDn1aCXcB+Y Ke9/5igPuLCGgiQ0TkglXehG4UP2qMcd33zRk+7DvLYcAiSKcxE/JLUDToJ+DdyTxGbx w/0V6aI3wbwwaHvMv9/1EoJmK0F0oNDxzD/UsvR/PLqVReMjskrJ3w119uxXgvqHhbrN 8JIahQYps8AoNxwoWjNP22Xypa7LHwA613sZMN2zf59APYJsEVPGjn1SvqH+fp8DhbDj 4kZg== X-Gm-Message-State: AO0yUKWB+7ampr/f/F5D+bl/cwBQdBYcpgf7v2zNyko/ksXmuWxrnkxS 8AUSjiJohdhImxNzdGwFQvS/LaguEviyiPug X-Google-Smtp-Source: AK7set/ChT8w1lRpdLXETNdy5AqAngOnJfnNB3h58M/w44KB78rgBAvvKVVzRYFSwR4a8k+Uo0NqEoJrPiNJtx6Y X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:1024:b0:8fc:686c:cf87 with SMTP id x4-20020a056902102400b008fc686ccf87mr53605ybt.4.1676680156885; Fri, 17 Feb 2023 16:29:16 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:06 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-34-jthoughton@google.com> Subject: [PATCH v2 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry 
, "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Changes here are similar to the changes made for hugetlb_no_page. Pass vmf->real_address to userfaultfd_huge_must_wait because vmf->address may be rounded down to the hugepage size, and a high-granularity page table walk would look up the wrong PTE. Also change the call to userfaultfd_must_wait in the same way for consistency. This commit introduces hugetlb_alloc_largest_pte which is used to find the appropriate PTE size to map pages with UFFDIO_CONTINUE. When MADV_SPLIT is provided, page fault events will report PAGE_SIZE-aligned address instead of huge_page_size(h)-aligned addresses, regardless of if UFFD_FEATURE_EXACT_ADDRESS is used. Signed-off-by: James Houghton diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 44d1ee429eb0..bb30001b63ba 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -252,17 +252,17 @@ static inline bool userfaultfd_huge_must_wait(struct = userfaultfd_ctx *ctx, unsigned long flags, unsigned long reason) { - pte_t *ptep, pte; + pte_t pte; bool ret =3D true; + struct hugetlb_pte hpte; =20 mmap_assert_locked(ctx->mm); =20 - ptep =3D hugetlb_walk(vma, address, vma_mmu_pagesize(vma)); - if (!ptep) + if (hugetlb_full_walk(&hpte, vma, address)) goto out; =20 ret =3D false; - pte =3D huge_ptep_get(ptep); + pte =3D huge_ptep_get(hpte.ptep); =20 /* * Lockless access: we're in a wait_event so it's ok if it @@ -531,11 +531,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, uns= igned long reason) spin_unlock_irq(&ctx->fault_pending_wqh.lock); =20 if (!is_vm_hugetlb_page(vma)) - must_wait =3D userfaultfd_must_wait(ctx, vmf->address, vmf->flags, - reason); + must_wait =3D userfaultfd_must_wait(ctx, vmf->real_address, + vmf->flags, reason); else must_wait =3D userfaultfd_huge_must_wait(ctx, vma, - vmf->address, + vmf->real_address, vmf->flags, reason); if (is_vm_hugetlb_page(vma)) hugetlb_vma_unlock_read(vma); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a344f9d9eba1..e0e51bb06112 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -201,7 +201,8 @@ unsigned long hugetlb_total_pages(void); vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); #ifdef CONFIG_USERFAULTFD -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, + struct hugetlb_pte *dst_hpte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -1272,16 +1273,31 @@ static inline enum hugetlb_level hpage_size_to_leve= l(unsigned long sz) =20 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING bool hugetlb_hgm_enabled(struct vm_area_struct *vma); +bool hugetlb_hgm_advised(struct vm_area_struct *vma); bool hugetlb_hgm_eligible(struct vm_area_struct *vma); +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *= mm, + struct vm_area_struct *vma, unsigned long start, + unsigned long end); #else static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma) { return false; } +static inline bool hugetlb_hgm_advised(struct vm_area_struct *vma) +{ + return false; +} static 
inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma) { return false; } +static inline +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *= mm, + struct vm_area_struct *vma, unsigned long start, + unsigned long end) +{ + return -EINVAL; +} #endif =20 static inline spinlock_t *huge_pte_lock(struct hstate *h, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 667e82b7a0ff..a00b4ac07046 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6083,9 +6083,15 @@ static inline vm_fault_t hugetlb_handle_userfault(st= ruct vm_area_struct *vma, unsigned long reason) { u32 hash; + /* + * Don't use the hpage-aligned address if the user has explicitly + * enabled HGM. + */ + bool round_to_pagesize =3D hugetlb_hgm_advised(vma) && + reason =3D=3D VM_UFFD_MINOR; struct vm_fault vmf =3D { .vma =3D vma, - .address =3D haddr, + .address =3D round_to_pagesize ? addr & PAGE_MASK : haddr, .real_address =3D addr, .flags =3D flags, =20 @@ -6569,7 +6575,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, * modifications for huge pages. */ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, - pte_t *dst_pte, + struct hugetlb_pte *dst_hpte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -6580,13 +6586,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, bool is_continue =3D (mode =3D=3D MCOPY_ATOMIC_CONTINUE); struct hstate *h =3D hstate_vma(dst_vma); struct address_space *mapping =3D dst_vma->vm_file->f_mapping; - pgoff_t idx =3D vma_hugecache_offset(h, dst_vma, dst_addr); + unsigned long haddr =3D dst_addr & huge_page_mask(h); + pgoff_t idx =3D vma_hugecache_offset(h, dst_vma, haddr); unsigned long size; int vm_shared =3D dst_vma->vm_flags & VM_SHARED; pte_t _dst_pte; spinlock_t *ptl; int ret =3D -ENOMEM; struct folio *folio; + struct page *subpage; int writable; bool folio_in_pagecache =3D false; =20 @@ -6601,12 +6609,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, * a non-missing case. Return -EEXIST. */ if (vm_shared && - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + hugetlbfs_pagecache_present(h, dst_vma, haddr)) { ret =3D -EEXIST; goto out; } =20 - folio =3D alloc_hugetlb_folio(dst_vma, dst_addr, 0); + folio =3D alloc_hugetlb_folio(dst_vma, haddr, 0); if (IS_ERR(folio)) { ret =3D -ENOMEM; goto out; @@ -6622,13 +6630,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, /* Free the allocated folio which may have * consumed a reservation. */ - restore_reserve_on_error(h, dst_vma, dst_addr, folio); + restore_reserve_on_error(h, dst_vma, haddr, folio); folio_put(folio); =20 /* Allocate a temporary folio to hold the copied * contents. 
*/ - folio =3D alloc_hugetlb_folio_vma(h, dst_vma, dst_addr); + folio =3D alloc_hugetlb_folio_vma(h, dst_vma, haddr); if (!folio) { ret =3D -ENOMEM; goto out; @@ -6642,14 +6650,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, } } else { if (vm_shared && - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + hugetlbfs_pagecache_present(h, dst_vma, haddr)) { put_page(*pagep); ret =3D -EEXIST; *pagep =3D NULL; goto out; } =20 - folio =3D alloc_hugetlb_folio(dst_vma, dst_addr, 0); + folio =3D alloc_hugetlb_folio(dst_vma, haddr, 0); if (IS_ERR(folio)) { put_page(*pagep); ret =3D -ENOMEM; @@ -6697,7 +6705,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, folio_in_pagecache =3D true; } =20 - ptl =3D huge_pte_lock(h, dst_mm, dst_pte); + ptl =3D hugetlb_pte_lock(dst_hpte); =20 ret =3D -EIO; if (folio_test_hwpoison(folio)) @@ -6709,11 +6717,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, * page backing it, then access the page. */ ret =3D -EEXIST; - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) + if (!huge_pte_none_mostly(huge_ptep_get(dst_hpte->ptep))) goto out_release_unlock; =20 + subpage =3D hugetlb_find_subpage(h, folio, dst_addr); + if (folio_in_pagecache) - page_add_file_rmap(&folio->page, dst_vma, true); + hugetlb_add_file_rmap(subpage, dst_hpte->shift, h, dst_vma); else hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr); =20 @@ -6726,7 +6736,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, else writable =3D dst_vma->vm_flags & VM_WRITE; =20 - _dst_pte =3D make_huge_pte(dst_vma, &folio->page, writable); + _dst_pte =3D make_huge_pte_with_shift(dst_vma, subpage, writable, + dst_hpte->shift); /* * Always mark UFFDIO_COPY page dirty; note that this may not be * extremely important for hugetlbfs for now since swapping is not @@ -6739,12 +6750,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_= mm, if (wp_copy) _dst_pte =3D huge_pte_mkuffd_wp(_dst_pte); =20 - set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte); =20 - hugetlb_count_add(pages_per_huge_page(h), dst_mm); + hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm); =20 /* No need to invalidate - it was non-present before */ - update_mmu_cache(dst_vma, dst_addr, dst_pte); + update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep); =20 spin_unlock(ptl); if (!is_continue) @@ -7941,6 +7952,18 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma) { return vma && (vma->vm_flags & VM_HUGETLB_HGM); } +bool hugetlb_hgm_advised(struct vm_area_struct *vma) +{ + /* + * Right now, the only way for HGM to be enabled is if a user + * explicitly enables it via MADV_SPLIT, but in the future, there + * may be cases where it gets enabled automatically. + * + * Provide hugetlb_hgm_advised() now for call sites where care that the + * user explicitly enabled HGM. + */ + return hugetlb_hgm_enabled(vma); +} /* Should only be used by the for_each_hgm_shift macro. */ static unsigned int __shift_for_hstate(struct hstate *h) { @@ -7959,6 +7982,38 @@ static unsigned int __shift_for_hstate(struct hstate= *h) (tmp_h) <=3D &hstates[hugetlb_max_hstate]; \ (tmp_h)++) =20 +/* + * Find the HugeTLB PTE that maps as much of [start, end) as possible with= a + * single page table entry. It is returned in @hpte. 
+ */ +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *= mm, + struct vm_area_struct *vma, unsigned long start, + unsigned long end) +{ + struct hstate *h =3D hstate_vma(vma), *tmp_h; + unsigned int shift; + unsigned long sz; + int ret; + + for_each_hgm_shift(h, tmp_h, shift) { + sz =3D 1UL << shift; + + if (!IS_ALIGNED(start, sz) || start + sz > end) + continue; + goto found; + } + return -EINVAL; +found: + ret =3D hugetlb_full_walk_alloc(hpte, vma, start, sz); + if (ret) + return ret; + + if (hpte->shift > shift) + return -EEXIST; + + return 0; +} + #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ =20 /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 53c3d916ff66..b56bc12f600e 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -320,14 +320,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb= (struct mm_struct *dst_mm, { int vm_shared =3D dst_vma->vm_flags & VM_SHARED; ssize_t err; - pte_t *dst_pte; unsigned long src_addr, dst_addr; long copied; struct page *page; - unsigned long vma_hpagesize; + unsigned long vma_hpagesize, target_pagesize; pgoff_t idx; u32 hash; struct address_space *mapping; + bool use_hgm =3D hugetlb_hgm_advised(dst_vma) && + mode =3D=3D MCOPY_ATOMIC_CONTINUE; + struct hstate *h =3D hstate_vma(dst_vma); =20 /* * There is no default zero huge page for all huge page sizes as @@ -345,12 +347,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb= (struct mm_struct *dst_mm, copied =3D 0; page =3D NULL; vma_hpagesize =3D vma_kernel_pagesize(dst_vma); + target_pagesize =3D use_hgm ? PAGE_SIZE : vma_hpagesize; =20 /* - * Validate alignment based on huge page size + * Validate alignment based on the targeted page size. */ err =3D -EINVAL; - if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1)) + if (dst_start & (target_pagesize - 1) || len & (target_pagesize - 1)) goto out_unlock; =20 retry: @@ -381,13 +384,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb= (struct mm_struct *dst_mm, } =20 while (src_addr < src_start + len) { + struct hugetlb_pte hpte; BUG_ON(dst_addr >=3D dst_start + len); =20 /* * Serialize via vma_lock and hugetlb_fault_mutex. - * vma_lock ensures the dst_pte remains valid even - * in the case of shared pmds. fault mutex prevents - * races with other faulting threads. + * vma_lock ensures the hpte.ptep remains valid even + * in the case of shared pmds and page table collapsing. + * fault mutex prevents races with other faulting threads. 
*/ idx =3D linear_page_index(dst_vma, dst_addr); mapping =3D dst_vma->vm_file->f_mapping; @@ -395,23 +399,28 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb= (struct mm_struct *dst_mm, mutex_lock(&hugetlb_fault_mutex_table[hash]); hugetlb_vma_lock_read(dst_vma); =20 - err =3D -ENOMEM; - dst_pte =3D huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize); - if (!dst_pte) { + if (use_hgm) + err =3D hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma, + dst_addr, + dst_start + len); + else + err =3D hugetlb_full_walk_alloc(&hpte, dst_vma, dst_addr, + vma_hpagesize); + if (err) { hugetlb_vma_unlock_read(dst_vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); goto out_unlock; } =20 if (mode !=3D MCOPY_ATOMIC_CONTINUE && - !huge_pte_none_mostly(huge_ptep_get(dst_pte))) { + !huge_pte_none_mostly(huge_ptep_get(hpte.ptep))) { err =3D -EEXIST; hugetlb_vma_unlock_read(dst_vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); goto out_unlock; } =20 - err =3D hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, + err =3D hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma, dst_addr, src_addr, mode, &page, wp_copy); =20 @@ -423,6 +432,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(s= truct mm_struct *dst_mm, if (unlikely(err =3D=3D -ENOENT)) { mmap_read_unlock(dst_mm); BUG_ON(!page); + WARN_ON_ONCE(hpte.shift !=3D huge_page_shift(h)); =20 err =3D copy_huge_page_from_user(page, (const void __user *)src_addr, @@ -440,9 +450,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(s= truct mm_struct *dst_mm, BUG_ON(page); =20 if (!err) { - dst_addr +=3D vma_hpagesize; - src_addr +=3D vma_hpagesize; - copied +=3D vma_hpagesize; + dst_addr +=3D hugetlb_pte_size(&hpte); + src_addr +=3D hugetlb_pte_size(&hpte); + copied +=3D hugetlb_pte_size(&hpte); =20 if (fatal_signal_pending(current)) err =3D -EINTR; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0BF52C05027 for ; Sat, 18 Feb 2023 00:31:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229960AbjBRAb2 (ORCPT ); Fri, 17 Feb 2023 19:31:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43108 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230148AbjBRAaM (ORCPT ); Fri, 17 Feb 2023 19:30:12 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92C5A6A073 for ; Fri, 17 Feb 2023 16:29:31 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id y33-20020a25ad21000000b00953ffdfbe1aso2197142ybi.23 for ; Fri, 17 Feb 2023 16:29:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=H5Ss3Q8p6Tm/C0evsWap3MC9xTDp36cPJKkNtoaVKGQ=; b=PyoCsPHZKLuSeDOLvIw+lkPHKwn65TPzd0CfsjOfssDvwh0+NtYbibsVfgVsSmNssl qradQv+EgXsSDZiodu5lljG1t2o7w6TQ698wIv5JgVXeUBOB3Wnz/Bcq5yevJLKg+nve VjJMx+Txfk3k/LgV2w2hnh91g802PdgQo2AseWuhEWbpXHvGj7DQ7qL06+fGAKkwmyeX eLgkDxIvUSbfleeEETs9PQLkB3WeTXxJkEgHYOVpxDZGz96scTefQeiZRq4DuIrveLRP xg9PkBMYr1jykjB8E7MRM4w0AuZAeJ4hrxxNWXPe2GYe6/W0zfWOPuKMd0nX/tiporx7 sscg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=H5Ss3Q8p6Tm/C0evsWap3MC9xTDp36cPJKkNtoaVKGQ=; b=DCv3v9qUbqnOBwm9qYfSZ2sB8aFXYQT/35Kswiqe0Ez7FplF9Mj052MCrOJlJ7iSCs CMkmCzKUR64pzuFEFnDE2VRF2RnZIXDuvOGXMKM3/gE0s/B8NGAe/TsUhHgxSm58BLQv DF5xB/19y37frKRva10Y3ABSWL3VTO08mg5qPCXXpOFEMG8lsiqwta6DphEIk7Vl0Psh cFYeNSS57o0/4FghkjEvRwO9BpQtQtvrO5GJ9cqd4jNhs5wRJpJJbhKeqVKsG7jtxBwS FykV3yZhQZ8tr6GD25HCTkTWPb4Lhp1/Zx3waSQpG6Mry6aIQmLqPn49U/5F/cSwdI38 ekvw== X-Gm-Message-State: AO0yUKVHQ1lkRVjjfTZVU6xRhxOBCSSucoSFd0IyTaIvTdSMORF9GpnL GUoU9YOH2f97fWiELPQMYMuewzsH7uwJbfs7 X-Google-Smtp-Source: AK7set9/STmUDdPiVjVMlyX9HVfyhKJ5jo2Gv2WTisaTY68C7uxLi9/y0Dvn93Bgqo0KqJ6aTwnqpO+gxhHqoPwi X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:107:b0:914:1ef3:e98a with SMTP id o7-20020a056902010700b009141ef3e98amr168149ybh.213.1676680158302; Fri, 17 Feb 2023 16:29:18 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:07 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-35-jthoughton@google.com> Subject: [PATCH v2 34/46] hugetlb: add MADV_COLLAPSE for hugetlb From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is a necessary extension to the UFFDIO_CONTINUE changes. When userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the kernel has no mechanism to automatically collapse the page table to map the whole hugepage normally. We require userspace to inform us that they would like the mapping to be collapsed; they do this with MADV_COLLAPSE. If userspace has not mapped all of a hugepage with UFFDIO_CONTINUE, but only some, hugetlb_collapse will cause the requested range to be mapped as if it were UFFDIO_CONTINUE'd already. The effects of any UFFDIO_WRITEPROTECT calls may be undone by a call to MADV_COLLAPSE for intersecting address ranges. This commit is co-opting the same madvise mode that has been introduced to synchronously collapse THPs. The function that does THP collapsing has been renamed to madvise_collapse_thp. As with the rest of the high-granularity mapping support, MADV_COLLAPSE is only supported for shared VMAs right now. MADV_COLLAPSE for HugeTLB takes the mmap_lock for writing. It is important that we check PageHWPoison before checking !HPageMigratable, as PageHWPoison implies !HPageMigratable. !PageHWPoison && !HPageMigratable means that the page has been isolated for migration. 
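
[Editor's illustration: a minimal userspace sketch of the intended call order, assuming the region was already fully populated at 4K granularity with UFFDIO_CONTINUE. The helper name and the addr/len values are hypothetical; the MADV_COLLAPSE fallback define is only for older libc headers.]

#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* may be missing from older uapi/libc headers */
#endif

/*
 * Ask the kernel to collapse a HugeTLB region back to hugepage-sized
 * mappings after userspace has finished mapping it with UFFDIO_CONTINUE.
 * @addr and @len are placeholders supplied by the caller and must cover
 * the high-granularity-mapped range.
 */
int collapse_region(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE)) {
		perror("madvise(MADV_COLLAPSE)");
		return -1;
	}
	return 0;
}
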
Signed-off-by: James Houghton diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 70bd867eba94..fa63a56ebaf0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -218,9 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t= *pud, =20 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice); -int madvise_collapse(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end); +int madvise_collapse_thp(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end); void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, unsigned long end, long adjust_next); spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); @@ -358,9 +358,9 @@ static inline int hugepage_madvise(struct vm_area_struc= t *vma, return -EINVAL; } =20 -static inline int madvise_collapse(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end) +static inline int madvise_collapse_thp(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) { return -EINVAL; } diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index e0e51bb06112..6cd4ae08d84d 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -1278,6 +1278,8 @@ bool hugetlb_hgm_eligible(struct vm_area_struct *vma); int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *= mm, struct vm_area_struct *vma, unsigned long start, unsigned long end); +int hugetlb_collapse(struct mm_struct *mm, unsigned long start, + unsigned long end); #else static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma) { @@ -1298,6 +1300,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hp= te, struct mm_struct *mm, { return -EINVAL; } +static inline +int hugetlb_collapse(struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + return -EINVAL; +} #endif =20 static inline spinlock_t *huge_pte_lock(struct hstate *h, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a00b4ac07046..c4d189e5f1fd 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -8014,6 +8014,158 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *h= pte, struct mm_struct *mm, return 0; } =20 +/* + * Collapse the address range from @start to @end to be mapped optimally. + * + * This is only valid for shared mappings. The main use case for this func= tion + * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire huge= page + * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't = know + * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we le= ave + * it up to userspace to tell us to do so, via MADV_COLLAPSE. + * + * Any holes in the mapping will be filled. If there is no page in the + * pagecache for a region we're collapsing, the PTEs will be cleared. + * + * If high-granularity PTEs are uffd-wp markers, those markers will be dro= pped. 
+ */ +static int __hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct = *vma, + unsigned long start, unsigned long end) +{ + struct hstate *h =3D hstate_vma(vma); + struct address_space *mapping =3D vma->vm_file->f_mapping; + struct mmu_notifier_range range; + struct mmu_gather tlb; + unsigned long curr =3D start; + int ret =3D 0; + struct folio *folio; + struct page *subpage; + pgoff_t idx; + bool writable =3D vma->vm_flags & VM_WRITE; + struct hugetlb_pte hpte; + pte_t entry; + spinlock_t *ptl; + + /* + * This is only supported for shared VMAs, because we need to look up + * the page to use for any PTEs we end up creating. + */ + if (!(vma->vm_flags & VM_MAYSHARE)) + return -EINVAL; + + /* If HGM is not enabled, there is nothing to collapse. */ + if (!hugetlb_hgm_enabled(vma)) + return 0; + + tlb_gather_mmu(&tlb, mm); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start, end); + mmu_notifier_invalidate_range_start(&range); + + while (curr < end) { + ret =3D hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end); + if (ret) + goto out; + + entry =3D huge_ptep_get(hpte.ptep); + + /* + * There is no work to do if the PTE doesn't point to page + * tables. + */ + if (!pte_present(entry)) + goto next_hpte; + if (hugetlb_pte_present_leaf(&hpte, entry)) + goto next_hpte; + + idx =3D vma_hugecache_offset(h, vma, curr); + folio =3D filemap_get_folio(mapping, idx); + + if (folio && folio_test_hwpoison(folio)) { + /* + * Don't collapse a mapping to a page that is + * hwpoisoned. The entire page will be poisoned. + * + * When HugeTLB supports poisoning PAGE_SIZE bits of + * the hugepage, the logic here can be improved. + * + * Skip this page, and continue to collapse the rest + * of the mapping. + */ + folio_put(folio); + curr =3D (curr & huge_page_mask(h)) + huge_page_size(h); + continue; + } + + if (folio && !folio_test_hugetlb_migratable(folio)) { + /* + * Don't collapse a mapping to a page that is pending + * a migration. Migration swap entries may have placed + * in the page table. + */ + ret =3D -EBUSY; + folio_put(folio); + goto out; + } + + /* + * Clear all the PTEs, and drop ref/mapcounts + * (on tlb_finish_mmu). + */ + __unmap_hugepage_range(&tlb, vma, curr, + curr + hugetlb_pte_size(&hpte), + NULL, + ZAP_FLAG_DROP_MARKER); + /* Free the PTEs. */ + hugetlb_free_pgd_range(&tlb, + curr, curr + hugetlb_pte_size(&hpte), + curr, curr + hugetlb_pte_size(&hpte)); + + ptl =3D hugetlb_pte_lock(&hpte); + + if (!folio) { + huge_pte_clear(mm, curr, hpte.ptep, + hugetlb_pte_size(&hpte)); + spin_unlock(ptl); + goto next_hpte; + } + + subpage =3D hugetlb_find_subpage(h, folio, curr); + entry =3D make_huge_pte_with_shift(vma, subpage, + writable, hpte.shift); + hugetlb_add_file_rmap(subpage, hpte.shift, h, vma); + set_huge_pte_at(mm, curr, hpte.ptep, entry); + spin_unlock(ptl); +next_hpte: + curr +=3D hugetlb_pte_size(&hpte); + } +out: + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb); + + return ret; +} + +int hugetlb_collapse(struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + int ret =3D 0; + struct vm_area_struct *vma; + + mmap_write_lock(mm); + while (start < end || ret) { + vma =3D find_vma(mm, start); + if (!vma || !is_vm_hugetlb_page(vma)) { + ret =3D -EINVAL; + break; + } + ret =3D __hugetlb_collapse(mm, vma, start, + end < vma->vm_end ? 
end : vma->vm_end); + start =3D vma->vm_end; + } + mmap_write_unlock(mm); + return ret; +} + #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ =20 /* diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 8dbc39896811..58cda5020537 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2750,8 +2750,8 @@ static int madvise_collapse_errno(enum scan_result r) } } =20 -int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **p= rev, - unsigned long start, unsigned long end) +int madvise_collapse_thp(struct vm_area_struct *vma, struct vm_area_struct= **prev, + unsigned long start, unsigned long end) { struct collapse_control *cc; struct mm_struct *mm =3D vma->vm_mm; diff --git a/mm/madvise.c b/mm/madvise.c index 8c004c678262..e121d135252a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1028,6 +1028,24 @@ static int madvise_split(struct vm_area_struct *vma, #endif } =20 +static int madvise_collapse(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + if (is_vm_hugetlb_page(vma)) { + struct mm_struct *mm =3D vma->vm_mm; + int ret; + + *prev =3D NULL; /* tell sys_madvise we dropped the mmap lock */ + mmap_read_unlock(mm); + ret =3D hugetlb_collapse(mm, start, end); + mmap_read_lock(mm); + return ret; + } + + return madvise_collapse_thp(vma, prev, start, end); +} + /* * Apply an madvise behavior to a region of a vma. madvise_update_vma * will handle splitting a vm area into separate areas, each area with its= own @@ -1204,6 +1222,9 @@ madvise_behavior_valid(int behavior) #ifdef CONFIG_TRANSPARENT_HUGEPAGE case MADV_HUGEPAGE: case MADV_NOHUGEPAGE: +#endif +#if defined(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) || \ + defined(CONFIG_TRANSPARENT_HUGEPAGE) case MADV_COLLAPSE: #endif #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING @@ -1397,7 +1418,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsig= ned long start, * MADV_NOHUGEPAGE - mark the given range as not worth being backed by * transparent huge pages so the existing pages will not be * coalesced into THP and new pages will not be allocated as THP. - * MADV_COLLAPSE - synchronously coalesce pages into new THP. + * MADV_COLLAPSE - synchronously coalesce pages into new THP, or, for Hug= eTLB + * pages, collapse the mapping. * MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows * UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions. 
* MADV_DONTDUMP - the application wants to prevent pages in the given ra= nge --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F4F1C636D6 for ; Sat, 18 Feb 2023 00:31:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230144AbjBRAb0 (ORCPT ); Fri, 17 Feb 2023 19:31:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230146AbjBRAaM (ORCPT ); Fri, 17 Feb 2023 19:30:12 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F6F168ADA for ; Fri, 17 Feb 2023 16:29:32 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id l206-20020a25ccd7000000b006fdc6aaec4fso2657430ybf.20 for ; Fri, 17 Feb 2023 16:29:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6FTaaiplXAWGfEVqPjHy0St/TFfLKP8Xa0naqtqNKxQ=; b=lcSeD08U2fduS0Ppbd/dwZ1sVutpIAR5utesmektFgF0YNOfket/WkjCBRvoXlNLiV +o2Aogs/gc+f50ybLSWxFhCjImIYIGSTQeUOhgRWJa7vrQRBVyoRIEpYpw2YRWcA9U41 WyAW0Ab++KWiCATJErqWrBwpBVdFZBa+QSN736IFEusFdICh8RUH5fL8c0v/Y8PEr8Br Qe8eJUa76VSmD3HP1n1esGFDMJpGGHWAJXm6Yqaf73w38JldN0uF+PQwqsmagB2uCrvE JVZuFpSJuNpEY8qrsEqwZB/8hVTroKRCuulZOvJKAfU7a7tyMxK2NmKF8QMW0V3Aw8B3 CM8g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6FTaaiplXAWGfEVqPjHy0St/TFfLKP8Xa0naqtqNKxQ=; b=K1KSJZ9YFAyJsz43Br442nFW0U4ssibTHJMG7pQsQKWIxnesYRgA4wbbqr4Z881fBX dK3MS7I6pU7Mx55dHflnuxeOedKjDxKBSJoOJGYlReg8dEhj6qCykr2rpqz11f20qzTZ AWun5V5kLmGlHHgCZjCYpqfl8TB2k5Dawf+E9Gd2W5ess2+BqOXQbbE+uKFnd6CYO+D8 9aCpBStExiuIPJTCX3CBTPeHq959bzWB93vUGjyW9x4HU6ej3/F6bavTFZcCyAIEIObC CJOLuEH+XlUkokbYN48R6FMY2f/GrbrFg7fCVERIiAh+aD7nYZftpZZonr2+q0mL7Wgt OWzQ== X-Gm-Message-State: AO0yUKVv8EFChHr8M62685HM6hGYzYQbGp8QJDyfAiMMJ3lH4dFHPedm NKs2JTgyBBwpDbe59h7dRNvmANcDObnJxF5q X-Google-Smtp-Source: AK7set8ceT73sh6JAXytc6nngKglB94I/eez0+U3CbTB6Hmq6cgS64DE3iyooTxML7Ni0PUJg/39Zu2G+xo8+s0D X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:f49:0:b0:995:ccb:1aae with SMTP id y9-20020a5b0f49000000b009950ccb1aaemr85936ybr.13.1676680159411; Fri, 17 Feb 2023 16:29:19 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:08 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-36-jthoughton@google.com> Subject: [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" With high-granularity mappings, it becomes quite trivial for userspace to overflow a page's refcount or mapcount. It can be done like so: 1. Create a 1G hugetlbfs file with a single 1G page. 2. Create 8192 mappings of the file. 3. Use UFFDIO_CONTINUE to map every mapping at entirely 4K. Each time step 3 is done for a mapping, the refcount and mapcount will increase by 2^19 (512 * 512). Do that 2^13 times (8192), and you reach 2^31. To avoid this, WARN_ON_ONCE when the refcount goes negative. If this happens as a result of a page fault, return VM_FAULT_SIGBUS, and if it happens as a result of a UFFDIO_CONTINUE, return EFAULT. We can also create too many mappings by fork()ing a lot with VMAs setup such that page tables must be copied at fork()-time (like if we have VM_UFFD_WP). Use try_get_page() in copy_hugetlb_page_range() to deal with this. Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c4d189e5f1fd..34368072dabe 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5397,7 +5397,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, s= truct mm_struct *src, } else { ptepage =3D pte_page(entry); hpage =3D compound_head(ptepage); - get_page(hpage); + if (try_get_page(hpage)) { + ret =3D -EFAULT; + break; + } =20 /* * Failing to duplicate the anon rmap is a rare case @@ -6132,6 +6135,30 @@ static bool hugetlb_pte_stable(struct hstate *h, str= uct hugetlb_pte *hpte, return same; } =20 +/* + * Like filemap_lock_folio, but check the refcount of the page afterwards = to + * check if we are at risk of overflowing refcount back to 0. + * + * This should be used in places that can be used to easily overflow refco= unt, + * like places that create high-granularity mappings. + */ +static struct folio *hugetlb_try_find_lock_folio(struct address_space *map= ping, + pgoff_t idx) +{ + struct folio *folio =3D filemap_lock_folio(mapping, idx); + + /* + * This check is very similar to the one in try_get_page(). + * + * This check is inherently racy, so WARN_ON_ONCE() if this condition + * ever occurs. + */ + if (WARN_ON_ONCE(folio && folio_ref_count(folio) <=3D 0)) + return ERR_PTR(-EFAULT); + + return folio; +} + static vm_fault_t hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping, pgoff_t idx, @@ -6168,7 +6195,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *= mm, * before we get page_table_lock. */ new_folio =3D false; - folio =3D filemap_lock_folio(mapping, idx); + folio =3D hugetlb_try_find_lock_folio(mapping, idx); + if (IS_ERR(folio)) { + /* + * We don't want to invoke the OOM killer here, as we aren't + * actually OOMing. 
+ */ + ret =3D VM_FAULT_SIGBUS; + goto out; + } if (!folio) { size =3D i_size_read(mapping->host) >> huge_page_shift(h); if (idx >=3D size) @@ -6600,8 +6635,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, =20 if (is_continue) { ret =3D -EFAULT; - folio =3D filemap_lock_folio(mapping, idx); - if (!folio) + folio =3D hugetlb_try_find_lock_folio(mapping, idx); + if (IS_ERR_OR_NULL(folio)) goto out; folio_in_pagecache =3D true; } else if (!*pagep) { --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 340FEC6379F for ; Sat, 18 Feb 2023 00:31:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230316AbjBRAbc (ORCPT ); Fri, 17 Feb 2023 19:31:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43030 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230174AbjBRAaM (ORCPT ); Fri, 17 Feb 2023 19:30:12 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 349BA6CA1D for ; Fri, 17 Feb 2023 16:29:33 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5366333bdb5so15830587b3.19 for ; Fri, 17 Feb 2023 16:29:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=HPj8If6i9Ge054c2yi88NtPUxX7MkhEBgezIKsLW8z8=; b=Kd84UzoUTRPGM2pruNTbVhhDlhlYmwG9wMjKAkrrxmwn/CJAyxsS6jZ/LzaQVwYC3z S69dpVgrGNsnQ2VWeasqchjAp3jDCn9n8jxsJDYFL1UCwLksmO5Kw+6uUwl7zSSgvaCl CihuI8Yf+eChioQnTWfweqD4fr36yuVoZcedwO4ze5S8nNa5G9axS81vDsPaWdtLrroV 8UuCTi7HBc8+DG1q9YpGK/r09QKm4gOgG0V2UzM7mQ71xE6K5hJlSxLD1cNbFDXLleV5 yEaFpVguoifalvGf8MAWgne1cnF863zM85vjZWzRVitIqTT0Ms2eNQkJwx5cA7n7LbdD 9FRQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=HPj8If6i9Ge054c2yi88NtPUxX7MkhEBgezIKsLW8z8=; b=cdAD0PJMBpZmSMX9kpEE1e591YBc8vBZk/37zYRi+/nIjI/M5bLOGnX28nO3fSOTWr rRtap00aSisfJ/zGbAGrFRG0gwl3qc3dUvxAloHaOuMB3N5dcx4QonWSs5DJP65JQS2N CLfCnDCPpIG2OncnKrQYeUHZMKYFw8qQAXWP4PgRabHyCrJaqwG3p+suxUZQ1E+k1e8E NSWFkTSKknYZf7/s5+jBohunRl6B/yCnDQC9yJX82Xtax+zu1jRX3D2fHW613QadXebb Tg7YlmwWnYRbDrK77hImSQkOB6vAzel/af+kuV3AcGAoRHUqQCP94EZfaw+ehbg6R3CX Y1Cg== X-Gm-Message-State: AO0yUKWEIqqYnKcARBAUWaU5VPoEqvWxKtvYPMBhDWU8zbXu3ZuPQU/M SpS1biMp+v1hVYDgtUqgXuv5e4g16cQCfdgz X-Google-Smtp-Source: AK7set/mcEFSHmue9cMCcVJMB1Ku9p/0QTBXluCyDE4PFP+ttIIRE+dDut302gTdIA8Yp2aY7RngxiVOCfe3YAYP X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:1cd:b0:985:3b30:f27 with SMTP id u13-20020a05690201cd00b009853b300f27mr245191ybh.13.1676680160446; Fri, 17 Feb 2023 16:29:20 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:09 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-37-jthoughton@google.com> Subject: [PATCH v2 36/46] 
hugetlb: remove huge_pte_lock and huge_pte_lockptr From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" They are replaced with hugetlb_pte_lock{,ptr}. All callers that haven't already been replaced don't get called when using HGM, so we handle them by populating hugetlb_ptes with the standard, hstate-sized huge PTEs. Signed-off-by: James Houghton diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 035a0df47af0..c90ac06dc8d9 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -258,11 +258,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct = *vma, =20 #ifdef CONFIG_PPC_BOOK3S_64 struct hstate *h =3D hstate_vma(vma); + struct hugetlb_pte hpte; =20 psize =3D hstate_get_psize(h); #ifdef CONFIG_DEBUG_VM - assert_spin_locked(huge_pte_lockptr(huge_page_shift(h), - vma->vm_mm, ptep)); + /* HGM is not supported for powerpc yet. */ + hugetlb_pte_init(&hpte, ptep, huge_page_shift(h), + hpage_size_to_level(psize)); + assert_spin_locked(hpte.ptl); #endif =20 #else diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 6cd4ae08d84d..742e7f2cb170 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -1012,14 +1012,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hs= tate *h, gfp_t gfp_mask) return modified_mask; } =20 -static inline spinlock_t *huge_pte_lockptr(unsigned int shift, - struct mm_struct *mm, pte_t *pte) -{ - if (shift =3D=3D PMD_SHIFT) - return pmd_lockptr(mm, (pmd_t *) pte); - return &mm->page_table_lock; -} - #ifndef hugepages_supported /* * Some platform decide whether they support huge pages at boot @@ -1228,12 +1220,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hs= tate *h, gfp_t gfp_mask) return 0; } =20 -static inline spinlock_t *huge_pte_lockptr(unsigned int shift, - struct mm_struct *mm, pte_t *pte) -{ - return &mm->page_table_lock; -} - static inline void hugetlb_count_init(struct mm_struct *mm) { } @@ -1308,16 +1294,6 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned = long start, } #endif =20 -static inline spinlock_t *huge_pte_lock(struct hstate *h, - struct mm_struct *mm, pte_t *pte) -{ - spinlock_t *ptl; - - ptl =3D huge_pte_lockptr(huge_page_shift(h), mm, pte); - spin_lock(ptl); - return ptl; -} - static inline spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte) { @@ -1353,8 +1329,22 @@ void hugetlb_pte_init(struct mm_struct *mm, struct h= ugetlb_pte *hpte, pte_t *ptep, unsigned int shift, enum hugetlb_level level) { - __hugetlb_pte_init(hpte, ptep, shift, level, - huge_pte_lockptr(shift, mm, ptep)); + spinlock_t *ptl; + + /* + * For contiguous HugeTLB PTEs that can contain other HugeTLB PTEs + * on the same level, the same PTL for both must be used. + * + * For some architectures that implement hugetlb_walk_step, this + * version of hugetlb_pte_populate() may not be correct to use for + * high-granularity PTEs. Instead, call __hugetlb_pte_populate() + * directly. 
+ */ + if (level =3D=3D HUGETLB_LEVEL_PMD) + ptl =3D pmd_lockptr(mm, (pmd_t *) ptep); + else + ptl =3D &mm->page_table_lock; + __hugetlb_pte_init(hpte, ptep, shift, level, ptl); } =20 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 34368072dabe..e0a92e7c1755 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5454,9 +5454,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, put_page(hpage); =20 /* Install the new hugetlb folio if src pte stable */ - dst_ptl =3D huge_pte_lock(h, dst, dst_pte); - src_ptl =3D huge_pte_lockptr(huge_page_shift(h), - src, src_pte); + dst_ptl =3D hugetlb_pte_lock(&dst_hpte); + src_ptl =3D hugetlb_pte_lockptr(&src_hpte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); entry =3D huge_ptep_get(src_pte); if (!pte_same(src_pte_old, entry)) { @@ -7582,7 +7581,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm= _area_struct *vma, unsigned long saddr; pte_t *spte =3D NULL; pte_t *pte; - spinlock_t *ptl; + struct hugetlb_pte hpte; + struct hstate *shstate; =20 i_mmap_lock_read(mapping); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { @@ -7603,7 +7603,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct v= m_area_struct *vma, if (!spte) goto out; =20 - ptl =3D huge_pte_lock(hstate_vma(vma), mm, spte); + shstate =3D hstate_vma(svma); + + hugetlb_pte_init(mm, &hpte, spte, huge_page_shift(shstate), + hpage_size_to_level(huge_page_size(shstate))); + spin_lock(hpte.ptl); if (pud_none(*pud)) { pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK)); @@ -7611,7 +7615,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm= _area_struct *vma, } else { put_page(virt_to_page(spte)); } - spin_unlock(ptl); + spin_unlock(hpte.ptl); out: pte =3D (pte_t *)pmd_alloc(mm, pud, addr); i_mmap_unlock_read(mapping); @@ -8315,6 +8319,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struc= t *vma, unsigned long address; spinlock_t *ptl; pte_t *ptep; + struct hugetlb_pte hpte; =20 if (!(vma->vm_flags & VM_MAYSHARE)) return; @@ -8336,7 +8341,10 @@ static void hugetlb_unshare_pmds(struct vm_area_stru= ct *vma, ptep =3D hugetlb_walk(vma, address, sz); if (!ptep) continue; - ptl =3D huge_pte_lock(h, mm, ptep); + + hugetlb_pte_init(mm, &hpte, ptep, huge_page_shift(h), + hpage_size_to_level(sz)); + ptl =3D hugetlb_pte_lock(&hpte); huge_pmd_unshare(mm, vma, address, ptep); spin_unlock(ptl); } --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 163F9C636D6 for ; Sat, 18 Feb 2023 00:31:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230223AbjBRAbd (ORCPT ); Fri, 17 Feb 2023 19:31:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43142 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230055AbjBRAaQ (ORCPT ); Fri, 17 Feb 2023 19:30:16 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A257964B22 for ; Fri, 17 Feb 2023 16:29:37 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5365a8e6d8dso18114837b3.7 for ; Fri, 17 Feb 2023 16:29:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=GaGw9/EzNbJa/6t4pJ7yTdVKyp/NxSZHrGSlaKMUK+E=; b=JocgWYINYVmULfBzV4L8RQDVrC/ksz9Y8dI+J+nlAsc/Rp1x6bcRPeZoTXr9WfAAcE 637vbboTsrEowR0AFIppjUO9sj33exLylRWnzGJRB4Ke4REBcvqZSCDhAGTVp/j5YufQ YgJobCIYoQmdFEhMN4GhlXFaOyyGAAfpa0yi+wp2F9k5f0rcjtQQjbs0cPJHzrVzFUhk 50adDYi9tmH1ngFR2JYpBWBk6Q0MNk3dZbjF//o0K6YVOKO0ibfrqlf5m/HRlbVFSL22 zZbQkatRHAWrwIFzYMCpOzJ34fpVzrSFt1DFYR7+DUtjlcSXBgnd1OMWELyjBlQGE0L4 N6FQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=GaGw9/EzNbJa/6t4pJ7yTdVKyp/NxSZHrGSlaKMUK+E=; b=BFhwe3q2T8PM0Yr8wQ1Cqa/BHYcdYzAxA0xGpGwXX6DctVUR//wKKoH8z5uuOJQGTO 3W3hqX48V9aKR4G9H9ksHPutSSh3cYJHxyz7xOiDEyLeGoYTV6qNnuar2tCoIDkmgg1X Xl71DxR9aFonk/tpwTX2C0ekHh5Q5pECu7G8uA3UcqJw/MCm7YA26j5IpWGvMA/6BGjZ MJOLgrz0eQbMUVVzSHReTSb9Ac9EtOk/OZU+qbe++7VEnppEo8HzkmbAFmiewkkffDBW O19lAB/bAA4MkXUaCP2m72/tbQwtrProSD3/VLf9xrnGYJT65OMtqvoVFyhMSW50tt1x gMqA== X-Gm-Message-State: AO0yUKVAz+dU636JJcw48hVFNoHmrPqzGjFomNzoTz3lZWs+D6AEmf3F QE6vX3e5z69mUWVb2/BKh0k2gl1kGt+UTwOO X-Google-Smtp-Source: AK7set9wC+TS+nhluAugC3NIlRB72Nwu/shM7r466HV9itDauVe/hJjpxls4Y6ZfyXihDUoWjGrwgClTAbgrS6MV X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:12c8:b0:8e3:6aea:973 with SMTP id j8-20020a05690212c800b008e36aea0973mr91564ybu.4.1676680161464; Fri, 17 Feb 2023 16:29:21 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:10 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-38-jthoughton@google.com> Subject: [PATCH v2 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This removes the old definition of make_huge_pte, where now we always require the shift to be explicitly given. All callsites are cleaned up. 
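To illustrate the new calling convention (this sketch is not part of the patch; example_install_pte is a made-up helper, and hugetlb_pte/make_huge_pte are as defined earlier in this series), a caller now picks the shift explicitly, typically from the hugetlb_pte it is installing rather than implicitly from the VMA's hstate:

/*
 * Illustrative only: a hugetlb.c-internal caller after this patch. The
 * mapping size comes from the hugetlb_pte being installed, not from the
 * hstate default.
 */
static void example_install_pte(struct vm_area_struct *vma, unsigned long addr,
				struct hugetlb_pte *hpte, struct page *subpage,
				int writable)
{
	/* hpte->shift may be PAGE_SHIFT, PMD_SHIFT, or huge_page_shift(h). */
	pte_t entry = make_huge_pte(vma, subpage, writable, hpte->shift);

	set_huge_pte_at(vma->vm_mm, addr, hpte->ptep, entry);
}
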
Signed-off-by: James Houghton diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e0a92e7c1755..4c9b3c5379b2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5204,9 +5204,9 @@ const struct vm_operations_struct hugetlb_vm_ops =3D { .pagesize =3D hugetlb_vm_op_pagesize, }; =20 -static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma, - struct page *page, int writable, - int shift) +static pte_t make_huge_pte(struct vm_area_struct *vma, + struct page *page, int writable, + int shift) { pte_t entry; =20 @@ -5222,14 +5222,6 @@ static pte_t make_huge_pte_with_shift(struct vm_area= _struct *vma, return entry; } =20 -static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, - int writable) -{ - unsigned int shift =3D huge_page_shift(hstate_vma(vma)); - - return make_huge_pte_with_shift(vma, page, writable, shift); -} - static void set_huge_ptep_writable(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { @@ -5272,7 +5264,9 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte= _t *ptep, unsigned long add { __folio_mark_uptodate(new_folio); hugepage_add_new_anon_rmap(new_folio, vma, addr); - set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, &new_folio->pa= ge, 1)); + set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte( + vma, &new_folio->page, 1, + huge_page_shift(hstate_vma(vma)))); hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm); folio_set_hugetlb_migratable(new_folio); } @@ -6006,7 +6000,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, st= ruct vm_area_struct *vma, hugetlb_remove_rmap(old_page, huge_page_shift(h), h, vma); hugepage_add_new_anon_rmap(new_folio, vma, haddr); set_huge_pte_at(mm, haddr, ptep, - make_huge_pte(vma, &new_folio->page, !unshare)); + make_huge_pte(vma, &new_folio->page, !unshare, + huge_page_shift(h))); folio_set_hugetlb_migratable(new_folio); /* Make the old page be freed below */ new_folio =3D page_folio(old_page); @@ -6348,7 +6343,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *m= m, else hugetlb_add_file_rmap(subpage, hpte->shift, h, vma); =20 - new_pte =3D make_huge_pte_with_shift(vma, subpage, + new_pte =3D make_huge_pte(vma, subpage, ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED)), hpte->shift); @@ -6770,8 +6765,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, else writable =3D dst_vma->vm_flags & VM_WRITE; =20 - _dst_pte =3D make_huge_pte_with_shift(dst_vma, subpage, writable, - dst_hpte->shift); + _dst_pte =3D make_huge_pte(dst_vma, subpage, writable, dst_hpte->shift); /* * Always mark UFFDIO_COPY page dirty; note that this may not be * extremely important for hugetlbfs for now since swapping is not @@ -8169,8 +8163,7 @@ static int __hugetlb_collapse(struct mm_struct *mm, s= truct vm_area_struct *vma, } =20 subpage =3D hugetlb_find_subpage(h, folio, curr); - entry =3D make_huge_pte_with_shift(vma, subpage, - writable, hpte.shift); + entry =3D make_huge_pte(vma, subpage, writable, hpte.shift); hugetlb_add_file_rmap(subpage, hpte.shift, h, vma); set_huge_pte_at(mm, curr, hpte.ptep, entry); spin_unlock(ptl); --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CE80EC05027 for ; Sat, 18 Feb 2023 00:31:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230327AbjBRAbg (ORCPT ); Fri, 
17 Feb 2023 19:31:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43248 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229656AbjBRAan (ORCPT ); Fri, 17 Feb 2023 19:30:43 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 50FED6D264 for ; Fri, 17 Feb 2023 16:29:41 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 75-20020a250b4e000000b0090f2c84a6a4so1998119ybl.13 for ; Fri, 17 Feb 2023 16:29:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ep+t2caOQSahkPTOUAWt22QPS6ka0wL1iktlvruOFtk=; b=eXxtIitdVZvzFBf04daw9PPgRtXdmihKwPkQjcig6vMYYMtp+LOPKmFy/GNPGDZBuZ hUMbzRNTCrNmLeDYHlIEgRctvtpQ9FpU/o7UPh1ZIp/HRvg6DXU0nPANTJe2xRrjJGwp WbMuaqtCTStJZiCZ1A+9pK76quqdRz7D7vwYcnbCxzAH8ZqKG92mdTDkW+Ff5KgL16Oc bfQ+hpZtyfxk3DBK5nZlhD8pUxO/J77oxCvJiRMYzZvWPl+L1gHyATsIUcMQzvI17Xkm FGAjE5aD9xBtLlj157PrQ56C2U6XXydzea3z40X2BAfKNbG/jjnHVSZCkfPzbPOV1U8m sBLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ep+t2caOQSahkPTOUAWt22QPS6ka0wL1iktlvruOFtk=; b=igxrMWXMAxA7kza0kCO3chs5gaHN3oSNNdh2nuAunXS+8wAL7oOwQ+rhkrjWxVoTf8 EVX0yoJ3V8+kMcjBVjBMszCM2yzc+bs1xWkRaGX7uRoHrIDMX4kB09h8UKm1vRyAVItd WDX2AYVDMNrG7jYGSfAlxK445m0lyfCRCSr2qpy/hEfvkpEEBgsmgDNgeWicOaEYSrH0 G2Xp4e0LhCffszc+NoX+wNQ3vvP7TZwCfP1tEp14wSzygiEWe3KzEygI3e/3ViDgZ9fP sR67ROMJSk3v8XzSYGxFdUVK6e/PRwRl9ZWnsklfRLl//Td+OJJxnmTJ1NptelZmS1Tn /2mQ== X-Gm-Message-State: AO0yUKXwvMCaOPUT9aIEKkf93kLZyazWuikaxQMfNujn+4XYscJrtJf5 bXqtdllh5G05ocgpPnCvo62Q5qWAfQXTF5PC X-Google-Smtp-Source: AK7set+tBwsrzsGfYGkr6mqGYtCrd2pD30kKOdycyIYEofwc08SH08MLRIHONTv2mYLxvGH/o6jI+3RKVfH+ffxg X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:d4b:0:b0:8da:6dc5:ca06 with SMTP id f11-20020a5b0d4b000000b008da6dc5ca06mr215488ybr.7.1676680162408; Fri, 17 Feb 2023 16:29:22 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:11 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-39-jthoughton@google.com> Subject: [PATCH v2 38/46] mm: smaps: add stats for HugeTLB mapping size From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When the kernel is compiled with HUGETLB_HIGH_GRANULARITY_MAPPING, smaps may provide HugetlbPudMapped, HugetlbPmdMapped, and HugetlbPteMapped. Levels that are folded will not be outputted. 
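As an illustration (not part of this patch), the new counters can be read back from /proc/<pid>/smaps with something like the sketch below. The fields only exist on kernels built with CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING, and folded levels never appear, so a parser should treat missing fields as zero.

/* Illustrative userspace helper: sum the per-level HugeTLB counters. */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, pud = 0, pmd = 0, pte = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "HugetlbPudMapped: %lu kB", &kb) == 1)
			pud += kb;
		else if (sscanf(line, "HugetlbPmdMapped: %lu kB", &kb) == 1)
			pmd += kb;
		else if (sscanf(line, "HugetlbPteMapped: %lu kB", &kb) == 1)
			pte += kb;
	}
	fclose(f);
	printf("HugetlbPudMapped %lu kB, HugetlbPmdMapped %lu kB, HugetlbPteMapped %lu kB\n",
	       pud, pmd, pte);
	return 0;
}
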
Signed-off-by: James Houghton diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 2f293b5dabc0..1ced7300f8cd 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -412,6 +412,15 @@ struct mem_size_stats { unsigned long swap; unsigned long shared_hugetlb; unsigned long private_hugetlb; +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING +#ifndef __PAGETABLE_PUD_FOLDED + unsigned long hugetlb_pud_mapped; +#endif +#ifndef __PAGETABLE_PMD_FOLDED + unsigned long hugetlb_pmd_mapped; +#endif + unsigned long hugetlb_pte_mapped; +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ u64 pss; u64 pss_anon; u64 pss_file; @@ -731,6 +740,33 @@ static void show_smap_vma_flags(struct seq_file *m, st= ruct vm_area_struct *vma) } =20 #ifdef CONFIG_HUGETLB_PAGE + +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING +static void smaps_hugetlb_hgm_account(struct mem_size_stats *mss, + struct hugetlb_pte *hpte) +{ + unsigned long size =3D hugetlb_pte_size(hpte); + + switch (hpte->level) { +#ifndef __PAGETABLE_PUD_FOLDED + case HUGETLB_LEVEL_PUD: + mss->hugetlb_pud_mapped +=3D size; + break; +#endif +#ifndef __PAGETABLE_PMD_FOLDED + case HUGETLB_LEVEL_PMD: + mss->hugetlb_pmd_mapped +=3D size; + break; +#endif + case HUGETLB_LEVEL_PTE: + mss->hugetlb_pte_mapped +=3D size; + break; + default: + break; + } +} +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ + static int smaps_hugetlb_range(struct hugetlb_pte *hpte, unsigned long addr, struct mm_walk *walk) @@ -764,6 +800,9 @@ static int smaps_hugetlb_range(struct hugetlb_pte *hpte, mss->shared_hugetlb +=3D sz; else mss->private_hugetlb +=3D sz; +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING + smaps_hugetlb_hgm_account(mss, hpte); +#endif } return 0; } @@ -833,38 +872,47 @@ static void smap_gather_stats(struct vm_area_struct *= vma, static void __show_smap(struct seq_file *m, const struct mem_size_stats *m= ss, bool rollup_mode) { - SEQ_PUT_DEC("Rss: ", mss->resident); - SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT); - SEQ_PUT_DEC(" kB\nPss_Dirty: ", mss->pss_dirty >> PSS_SHIFT); + SEQ_PUT_DEC("Rss: ", mss->resident); + SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT); + SEQ_PUT_DEC(" kB\nPss_Dirty: ", mss->pss_dirty >> PSS_SHIFT); if (rollup_mode) { /* * These are meaningful only for smaps_rollup, otherwise two of * them are zero, and the other one is the same as Pss. 
*/ - SEQ_PUT_DEC(" kB\nPss_Anon: ", + SEQ_PUT_DEC(" kB\nPss_Anon: ", mss->pss_anon >> PSS_SHIFT); - SEQ_PUT_DEC(" kB\nPss_File: ", + SEQ_PUT_DEC(" kB\nPss_File: ", mss->pss_file >> PSS_SHIFT); - SEQ_PUT_DEC(" kB\nPss_Shmem: ", + SEQ_PUT_DEC(" kB\nPss_Shmem: ", mss->pss_shmem >> PSS_SHIFT); } - SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean); - SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty); - SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean); - SEQ_PUT_DEC(" kB\nPrivate_Dirty: ", mss->private_dirty); - SEQ_PUT_DEC(" kB\nReferenced: ", mss->referenced); - SEQ_PUT_DEC(" kB\nAnonymous: ", mss->anonymous); - SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); - SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); - SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); - SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); - SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); - seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ", + SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean); + SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty); + SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean); + SEQ_PUT_DEC(" kB\nPrivate_Dirty: ", mss->private_dirty); + SEQ_PUT_DEC(" kB\nReferenced: ", mss->referenced); + SEQ_PUT_DEC(" kB\nAnonymous: ", mss->anonymous); + SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); + SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); + SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); + SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); + SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); + seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ", mss->private_hugetlb >> 10, 7); - SEQ_PUT_DEC(" kB\nSwap: ", mss->swap); - SEQ_PUT_DEC(" kB\nSwapPss: ", +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING +#ifndef __PAGETABLE_PUD_FOLDED + SEQ_PUT_DEC(" kB\nHugetlbPudMapped: ", mss->hugetlb_pud_mapped); +#endif +#ifndef __PAGETABLE_PMD_FOLDED + SEQ_PUT_DEC(" kB\nHugetlbPmdMapped: ", mss->hugetlb_pmd_mapped); +#endif + SEQ_PUT_DEC(" kB\nHugetlbPteMapped: ", mss->hugetlb_pte_mapped); +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ + SEQ_PUT_DEC(" kB\nSwap: ", mss->swap); + SEQ_PUT_DEC(" kB\nSwapPss: ", mss->swap_pss >> PSS_SHIFT); - SEQ_PUT_DEC(" kB\nLocked: ", + SEQ_PUT_DEC(" kB\nLocked: ", mss->pss_locked >> PSS_SHIFT); seq_puts(m, " kB\n"); } @@ -880,18 +928,18 @@ static int show_smap(struct seq_file *m, void *v) =20 show_map_vma(m, vma); =20 - SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start); - SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma)); - SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma)); + SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start); + SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma)); + SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma)); seq_puts(m, " kB\n"); =20 __show_smap(m, &mss, false); =20 - seq_printf(m, "THPeligible: %d\n", + seq_printf(m, "THPeligible: %d\n", hugepage_vma_check(vma, vma->vm_flags, true, false, true)); =20 if (arch_pkeys_enabled()) - seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); + seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); show_smap_vma_flags(m, vma); =20 return 0; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4041C636D6 for ; Sat, 18 Feb 2023 00:31:41 +0000 (UTC) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230333AbjBRAbk (ORCPT ); Fri, 17 Feb 2023 19:31:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43388 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229795AbjBRAao (ORCPT ); Fri, 17 Feb 2023 19:30:44 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DDBFE5BDB0 for ; Fri, 17 Feb 2023 16:29:43 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53655de27a1so28453887b3.14 for ; Fri, 17 Feb 2023 16:29:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=YyhF10hcbQMtiIcRNPe2j9wgCL3nu0mYgfomzauIh1M=; b=mdWiwuhXmCO7ZvCnfOzT5kGyeI1B76654OSvGqg0v4SfpBfq8gE8Y2HxCtDlqyAoCe RlH3iNC1O3POA4w07JzQSLH9WOuQMtWGYGTcuZDqfaF7LyXaloMakwXqjQejR67+NzLM ZSebxnW4BLvieTC+ErPKwo8+NE2d2ITkqYp8qePD98ii1puW00fdoHibbpwmm78EnXnc e9mKjUDG0xvqx47GeoE/+zEj2Ty0K8brFhlfqMVjJV2l/MZO0YGU5DUAH3uw/T0E0FYc e6nZsvIWbjAB08oOCLK03Uzk6o/Dc+xelLgiCeHsfs3lPkfiBKCFNyXIzhr8GfZEa1Cd rW1w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=YyhF10hcbQMtiIcRNPe2j9wgCL3nu0mYgfomzauIh1M=; b=ldTZJnUaB9vjveagyNoKM5vwgl2MAh23/erCGa2jZLxRTuWUtZIZAZJ/Yw+7oA2Wq5 vQIfUkBpg2CT3xPQsJomH8uJZIEmI4nAw9FRI8vKGHXzTNP0H11uiniSbmHI8O/wf6Hx ypMwPIdAQ8vB/YFN8ac9m357B71xxCoJxkc2rGC/uzvNiq/dMtyAKPWM0JhwLlLoecEF +rYz5fzdhZ+qi78ddrqW7DsRyw5F1re3ZpWDF6GfYMSjV+0eaDU1ihBECmY2fzilr60T JSQYMTTiQJa1BYxIy+h3NFcHH6pNrqSzdbX+h198sn3GZhNsbBMJd8ZF15fYxnkaNIHj 2diQ== X-Gm-Message-State: AO0yUKX4Yt//dgMwaoDJjkyVOFg0of0EPrlLBHMTjA5znVF8LSgevKft ATzq4Stpg8E/1bWZNxREnyiUJPp8tTSluhdh X-Google-Smtp-Source: AK7set887EplN25RNMt2wpSeZlZy1r4wkCtXdk+wH3AhIORU9/UsNcsfMi31QN73nnyYQPAtcJJOO2NybrEvv00c X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:3:b0:90d:af77:9ca6 with SMTP id l3-20020a056902000300b0090daf779ca6mr34196ybh.7.1676680163234; Fri, 17 Feb 2023 16:29:23 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:12 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-40-jthoughton@google.com> Subject: [PATCH v2 39/46] hugetlb: x86: enable high-granularity mapping for x86_64 From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Now that HGM is fully supported for GENERAL_HUGETLB, we can enable it for x86_64. We can only enable it for 64-bit architectures because the vm flag VM_HUGETLB_HGM uses a high bit. 
The x86 KVM MMU already properly handles HugeTLB HGM pages (it does a page table walk to determine which size to use in the second-stage page table instead of, for example, checking vma_mmu_pagesize, like arm64 does). Signed-off-by: James Houghton diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3604074a878b..fde9ba1dd8d7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -126,6 +126,7 @@ config X86 select ARCH_WANT_GENERAL_HUGETLB select ARCH_WANT_HUGE_PMD_SHARE select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64 + select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING if X86_64 select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WANTS_THP_SWAP if X86_64 select ARCH_HAS_PARANOID_L1D_FLUSH --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 23106C636D6 for ; Sat, 18 Feb 2023 00:31:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230334AbjBRAbv (ORCPT ); Fri, 17 Feb 2023 19:31:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43536 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230255AbjBRAbA (ORCPT ); Fri, 17 Feb 2023 19:31:00 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E23DB6B303 for ; Fri, 17 Feb 2023 16:29:48 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5366c22f138so12022157b3.10 for ; Fri, 17 Feb 2023 16:29:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=meim6QzwmgVJzeI2+Wk1IxU1uXO2GoWG7kArQ5zE/oE=; b=jBDVTyUiFjSinjlesXO7Wsr29fC05xWRfrl7nAZ2Vi6vctWsz/7+NsXy6Dx9zE02oD Zi9rU6CuILwSQ9MmYCjyB8BnkobSNAXN9wk5uSKN5uHSYEobIzebiBr+QJLAoe2BO2h7 BvQ5nLWnyJ34pOp7tHtrWAltvM5a5A6wWV9ntVMJs7Np7CHhXIh92iq7fs97WdSGdx7k chX9KZzHAChL7Krh0NrLbnAxcDgyvNutktvSzeK5qogN04D9EfgapHMPtcUexRrTONeu Lv9LpAtAQc7gW1fQL5eLIT4qqKfM5pCyZOYdI8D3vWXHR6ii/aIgrxnOVziUFMcqw1D2 IOGw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=meim6QzwmgVJzeI2+Wk1IxU1uXO2GoWG7kArQ5zE/oE=; b=r9kBBNbu5C7zF9kg9HlCbryiCKEMfgSmCMiS9p17srDDiOLXWw0JCWMaaBm7HkUcEk JiEqsGuNuPcyjI7wVqhebUTmNuHCTpzR3AGF6SoDMlrtBApOC+H+tzgm+4Y0Ym/PG8X1 3hl7nWSwToKJYOR7ip6ybr/o4nQReZP1LDPK9SlsD9d5mo4V0/d5AoI8RMKczKbnaOHB /ypS+HKWaQIHQM2NCklGtC0XA8/+HmemPYX9EGKp9a5Tq/dGIj932jlWXUSCmR8oYtyG Lx0IVyk6bVGldLk4t3e57vMocycQhE9P1DiH19J5BKynT9Ge5yLp7wAy+MLQx9b8DWv/ IEqw== X-Gm-Message-State: AO0yUKVLfGWNzaR72MBJRWN42cQ5keEcpIq6yFdv6Nj879acuYCCCJh5 5XlX+hv58eiAg+mwVZHuOJUZomf1QNpkhM7+ X-Google-Smtp-Source: AK7set8GEY/pRGyIuQeSw27GrNwKq4LLQxkMXSq0msXVI/vUEuxHEPRagJyvSuTMY/SNITKnNSR3i2I9F+YeDJM+ X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:144:0:b0:91c:90b6:f48a with SMTP id c4-20020a5b0144000000b0091c90b6f48amr1373069ybp.580.1676680164340; Fri, 17 Feb 2023 16:29:24 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:13 +0000 In-Reply-To: 
<20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-41-jthoughton@google.com> Subject: [PATCH v2 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Include information about how MADV_SPLIT should be used to enable high-granularity UFFDIO_CONTINUE operations, and include information about how MADV_COLLAPSE should be used to collapse the mappings at the end. Signed-off-by: James Houghton diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/a= dmin-guide/mm/hugetlbpage.rst index a969a2c742b2..c6eaef785609 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -454,6 +454,10 @@ errno set to EINVAL or exclude hugetlb pages that exte= nd beyond the length if not hugepage aligned. For example, munmap(2) will fail if memory is backe= d by a hugetlb page and the length is smaller than the hugepage size. =20 +It is possible for users to map HugeTLB pages at a higher granularity than +normal using HugeTLB high-granularity mapping (HGM). For example, when usi= ng 1G +pages on x86, a user could map that page with 4K PTEs, 2M PMDs, a combinat= ion of +the two. See Documentation/admin-guide/mm/userfaultfd.rst. =20 Examples =3D=3D=3D=3D=3D=3D=3D=3D diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/a= dmin-guide/mm/userfaultfd.rst index 83f31919ebb3..cc496a307ea2 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -169,7 +169,13 @@ like to do to resolve it: the page cache). Userspace has the option of modifying the page's contents before resolving the fault. Once the contents are correct (modified or not), userspace asks the kernel to map the page and let the - faulting thread continue with ``UFFDIO_CONTINUE``. + faulting thread continue with ``UFFDIO_CONTINUE``. If this is done at the + base-page size in a transparent-hugepage-eligible VMA or in a HugeTLB VMA + (requires ``MADV_SPLIT``), then userspace may want to use + ``MADV_COLLAPSE`` when a hugepage is fully populated to inform the kernel + that it may be able to collapse the mapping. ``MADV_COLLAPSE`` will undo + the effect of any ``UFFDIO_WRITEPROTECT`` calls on the collapsed address + range. 
=20 Notes: =20 --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 273A4C636D6 for ; Sat, 18 Feb 2023 00:32:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230339AbjBRAb7 (ORCPT ); Fri, 17 Feb 2023 19:31:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43108 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230294AbjBRAbW (ORCPT ); Fri, 17 Feb 2023 19:31:22 -0500 Received: from mail-vk1-xa49.google.com (mail-vk1-xa49.google.com [IPv6:2607:f8b0:4864:20::a49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6D4E36D25F for ; Fri, 17 Feb 2023 16:29:55 -0800 (PST) Received: by mail-vk1-xa49.google.com with SMTP id o73-20020a1f414c000000b0040163d749ecso646898vka.11 for ; Fri, 17 Feb 2023 16:29:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=rVcqYSKj99pGLWaPkurlgRpM4dq8t04RSFrIu2L3zgA=; b=q4c+hojX4+dTZygB2Tks0txtgbVxj7x8TKkFZ4tMc+sqQf6GeN73tsvcqCiW02hDp4 8XPu7pKfyuoWyWf0l6efa5Zl8QTeR0XBVVaGrv+Wd9WG+jAQtWkF9yUFkUA6luQj+6cl jVGauOVy3CkA+wV1fgTRQ99Wfw3xgg3rfIr1+mhVfMwPBkQ+w4aIyIAy3UP92Sz6TP2b rQ4LOt/rb25Xw9+JRDlRHS/Ym0me/mMiwA3HH1eYVYoYEY9NAXA/3czNh8h0WNs5P9nw XagFyrun6V7H5uqPSe2BtAfdYCj11o7oZZljG6RocZ2iOY2K7t3KE+FE2jG7TajzTzWr Fm/Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=rVcqYSKj99pGLWaPkurlgRpM4dq8t04RSFrIu2L3zgA=; b=pjf/EGM78TOgMpolN8znEAxniik47ZXYh5R5s/capKMnFXX/Q3WNp8/9OwTLD63mQP Q2LrDR27bc/XpMxZe6U5HZv3oMOFpOjUE8HNmB0QiLKMgrvGWYrtWV1mpwpJFH+vVN9m uwQJm74HVJhwFmrgUbQ+hSSbl3yhhSV2/o9UtOJQ3zCLx3tFKJlJmj5JJWcc3hy1YaXq b65RDP6sm43bE3gUoUALCtwNuUEhRRJ3v3+fOTCPYdPZKmS+xSy03O4H7jdrgn8Cn406 q2nuyToL8vegjh3IxWAV4K+l6s/JBrF2qT7oEVgEaA8nKrasx2VCjqqBiazdKH9lddKT gwLg== X-Gm-Message-State: AO0yUKX17HODU8+NeSdCcMejTk5b3+rDJVdm65o8wQvE5HesSQhJEaRr n9KyYfRvneJ4nZ1BNNTVQgj2i+WSWqTmQbnZ X-Google-Smtp-Source: AK7set9Tf2rc8YP45MnilA/p/YM3vrBRBTo5dK3QsJoCd53f/+o/OkfQqN5UhVm88d9/++IMJB7K0rC38dMHYTlb X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:ab0:100c:0:b0:68b:9eed:1c7d with SMTP id f12-20020ab0100c000000b0068b9eed1c7dmr77489uab.0.1676680165444; Fri, 17 Feb 2023 16:29:25 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:14 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-42-jthoughton@google.com> Subject: [PATCH v2 41/46] docs: proc: include information about HugeTLB HGM From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Include the updates that have been made to smaps, specifically, the addition of Hugetlb[Pud,Pmd,Pte]Mapped. Signed-off-by: James Houghton diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems= /proc.rst index e224b6d5b642..1d2a1cd1fe6a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -447,29 +447,32 @@ Memory Area, or VMA) there is a series of lines such = as the following:: =20 08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash =20 - Size: 1084 kB - KernelPageSize: 4 kB - MMUPageSize: 4 kB - Rss: 892 kB - Pss: 374 kB - Pss_Dirty: 0 kB - Shared_Clean: 892 kB - Shared_Dirty: 0 kB - Private_Clean: 0 kB - Private_Dirty: 0 kB - Referenced: 892 kB - Anonymous: 0 kB - LazyFree: 0 kB - AnonHugePages: 0 kB - ShmemPmdMapped: 0 kB - Shared_Hugetlb: 0 kB - Private_Hugetlb: 0 kB - Swap: 0 kB - SwapPss: 0 kB - KernelPageSize: 4 kB - MMUPageSize: 4 kB - Locked: 0 kB - THPeligible: 0 + Size: 1084 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Rss: 892 kB + Pss: 374 kB + Pss_Dirty: 0 kB + Shared_Clean: 892 kB + Shared_Dirty: 0 kB + Private_Clean: 0 kB + Private_Dirty: 0 kB + Referenced: 892 kB + Anonymous: 0 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + HugetlbPudMapped: 0 kB + HugetlbPmdMapped: 0 kB + HugetlbPteMapped: 0 kB + Swap: 0 kB + SwapPss: 0 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Locked: 0 kB + THPeligible: 0 VmFlags: rd ex mr mw me dw =20 The first of these lines shows the same information as is displayed for the @@ -510,10 +513,15 @@ implementation. If this is not desirable please file = a bug report. "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by huge pages. =20 -"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed = by +"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historic= al reasons. And these are not included in {Shared,Private}_{Clean,Dirty} fiel= d. =20 +If the kernel was compiled with ``CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING`= `, +"HugetlbPudMapped", "HugetlbPmdMapped", and "HugetlbPteMapped" may appear = and +show the amount of HugeTLB memory mapped with PUDs, PMDs, and PTEs respect= ively. +Folded levels won't appear. See Documentation/admin-guide/mm/hugetlbpage.r= st. + "Swap" shows how much would-be-anonymous memory is also used, but out on s= wap. 
=20 For shmem mappings, "Swap" includes also the size of the mapped (and not --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 732B1C636D6 for ; Sat, 18 Feb 2023 00:32:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230212AbjBRAcC (ORCPT ); Fri, 17 Feb 2023 19:32:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43136 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230186AbjBRAb0 (ORCPT ); Fri, 17 Feb 2023 19:31:26 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 57F846ABC1 for ; Fri, 17 Feb 2023 16:29:57 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5365a8dd33aso18706267b3.22 for ; Fri, 17 Feb 2023 16:29:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=+gBzjJ9Vs83x35smR0BnSI5ARmUioiACF/H5X92ziQg=; b=FTUzD2siWOamshg6dDNFdmcX2aYM0u3nvjpwU9uc75sLtLAKVBJAHVwAMjybOOM6aS VCEL8peUI2WwLbzU6ATlSC6+Wvz1/sc5w5zoKT+HEYVEE6xJEDLkP6KrFxiX5oVihBDo s3dn54TE92bh/yI06SLWHvNp5s6gfTErs9rDsWAFHd6g5rwlUEwUOvW2OPB06dtmQD1G M0MjN8FlEcXs2Za2RbC+His2Zwd7Jw8oRFe61lIYiwkcZmU3IZmQOKMZVZ3pE48rUE+w Jbg305rP5BtXo+lCwZaFP+1TrtMNjBmig60ExoWFI/t+qWpnbo+4TveqyihYwoX+lHie dTlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+gBzjJ9Vs83x35smR0BnSI5ARmUioiACF/H5X92ziQg=; b=d0kcJro0ct/CLHJ1ryQvyijgRD04ZHzWCEhlOA/vG+GlzhaYF9R8QXAZSeegBOxOgE jx0fF96tdGOCsC+4putq4nWgAOI1Yfp32zWxboQ/I4HdU3yBOwPD4PftVClD37zEV60t LnOeFF8a8Q0YNsOU7BKqB86hcEzCQdjWQQG/E3J50AJnNUJafvwurp2OM9jtOLA9MTaB pKrA1E+Rmb1lekboTtmLbA3O71QBk70M5j6SkTFrVKFRpV2hxcUwIOfrb0ROUeYCgDa4 KqO0iHoNCP1PXl/2yJaAUkUM0LSMRSx4lqE1kBgneFsIQpj974sr23qOV1XO8MCWFplS ZXQg== X-Gm-Message-State: AO0yUKW9wgp4zI1V83Qie45I5qL3BeepMIfBtygYPj8zX2Gh9mPJKzPr kIBmcZurBZVkxPfivog247kCaoRyO+pMmPuh X-Google-Smtp-Source: AK7set8hCmHkXvdJYET09bfYSTtiB1w0qW6BeHXJttgQrcwICFN7nXKvhHomAteDlPagr6evNf+A5cHI9ctrKXl9 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a81:8706:0:b0:50b:429e:a9ef with SMTP id x6-20020a818706000000b0050b429ea9efmr1329552ywf.434.1676680166676; Fri, 17 Feb 2023 16:29:26 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:15 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-43-jthoughton@google.com> Subject: [PATCH v2 42/46] selftests/mm: add HugeTLB HGM to userfaultfd selftest From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This test case behaves similarly to the regular shared HugeTLB configuration, except that it uses 4K instead of hugepages, and that we ignore the UFFDIO_COPY tests, as UFFDIO_CONTINUE is the only ioctl that supports PAGE_SIZE-aligned regions. This doesn't test MADV_COLLAPSE. Other tests are added later to exercise MADV_COLLAPSE. Signed-off-by: James Houghton diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selft= ests/mm/userfaultfd.c index 7f22844ed704..681c5c5f863b 100644 --- a/tools/testing/selftests/mm/userfaultfd.c +++ b/tools/testing/selftests/mm/userfaultfd.c @@ -73,9 +73,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu= , page_size, hpage_size; #define BOUNCE_POLL (1<<3) static int bounces; =20 -#define TEST_ANON 1 -#define TEST_HUGETLB 2 -#define TEST_SHMEM 3 +#define TEST_ANON 1 +#define TEST_HUGETLB 2 +#define TEST_HUGETLB_HGM 3 +#define TEST_SHMEM 4 static int test_type; =20 #define UFFD_FLAGS (O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY) @@ -93,6 +94,8 @@ static volatile bool test_uffdio_zeropage_eexist =3D true; static bool test_uffdio_wp =3D true; /* Whether to test uffd minor faults */ static bool test_uffdio_minor =3D false; +static bool test_uffdio_copy =3D true; + static bool map_shared; static int mem_fd; static unsigned long long *count_verify; @@ -151,7 +154,7 @@ static void usage(void) fprintf(stderr, "\nUsage: ./userfaultfd " "[hugetlbfs_file]\n\n"); fprintf(stderr, "Supported : anon, hugetlb, " - "hugetlb_shared, shmem\n\n"); + "hugetlb_shared, hugetlb_shared_hgm, shmem\n\n"); fprintf(stderr, "'Test mods' can be joined to the test type string with a= ':'. " "Supported mods:\n"); fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n"); @@ -167,6 +170,11 @@ static void usage(void) exit(1); } =20 +static bool test_is_hugetlb(void) +{ + return test_type =3D=3D TEST_HUGETLB || test_type =3D=3D TEST_HUGETLB_HGM; +} + #define _err(fmt, ...) 
\ do { \ int ret =3D errno; \ @@ -381,7 +389,7 @@ static struct uffd_test_ops *uffd_test_ops; =20 static inline uint64_t uffd_minor_feature(void) { - if (test_type =3D=3D TEST_HUGETLB && map_shared) + if (test_is_hugetlb() && map_shared) return UFFD_FEATURE_MINOR_HUGETLBFS; else if (test_type =3D=3D TEST_SHMEM) return UFFD_FEATURE_MINOR_SHMEM; @@ -393,7 +401,7 @@ static uint64_t get_expected_ioctls(uint64_t mode) { uint64_t ioctls =3D UFFD_API_RANGE_IOCTLS; =20 - if (test_type =3D=3D TEST_HUGETLB) + if (test_is_hugetlb()) ioctls &=3D ~(1 << _UFFDIO_ZEROPAGE); =20 if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp)) @@ -500,13 +508,16 @@ static void uffd_test_ctx_clear(void) static void uffd_test_ctx_init(uint64_t features) { unsigned long nr, cpu; + uint64_t enabled_features =3D features; =20 uffd_test_ctx_clear(); =20 uffd_test_ops->allocate_area((void **)&area_src, true); uffd_test_ops->allocate_area((void **)&area_dst, false); =20 - userfaultfd_open(&features); + userfaultfd_open(&enabled_features); + if ((enabled_features & features) !=3D features) + err("couldn't enable all features"); =20 count_verify =3D malloc(nr_pages * sizeof(unsigned long long)); if (!count_verify) @@ -726,13 +737,16 @@ static void uffd_handle_page_fault(struct uffd_msg *m= sg, struct uffd_stats *stats) { unsigned long offset; + unsigned long address; =20 if (msg->event !=3D UFFD_EVENT_PAGEFAULT) err("unexpected msg event %u", msg->event); =20 + address =3D msg->arg.pagefault.address; + if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) { /* Write protect page faults */ - wp_range(uffd, msg->arg.pagefault.address, page_size, false); + wp_range(uffd, address, page_size, false); stats->wp_faults++; } else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) { uint8_t *area; @@ -751,11 +765,10 @@ static void uffd_handle_page_fault(struct uffd_msg *m= sg, */ =20 area =3D (uint8_t *)(area_dst + - ((char *)msg->arg.pagefault.address - - area_dst_alias)); + ((char *)address - area_dst_alias)); for (b =3D 0; b < page_size; ++b) area[b] =3D ~area[b]; - continue_range(uffd, msg->arg.pagefault.address, page_size); + continue_range(uffd, address, page_size); stats->minor_faults++; } else { /* @@ -782,7 +795,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg, if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) err("unexpected write fault"); =20 - offset =3D (char *)(unsigned long)msg->arg.pagefault.address - area_dst; + offset =3D (char *)address - area_dst; offset &=3D ~(page_size-1); =20 if (copy_page(uffd, offset)) @@ -1192,6 +1205,12 @@ static int userfaultfd_events_test(void) char c; struct uffd_stats stats =3D { 0 }; =20 + if (!test_uffdio_copy) { + printf("Skipping userfaultfd events test " + "(test_uffdio_copy=3Dfalse)\n"); + return 0; + } + printf("testing events (fork, remap, remove): "); fflush(stdout); =20 @@ -1245,6 +1264,12 @@ static int userfaultfd_sig_test(void) char c; struct uffd_stats stats =3D { 0 }; =20 + if (!test_uffdio_copy) { + printf("Skipping userfaultfd signal test " + "(test_uffdio_copy=3Dfalse)\n"); + return 0; + } + printf("testing signal delivery: "); fflush(stdout); =20 @@ -1329,6 +1354,11 @@ static int userfaultfd_minor_test(void) =20 uffd_test_ctx_init(uffd_minor_feature()); =20 + if (test_type =3D=3D TEST_HUGETLB_HGM) + /* Enable high-granularity userfaultfd ioctls for HugeTLB */ + if (madvise(area_dst_alias, nr_pages * page_size, MADV_SPLIT)) + err("MADV_SPLIT failed"); + uffdio_register.range.start =3D (unsigned long)area_dst_alias; 
uffdio_register.range.len =3D nr_pages * page_size; uffdio_register.mode =3D UFFDIO_REGISTER_MODE_MINOR; @@ -1538,6 +1568,12 @@ static int userfaultfd_stress(void) pthread_attr_init(&attr); pthread_attr_setstacksize(&attr, 16*1024*1024); =20 + if (!test_uffdio_copy) { + printf("Skipping userfaultfd stress test " + "(test_uffdio_copy=3Dfalse)\n"); + bounces =3D 0; + } + while (bounces--) { printf("bounces: %d, mode:", bounces); if (bounces & BOUNCE_RANDOM) @@ -1696,6 +1732,16 @@ static void set_test_type(const char *type) uffd_test_ops =3D &hugetlb_uffd_test_ops; /* Minor faults require shared hugetlb; only enable here. */ test_uffdio_minor =3D true; + } else if (!strcmp(type, "hugetlb_shared_hgm")) { + map_shared =3D true; + test_type =3D TEST_HUGETLB_HGM; + uffd_test_ops =3D &hugetlb_uffd_test_ops; + /* + * HugeTLB HGM only changes UFFDIO_CONTINUE, so don't test + * UFFDIO_COPY. + */ + test_uffdio_minor =3D true; + test_uffdio_copy =3D false; } else if (!strcmp(type, "shmem")) { map_shared =3D true; test_type =3D TEST_SHMEM; @@ -1731,6 +1777,7 @@ static void parse_test_type_arg(const char *raw_type) err("Unsupported test: %s", raw_type); =20 if (test_type =3D=3D TEST_HUGETLB) + /* TEST_HUGETLB_HGM gets small pages. */ page_size =3D hpage_size; else page_size =3D sysconf(_SC_PAGE_SIZE); @@ -1813,22 +1860,29 @@ int main(int argc, char **argv) nr_cpus =3D x < y ? x : y; } nr_pages_per_cpu =3D bytes / page_size / nr_cpus; + if (test_type =3D=3D TEST_HUGETLB_HGM) + /* + * `page_size` refers to the page_size we can use in + * UFFDIO_CONTINUE. We still need nr_pages to be appropriately + * aligned, so align it here. + */ + nr_pages_per_cpu -=3D nr_pages_per_cpu % (hpage_size / page_size); if (!nr_pages_per_cpu) { _err("invalid MiB"); usage(); } + nr_pages =3D nr_pages_per_cpu * nr_cpus; =20 bounces =3D atoi(argv[3]); if (bounces <=3D 0) { _err("invalid bounces"); usage(); } - nr_pages =3D nr_pages_per_cpu * nr_cpus; =20 - if (test_type =3D=3D TEST_SHMEM || test_type =3D=3D TEST_HUGETLB) { + if (test_type =3D=3D TEST_SHMEM || test_is_hugetlb()) { unsigned int memfd_flags =3D 0; =20 - if (test_type =3D=3D TEST_HUGETLB) + if (test_is_hugetlb()) memfd_flags =3D MFD_HUGETLB; mem_fd =3D memfd_create(argv[0], memfd_flags); if (mem_fd < 0) --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 66B20C05027 for ; Sat, 18 Feb 2023 00:32:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230351AbjBRAcV (ORCPT ); Fri, 17 Feb 2023 19:32:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43138 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230082AbjBRAb6 (ORCPT ); Fri, 17 Feb 2023 19:31:58 -0500 Received: from mail-ua1-x949.google.com (mail-ua1-x949.google.com [IPv6:2607:f8b0:4864:20::949]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 375EA6EBB7 for ; Fri, 17 Feb 2023 16:30:25 -0800 (PST) Received: by mail-ua1-x949.google.com with SMTP id f13-20020ab060ad000000b0068e6c831945so354397uam.12 for ; Fri, 17 Feb 2023 16:30:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; 
bh=8+Dfh3tVbu9NNOAuUQbMjQlcnIuGhIGz3LkhsyysEBM=; b=oC2H6v+tP8sxgSW/qaTkqpk/qQGFu8OqZp2lefzh5tpvjsOdoYS95fEme6g5r45Srg kK+SX7TkKqLd7NIyWtb5jibyXnm1w9QJR9oIYR7KWDFmDwxHgvIJ02BbB1UgyoTqeQL8 89D9RKCX1G+/4ORDV+WYjrLRuKzAfDY/6QJdu8zlAz9TQtaROdykMDea9qi+W5pMHyEx npoMTQfezy/k56FzTnBrsh96SmGvpoH+UHWk3RlvjqE9gAwVIEdHIsW/fo6+TLVqF5XS Xw+BwwbTSNp8vpWaaHiMcSOExGAoahht2ywAUDVHMW8zc5X++I6XujZ6B7mVjrEIJDFO c0nA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8+Dfh3tVbu9NNOAuUQbMjQlcnIuGhIGz3LkhsyysEBM=; b=W4h0YZbq75V0QN5By17cfqrW36X76UtfA4wbewCMkqucHQrWHYG8HTKtPkNDEutdry FAebfldqH6FoXa8JdYiVVptCHqWtpf9u90eJXYhtckya/Kfc5azBbZ8Hd/bUO8MKd3BW ksT0N/gegloxr9Bx61iv+JmE7abi/N+YMkPWTMWzO0MiPU3St32jdv1LwVuJomnDC4HK JFEyf0xCBSBAEpcmv1IJgsqsPIODxR3V7+4TnMBwHvjG2uXR7BPcB1oV909RWB9oeVju 3WxmjiydZ69HIc+vSr3o6iHsdMjzsVIhItXTDsv1SaROmWrjaS0g0M549W/8cwMIaozW 0hgg== X-Gm-Message-State: AO0yUKW7lPU5qmyIC4zdUW9M3HzkelpQVotwhtH2Gji1UZIqrkJLS5o1 +I/+B2SEyCAs9XebYf+0pwKuMlW5+OdRMpJJ X-Google-Smtp-Source: AK7set9ZL3hrEgPVLxygWFVRWkaZt5AY1vSRq45PWq0dcOktmSchjvn/Wzp1mppyMhkKtwEqnwlJ/llO3mZ7TTDg X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a1f:a013:0:b0:401:9bc6:c40c with SMTP id j19-20020a1fa013000000b004019bc6c40cmr552024vke.20.1676680183989; Fri, 17 Feb 2023 16:29:43 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:16 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-44-jthoughton@google.com> Subject: [PATCH v2 43/46] KVM: selftests: add HugeTLB HGM to KVM demand paging selftest From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This test exercises the GUP paths for HGM. MADV_COLLAPSE is not tested. 
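For reference, the userspace sequence this selftest drives (and that a VMM would use) looks roughly like the sketch below. It is not part of the patch: it assumes a kernel with this series applied, uses the MADV_SPLIT value proposed by the series, assumes a 2M default hugepage size, and trims all error handling. UFFDIO_CONTINUE, minor faults, and memfd_create(MFD_HUGETLB) are existing uapi.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MADV_SPLIT
#define MADV_SPLIT 26	/* value proposed by this series */
#endif

int main(void)
{
	size_t len = 2UL << 20;	/* one hugepage, assuming a 2M default size */
	long page_size = sysconf(_SC_PAGE_SIZE);
	int memfd = memfd_create("hgm", MFD_HUGETLB);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_HUGETLBFS,
	};
	char *alias, *area;

	ftruncate(memfd, len);
	/* One mapping to populate the page cache, one to take minor faults. */
	alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	memset(alias, 0x5a, len);	/* allocate and fill the hugepage */

	ioctl(uffd, UFFDIO_API, &api);
	/* Opt the mapping in to high-granularity userfaultfd operations. */
	madvise(area, len, MADV_SPLIT);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * A real user resolves minor-fault events read from uffd in another
	 * thread; here we map just the first base page up front to show the
	 * PAGE_SIZE-granularity UFFDIO_CONTINUE that HGM makes possible.
	 */
	struct uffdio_continue cont = {
		.range = { .start = (unsigned long)area, .len = page_size },
		.mode = 0,
	};
	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		perror("UFFDIO_CONTINUE");
	printf("first byte: %#x\n", area[0]);
	return 0;
}
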
Signed-off-by: James Houghton diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testi= ng/selftests/kvm/demand_paging_test.c index b0e1fc4de9e2..e534f9c927bf 100644 --- a/tools/testing/selftests/kvm/demand_paging_test.c +++ b/tools/testing/selftests/kvm/demand_paging_test.c @@ -170,7 +170,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) uffd_descs[i] =3D uffd_setup_demand_paging( p->uffd_mode, p->uffd_delay, vcpu_hva, vcpu_args->pages * memstress_args.guest_page_size, - &handle_uffd_page_request); + p->src_type, &handle_uffd_page_request); } } =20 diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testin= g/selftests/kvm/include/test_util.h index 80d6416f3012..a2106c19a614 100644 --- a/tools/testing/selftests/kvm/include/test_util.h +++ b/tools/testing/selftests/kvm/include/test_util.h @@ -103,6 +103,7 @@ enum vm_mem_backing_src_type { VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB, VM_MEM_SRC_SHMEM, VM_MEM_SRC_SHARED_HUGETLB, + VM_MEM_SRC_SHARED_HUGETLB_HGM, NUM_SRC_TYPES, }; =20 @@ -121,6 +122,7 @@ size_t get_def_hugetlb_pagesz(void); const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i= ); size_t get_backing_src_pagesz(uint32_t i); bool is_backing_src_hugetlb(uint32_t i); +bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type); void backing_src_help(const char *flag); enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name); long get_run_delay(void); diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools= /testing/selftests/kvm/include/userfaultfd_util.h index 877449c34592..d91528a58245 100644 --- a/tools/testing/selftests/kvm/include/userfaultfd_util.h +++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h @@ -26,9 +26,9 @@ struct uffd_desc { pthread_t thread; }; =20 -struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay, - void *hva, uint64_t len, - uffd_handler_t handler); +struct uffd_desc *uffd_setup_demand_paging( + int uffd_mode, useconds_t delay, void *hva, uint64_t len, + enum vm_mem_backing_src_type src_type, uffd_handler_t handler); =20 void uffd_stop_demand_paging(struct uffd_desc *uffd); =20 diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/sel= ftests/kvm/lib/kvm_util.c index 56d5ea949cbb..b9c398dc295d 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -981,7 +981,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, region->fd =3D -1; if (backing_src_is_shared(src_type)) region->fd =3D kvm_memfd_alloc(region->mmap_size, - src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB); + is_backing_src_shared_hugetlb(src_type)); =20 region->mmap_start =3D mmap(NULL, region->mmap_size, PROT_READ | PROT_WRITE, diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/se= lftests/kvm/lib/test_util.c index 5c22fa4c2825..712a0878932e 100644 --- a/tools/testing/selftests/kvm/lib/test_util.c +++ b/tools/testing/selftests/kvm/lib/test_util.c @@ -271,6 +271,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_= src_alias(uint32_t i) */ .flag =3D MAP_SHARED, }, + [VM_MEM_SRC_SHARED_HUGETLB_HGM] =3D { + /* + * Identical to shared_hugetlb except for the name. 
+ */ + .name =3D "shared_hugetlb_hgm", + .flag =3D MAP_SHARED, + }, }; _Static_assert(ARRAY_SIZE(aliases) =3D=3D NUM_SRC_TYPES, "Missing new backing src types?"); @@ -289,6 +296,7 @@ size_t get_backing_src_pagesz(uint32_t i) switch (i) { case VM_MEM_SRC_ANONYMOUS: case VM_MEM_SRC_SHMEM: + case VM_MEM_SRC_SHARED_HUGETLB_HGM: return getpagesize(); case VM_MEM_SRC_ANONYMOUS_THP: return get_trans_hugepagesz(); @@ -305,6 +313,12 @@ bool is_backing_src_hugetlb(uint32_t i) return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB); } =20 +bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type) +{ + return src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB || + src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB_HGM; +} + static void print_available_backing_src_types(const char *prefix) { int i; diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/tes= ting/selftests/kvm/lib/userfaultfd_util.c index 92cef20902f1..3c7178d6c4f4 100644 --- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c +++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c @@ -25,6 +25,10 @@ =20 #ifdef __NR_userfaultfd =20 +#ifndef MADV_SPLIT +#define MADV_SPLIT 26 +#endif + static void *uffd_handler_thread_fn(void *arg) { struct uffd_desc *uffd_desc =3D (struct uffd_desc *)arg; @@ -108,9 +112,9 @@ static void *uffd_handler_thread_fn(void *arg) return NULL; } =20 -struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay, - void *hva, uint64_t len, - uffd_handler_t handler) +struct uffd_desc *uffd_setup_demand_paging( + int uffd_mode, useconds_t delay, void *hva, uint64_t len, + enum vm_mem_backing_src_type src_type, uffd_handler_t handler) { struct uffd_desc *uffd_desc; bool is_minor =3D (uffd_mode =3D=3D UFFDIO_REGISTER_MODE_MINOR); @@ -140,6 +144,10 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mo= de, useconds_t delay, "ioctl UFFDIO_API failed: %" PRIu64, (uint64_t)uffdio_api.api); =20 + if (src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB_HGM) + TEST_ASSERT(!madvise(hva, len, MADV_SPLIT), + "Could not enable HGM"); + uffdio_register.range.start =3D (uint64_t)hva; uffdio_register.range.len =3D len; uffdio_register.mode =3D uffd_mode; --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D13EC636D6 for ; Sat, 18 Feb 2023 00:32:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230356AbjBRAcY (ORCPT ); Fri, 17 Feb 2023 19:32:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42016 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230306AbjBRAb7 (ORCPT ); Fri, 17 Feb 2023 19:31:59 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 038CB6EBA3 for ; Fri, 17 Feb 2023 16:30:26 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53659b9818dso20142317b3.18 for ; Fri, 17 Feb 2023 16:30:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=hyOLv1FsgINJ0DL+GVviZenHuSxZX0xiNJERAdFTwJY=; b=CawVGssLxbR/eBsnpR76uSOwVJlkmsmCv0a5oroE0FUoq5CLC5qHPXGXefy+nfOAyq 
vHPyCbLQuykbScgNRy8G6ArFuhS2PA24tAoAR0gKyBhUDTykhHwShI6arftzkplVbyyk b4WDZ3892nTCDc681eGtskse+KIsH+dqtfTG2Vne6aDgRhNRdxfjHtbN/sR4SQs6gHc0 qUEwaDcngWDaQt95+PjWHSmLfyKw65IbdY5tMd2l4nt3+4TnGgT5wI1VLYvFO5GyJcwD HDiIQ3YFBSVkmnMVLHQ4cyRGkHdEevZ48y3mNmJQPrsPUHrSdwcp+QpVtufpdBI9D2OH id0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=hyOLv1FsgINJ0DL+GVviZenHuSxZX0xiNJERAdFTwJY=; b=xw41vqJJWN4lrBYOs2cIIcNvUKe2MA7KDvGo6Mm1y1ACe+TrkjuiBC7v4CX6nQdmVC UrIj+dKaHseR6QC141WSiVoaYFnRTJgmVX/Zs6pQTkglq6PEDuf5Q1emoxMIeC6k7/q6 EZaQqSi7UqqR5p+bNU5xYqN9IM06UIv90HpOqNcV+HKkD5wQOZ18HLmDgJs6Jr1hcjt/ BGKOpAx4GReqAmvXjh5oRcqfFJm0OreUo8P3T9u3RyBvmfg8lQimSBvjiLprY39PM0jh +kELcCExAybInH2/CEtW4mWOoSMXEkyEdwLnftFzCw7pP2yabL5UmhyaFScYz8BIHmAQ 3RBg== X-Gm-Message-State: AO0yUKXaak9bUALie6YOEEGSZt4pWJFBbP5WN1RHpJmGhYAkHx9kWX/N DkqagcKkL0rFSTIGBPgnh1AT8+3aPym9J6EB X-Google-Smtp-Source: AK7set+1gp+3xzyRWa+pkBelE3wO17FQrBJLukvYGoDbOlAPndL6HG5Z6zOqFqMyhLaMuLWmUDnySY104VSUsagh X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:4f0:b0:98e:6280:74ca with SMTP id w16-20020a05690204f000b0098e628074camr174263ybs.1.1676680184745; Fri, 17 Feb 2023 16:29:44 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:17 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-45-jthoughton@google.com> Subject: [PATCH v2 44/46] selftests/mm: add anon and shared hugetlb to migration test From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Shared HugeTLB mappings are migrated best-effort. Sometimes, due to being unable to grab the VMA lock for writing, migration may just randomly fail. To allow for that, we allow retries. 
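Concretely, the retry is just a bounded loop around move_pages(): a failed attempt is retried rather than failing the test immediately. A rough, self-contained sketch of that pattern (hypothetical helper name; link with -lnuma):

#include <numaif.h>
#include <stdio.h>

/* Try to migrate one page to dst_node, tolerating transient failures.
 * Shared HugeTLB migration is best-effort, so a single move_pages()
 * failure is retried instead of being treated as fatal. */
static int migrate_one_page(void *ptr, int dst_node, int retries)
{
	int status = 0;
	long ret = -1;

	while (retries-- > 0) {
		ret = move_pages(0 /* current process */, 1, &ptr, &dst_node,
				 &status, MPOL_MF_MOVE_ALL);
		if (!ret)
			return 0;
	}
	fprintf(stderr, "migration failed: ret=%ld status=%d\n", ret, status);
	return -1;
}

The selftest itself keeps its existing ping-pong loop between the two NUMA nodes and only adds the retry counter; the shared hugetlb case gets a larger retry budget than the anonymous cases.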
Signed-off-by: James Houghton diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftes= ts/mm/migration.c index 1cec8425e3ca..21577a84d7e4 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -13,6 +13,7 @@ #include #include #include +#include =20 #define TWOMEG (2<<20) #define RUNTIME (60) @@ -59,11 +60,12 @@ FIXTURE_TEARDOWN(migration) free(self->pids); } =20 -int migrate(uint64_t *ptr, int n1, int n2) +int migrate(uint64_t *ptr, int n1, int n2, int retries) { int ret, tmp; int status =3D 0; struct timespec ts1, ts2; + int failed =3D 0; =20 if (clock_gettime(CLOCK_MONOTONIC, &ts1)) return -1; @@ -78,6 +80,9 @@ int migrate(uint64_t *ptr, int n1, int n2) ret =3D move_pages(0, 1, (void **) &ptr, &n2, &status, MPOL_MF_MOVE_ALL); if (ret) { + if (++failed < retries) + continue; + if (ret > 0) printf("Didn't migrate %d pages\n", ret); else @@ -88,6 +93,7 @@ int migrate(uint64_t *ptr, int n1, int n2) tmp =3D n2; n2 =3D n1; n1 =3D tmp; + failed =3D 0; } =20 return 0; @@ -128,7 +134,7 @@ TEST_F_TIMEOUT(migration, private_anon, 2*RUNTIME) if (pthread_create(&self->threads[i], NULL, access_mem, ptr)) perror("Couldn't create thread"); =20 - ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0); + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0); for (i =3D 0; i < self->nthreads - 1; i++) ASSERT_EQ(pthread_cancel(self->threads[i]), 0); } @@ -158,7 +164,7 @@ TEST_F_TIMEOUT(migration, shared_anon, 2*RUNTIME) self->pids[i] =3D pid; } =20 - ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0); + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0); for (i =3D 0; i < self->nthreads - 1; i++) ASSERT_EQ(kill(self->pids[i], SIGTERM), 0); } @@ -185,9 +191,78 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME) if (pthread_create(&self->threads[i], NULL, access_mem, ptr)) perror("Couldn't create thread"); =20 - ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0); + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0); + for (i =3D 0; i < self->nthreads - 1; i++) + ASSERT_EQ(pthread_cancel(self->threads[i]), 0); +} + +/* + * Tests the anon hugetlb migration entry paths. + */ +TEST_F_TIMEOUT(migration, private_anon_hugetlb, 2*RUNTIME) +{ + uint64_t *ptr; + int i; + + if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0) + SKIP(return, "Not enough threads or NUMA nodes available"); + + ptr =3D mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); + if (ptr =3D=3D MAP_FAILED) + SKIP(return, "Could not allocate hugetlb pages"); + + memset(ptr, 0xde, TWOMEG); + for (i =3D 0; i < self->nthreads - 1; i++) + if (pthread_create(&self->threads[i], NULL, access_mem, ptr)) + perror("Couldn't create thread"); + + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0); for (i =3D 0; i < self->nthreads - 1; i++) ASSERT_EQ(pthread_cancel(self->threads[i]), 0); } =20 +/* + * Tests the shared hugetlb migration entry paths. 
+ */ +TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME) +{ + uint64_t *ptr; + int i; + int fd; + unsigned long sz; + struct statfs filestat; + + if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0) + SKIP(return, "Not enough threads or NUMA nodes available"); + + fd =3D memfd_create("tmp_hugetlb", MFD_HUGETLB); + if (fd < 0) + SKIP(return, "Couldn't create hugetlb memfd"); + + if (fstatfs(fd, &filestat) < 0) + SKIP(return, "Couldn't fstatfs hugetlb file"); + + sz =3D filestat.f_bsize; + + if (ftruncate(fd, sz)) + SKIP(return, "Couldn't allocate hugetlb pages"); + ptr =3D mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (ptr =3D=3D MAP_FAILED) + SKIP(return, "Could not map hugetlb pages"); + + memset(ptr, 0xde, sz); + for (i =3D 0; i < self->nthreads - 1; i++) + if (pthread_create(&self->threads[i], NULL, access_mem, ptr)) + perror("Couldn't create thread"); + + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0); + for (i =3D 0; i < self->nthreads - 1; i++) { + ASSERT_EQ(pthread_cancel(self->threads[i]), 0); + pthread_join(self->threads[i], NULL); + } + ftruncate(fd, 0); + close(fd); +} + TEST_HARNESS_MAIN --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8809BC636D6 for ; Sat, 18 Feb 2023 00:32:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230318AbjBRAc1 (ORCPT ); Fri, 17 Feb 2023 19:32:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43148 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230340AbjBRAcA (ORCPT ); Fri, 17 Feb 2023 19:32:00 -0500 Received: from mail-ua1-x94a.google.com (mail-ua1-x94a.google.com [IPv6:2607:f8b0:4864:20::94a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6268A6E677 for ; Fri, 17 Feb 2023 16:30:28 -0800 (PST) Received: by mail-ua1-x94a.google.com with SMTP id x2-20020ab03802000000b0060d5bfd73b5so940115uav.16 for ; Fri, 17 Feb 2023 16:30:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=IfKq0bBaVq4CjPIDar2fHBWF2wKUUWJpsDr5c4wwSS8=; b=FNS8OXzQNbEkpLDpjB3nAEwYdiqjzDyLdWfy2qMCsFYgDJSKBhPFK3yDxyIWpeySgl maQUSikgn8RHfHnveHKP4KewomfuBHkrzkOdNxKAEwOz4ERjMdSgdS+IkmHhDevj0Dlo 8SYMrUU5w0fXhm10675DXBEz1qa82TExwLIbs6xG0LABcq+O89FT45Z9grk1Qq9uWQnL VEl/w45AklX6i6AYyOmin2NKbHJDvd0cIf5fk1zu+qjIn8o5y4g9O4lMFPFSQVgZ0QZ9 4eExXJ6WU1Q2oLrhiuNiGB+aLlTM+fJuXr+noURddq0gbHVbzoVDxDtmh6wk9zHk9oKN 9jig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=IfKq0bBaVq4CjPIDar2fHBWF2wKUUWJpsDr5c4wwSS8=; b=nCHjl62P6gsOcOxCvrHUS+VwgswZKQgJ6ToSISvaiipVQfhzbRl9htdiUC429QSy+5 akicXhbT5kpmJOYn+Yy9Eq99aso618WeMF63JggG5+PuU8lHXc1SHm6JX86bbPhGGH1A 0EVuK8LvE514eKFjeG4y2rCG6q4pK4xyq1884JZuq0M4okr1S7OLkM6+8E3l0K32bB9o IApRjc43eXVdRdmhT13p1YVOhb+iK+ZS9aG6TFlGoibGUi/BuLjvZx4gN1vgwm4PSFkQ 0Ouuf/DZ2n/1duFbGn7MpbaYz3cniV9Qr7gx/3tRmkyXrUViWTKJNKanbfQ/Vj/pRvdz 6fYw== X-Gm-Message-State: AO0yUKUJb+Lwy3SWhRwZlGWhdZF9VAn0z70ML3OmjRYawFuoOGTlDtfw APBSdSXDeZbIFEQe8T5qbsz17zsT0KALwpno 
X-Google-Smtp-Source: AK7set+/8h4Ti6SaOCrSw2JWCwa+A4XjEDrEdPurb5cORBytv6OxLdCN0/KOEI5ep27FSzg7p5WKdfqWfRYxseKZ X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:ab0:53d2:0:b0:68b:923a:d6f4 with SMTP id l18-20020ab053d2000000b0068b923ad6f4mr47364uaa.2.1676680186208; Fri, 17 Feb 2023 16:29:46 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:18 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-46-jthoughton@google.com> Subject: [PATCH v2 45/46] selftests/mm: add hugetlb HGM test to migration selftest From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This is mostly the same as the shared HugeTLB case, but instead of mapping the page with a regular page fault, we map it with lots of UFFDIO_CONTINUE operations. We also verify that the contents haven't changed after the migration, which would be the case if the post-migration PTEs pointed to the wrong page. Signed-off-by: James Houghton diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftes= ts/mm/migration.c index 21577a84d7e4..1fb3607accab 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -14,12 +14,21 @@ #include #include #include +#include +#include +#include +#include +#include =20 #define TWOMEG (2<<20) #define RUNTIME (60) =20 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) =20 +#ifndef MADV_SPLIT +#define MADV_SPLIT 26 +#endif + FIXTURE(migration) { pthread_t *threads; @@ -265,4 +274,141 @@ TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME) close(fd); } =20 +#ifdef __NR_userfaultfd +static int map_at_high_granularity(char *mem, size_t length) +{ + int i; + int ret; + int uffd =3D syscall(__NR_userfaultfd, 0); + struct uffdio_api api; + struct uffdio_register reg; + int pagesize =3D getpagesize(); + + if (uffd < 0) { + perror("couldn't create uffd"); + return uffd; + } + + api.api =3D UFFD_API; + api.features =3D 0; + + ret =3D ioctl(uffd, UFFDIO_API, &api); + if (ret || api.api !=3D UFFD_API) { + perror("UFFDIO_API failed"); + goto out; + } + + if (madvise(mem, length, MADV_SPLIT) =3D=3D -1) { + perror("MADV_SPLIT failed"); + goto out; + } + + reg.range.start =3D (unsigned long)mem; + reg.range.len =3D length; + + reg.mode =3D UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_MINOR; + + ret =3D ioctl(uffd, UFFDIO_REGISTER, ®); + if (ret) { + perror("UFFDIO_REGISTER failed"); + goto out; + } + + /* UFFDIO_CONTINUE each 4K segment of the 2M page. 
*/ + for (i =3D 0; i < length/pagesize; ++i) { + struct uffdio_continue cont; + + cont.range.start =3D (unsigned long long)mem + i * pagesize; + cont.range.len =3D pagesize; + cont.mode =3D 0; + ret =3D ioctl(uffd, UFFDIO_CONTINUE, &cont); + if (ret) { + fprintf(stderr, "UFFDIO_CONTINUE failed " + "for %llx -> %llx: %d\n", + cont.range.start, + cont.range.start + cont.range.len, + errno); + goto out; + } + } + ret =3D 0; +out: + close(uffd); + return ret; +} +#else +static int map_at_high_granularity(char *mem, size_t length) +{ + fprintf(stderr, "Userfaultfd missing\n"); + return -1; +} +#endif /* __NR_userfaultfd */ + +/* + * Tests the high-granularity hugetlb migration entry paths. + */ +TEST_F_TIMEOUT(migration, shared_hugetlb_hgm, 2*RUNTIME) +{ + uint64_t *ptr; + int i; + int fd; + unsigned long sz; + struct statfs filestat; + + if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0) + SKIP(return, "Not enough threads or NUMA nodes available"); + + fd =3D memfd_create("tmp_hugetlb", MFD_HUGETLB); + if (fd < 0) + SKIP(return, "Couldn't create hugetlb memfd"); + + if (fstatfs(fd, &filestat) < 0) + SKIP(return, "Couldn't fstatfs hugetlb file"); + + sz =3D filestat.f_bsize; + + if (ftruncate(fd, sz)) + SKIP(return, "Couldn't allocate hugetlb pages"); + + if (fallocate(fd, 0, 0, sz) < 0) { + perror("fallocate failed"); + SKIP(return, "fallocate failed"); + } + + ptr =3D mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (ptr =3D=3D MAP_FAILED) + SKIP(return, "Could not allocate hugetlb pages"); + + /* + * We have to map_at_high_granularity before we memset, otherwise + * memset will map everything at the hugepage size. + */ + if (map_at_high_granularity((char *)ptr, sz) < 0) + SKIP(return, "Could not map HugeTLB range at high granularity"); + + /* Populate the page we're migrating. */ + for (i =3D 0; i < sz/sizeof(*ptr); ++i) + ptr[i] =3D i; + + for (i =3D 0; i < self->nthreads - 1; i++) + if (pthread_create(&self->threads[i], NULL, access_mem, ptr)) + perror("Couldn't create thread"); + + ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0); + for (i =3D 0; i < self->nthreads - 1; i++) { + ASSERT_EQ(pthread_cancel(self->threads[i]), 0); + pthread_join(self->threads[i], NULL); + } + + /* Check that the contents didn't change. 
*/ + for (i =3D 0; i < sz/sizeof(*ptr); ++i) { + ASSERT_EQ(ptr[i], i); + if (ptr[i] !=3D i) + break; + } + + ftruncate(fd, 0); + close(fd); +} + TEST_HARNESS_MAIN --=20 2.39.2.637.g21b0678d19-goog From nobody Thu Sep 11 14:00:51 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CED6DC6379F for ; Sat, 18 Feb 2023 00:32:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230055AbjBRAc3 (ORCPT ); Fri, 17 Feb 2023 19:32:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230095AbjBRAcB (ORCPT ); Fri, 17 Feb 2023 19:32:01 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 99E706EBB9 for ; Fri, 17 Feb 2023 16:30:29 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id e83-20020a25e756000000b0086349255277so2438159ybh.8 for ; Fri, 17 Feb 2023 16:30:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=nWGrggQ/qDTMgdBpDhPDYEEJzTc7x7HUNebehORCAiA=; b=Ycfh2IW+hDbVlFvLyA3lntlwjCloreR3r/qvCLXWGq6CNK6K2k2C8p8Spq55zPyhuq AAzGAWCFdAKep93lG+yL4ZIn4/dB4F5SfgWiuIci0gRz5Jc/WSpHykVxXgtPRca51cTh Kjt4vcYHHiUBWKNDbEdNkw5LmtxPtJAs961xziLXEQQmcRRKC+XTvb8X3Mv5HuNd3g65 rbEnCtYwG9f2sqWhaLJitGD3eobDHYJm3Lp8idB4WHCamsENmNmOfrndZ99VziA1718v ZFwhg4eMsDrzkJAlatdDRFvDNHgbLy+EDTZnB6N9dZ7yt4JVcAMrYLFaHtXLj4rhKRWg o1Pg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=nWGrggQ/qDTMgdBpDhPDYEEJzTc7x7HUNebehORCAiA=; b=xW7/t2TW9JqIWsGkrDP8r8AazBmsFE4vvffdIaA4l29Q95s1Bkt49qNYMcJaAlK7oi 9Dpbl9eoi69eLCLe8ETNyCx4KEFOVah6HQgOsTsp1CsDwA08DayDX3WjCwvs1nqvDCSW XxaaGoeYvfIWPScgQ3YV6eVx3Gihj5m+nFFkIn8gbqyj8dZaIjcF3ArVyf+B+iLWhtEL 3yMUMlhfc+0tzimQkJHC4+n82M5RsxvpK3WFraZXK6VtqaJ1BWtecD0302PRvyi4oOg0 e5TvIDjNrX4OMlkL56/MAp4+hls/cewx3LEExUuOd4s82gO8Ol3NRrD7ksJq2ePnzUS7 9lxA== X-Gm-Message-State: AO0yUKWX15o8wkC+wqIuhGvnw3DTwyGd1W0ysXdK79AIGNwDfVPTW2CC RRvWxlCQoU9AgaMIsg2RIYWSfeQy8FBUoWZa X-Google-Smtp-Source: AK7set8/WW6OVTCGwPxg+uQR5Lu/p2b0h5x6j4EcxX6yyi99urCYtaxSW3TCY7nRxHf6oG5i6KVmizuXRuCPg5hr X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a5b:5c3:0:b0:8ed:262d:defe with SMTP id w3-20020a5b05c3000000b008ed262ddefemr185750ybp.0.1676680187212; Fri, 17 Feb 2023 16:29:47 -0800 (PST) Date: Sat, 18 Feb 2023 00:28:19 +0000 In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com> Mime-Version: 1.0 References: <20230218002819.1486479-1-jthoughton@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230218002819.1486479-47-jthoughton@google.com> Subject: [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests From: James Houghton To: Mike Kravetz , Muchun Song , Peter Xu , Andrew Morton Cc: David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . 
David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Frank van der Linden , Jiaqi Yan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Test that high-granularity CONTINUEs at all sizes work (exercising contiguous PTE sizes for arm64, when support is added). Also test that collapse works and hwpoison works correctly (although we aren't yet testing high-granularity poison). This test uses UFFD_FEATURE_EVENT_FORK + UFFD_REGISTER_MODE_WP to force the kernel to copy page tables on fork(), exercising the changes to copy_hugetlb_page_range(). Also test that UFFDIO_WRITEPROTECT doesn't prevent UFFDIO_CONTINUE from behaving properly (in other words, that HGM walks treat UFFD-WP markers like blank PTEs in the appropriate cases). We also test that the uffd-wp PTE markers are preserved properly. Signed-off-by: James Houghton create mode 100644 tools/testing/selftests/mm/hugetlb-hgm.c diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/= mm/Makefile index d90cdc06aa59..920baccccb9e 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -36,6 +36,7 @@ TEST_GEN_FILES +=3D compaction_test TEST_GEN_FILES +=3D gup_test TEST_GEN_FILES +=3D hmm-tests TEST_GEN_FILES +=3D hugetlb-madvise +TEST_GEN_FILES +=3D hugetlb-hgm TEST_GEN_FILES +=3D hugepage-mmap TEST_GEN_FILES +=3D hugepage-mremap TEST_GEN_FILES +=3D hugepage-shm diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selft= ests/mm/hugetlb-hgm.c new file mode 100644 index 000000000000..4c27a6a11818 --- /dev/null +++ b/tools/testing/selftests/mm/hugetlb-hgm.c @@ -0,0 +1,608 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test uncommon cases in HugeTLB high-granularity mapping: + * 1. Test all supported high-granularity page sizes (with MADV_COLLAPSE). + * 2. Test MADV_HWPOISON behavior. + * 3. Test interaction with UFFDIO_WRITEPROTECT. + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define PAGE_SIZE 4096 +#define PAGE_MASK ~(PAGE_SIZE - 1) + +#ifndef MADV_COLLAPSE +#define MADV_COLLAPSE 25 +#endif + +#ifndef MADV_SPLIT +#define MADV_SPLIT 26 +#endif + +#define PREFIX " ... " +#define ERROR_PREFIX " !!! 
" + +static void *sigbus_addr; +bool was_mceerr; +bool got_sigbus; +bool expecting_sigbus; + +enum test_status { + TEST_PASSED =3D 0, + TEST_FAILED =3D 1, + TEST_SKIPPED =3D 2, +}; + +static char *status_to_str(enum test_status status) +{ + switch (status) { + case TEST_PASSED: + return "TEST_PASSED"; + case TEST_FAILED: + return "TEST_FAILED"; + case TEST_SKIPPED: + return "TEST_SKIPPED"; + default: + return "TEST_???"; + } +} + +static int userfaultfd(int flags) +{ + return syscall(__NR_userfaultfd, flags); +} + +static int map_range(int uffd, char *addr, uint64_t length) +{ + struct uffdio_continue cont =3D { + .range =3D (struct uffdio_range) { + .start =3D (uint64_t)addr, + .len =3D length, + }, + .mode =3D 0, + .mapped =3D 0, + }; + + if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0) { + perror(ERROR_PREFIX "UFFDIO_CONTINUE failed"); + return -1; + } + return 0; +} + +static int userfaultfd_writeprotect(int uffd, char *addr, uint64_t length, + bool protect) +{ + struct uffdio_writeprotect wp =3D { + .range =3D (struct uffdio_range) { + .start =3D (uint64_t)addr, + .len =3D length, + }, + .mode =3D UFFDIO_WRITEPROTECT_MODE_DONTWAKE, + }; + + if (protect) + wp.mode =3D UFFDIO_WRITEPROTECT_MODE_WP; + + printf(PREFIX "UFFDIO_WRITEPROTECT: %p -> %p (%sprotected)\n", addr, + addr + length, protect ? "" : "un"); + + if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0) { + perror(ERROR_PREFIX "UFFDIO_WRITEPROTECT failed"); + return -1; + } + return 0; +} + +static int check_equal(char *mapping, size_t length, char value) +{ + size_t i; + + for (i =3D 0; i < length; ++i) + if (mapping[i] !=3D value) { + printf(ERROR_PREFIX "mismatch at %p (%d !=3D %d)\n", + &mapping[i], mapping[i], value); + return -1; + } + + return 0; +} + +static int test_continues(int uffd, char *primary_map, char *secondary_map, + size_t len, bool verify) +{ + size_t offset =3D 0; + unsigned char iter =3D 0; + unsigned long pagesize =3D getpagesize(); + uint64_t size; + + for (size =3D len/2; size >=3D pagesize; + offset +=3D size, size /=3D 2) { + iter++; + memset(secondary_map + offset, iter, size); + printf(PREFIX "UFFDIO_CONTINUE: %p -> %p =3D %d%s\n", + primary_map + offset, + primary_map + offset + size, + iter, + verify ? 
" (and verify)" : ""); + if (map_range(uffd, primary_map + offset, size)) + return -1; + if (verify && check_equal(primary_map + offset, size, iter)) + return -1; + } + return 0; +} + +static int verify_contents(char *map, size_t len, bool last_page_zero) +{ + size_t offset =3D 0; + int i =3D 0; + uint64_t size; + + for (size =3D len/2; size > PAGE_SIZE; offset +=3D size, size /=3D 2) + if (check_equal(map + offset, size, ++i)) + return -1; + + if (last_page_zero) + if (check_equal(map + len - PAGE_SIZE, PAGE_SIZE, 0)) + return -1; + + return 0; +} + +static int test_collapse(char *primary_map, size_t len, bool verify) +{ + int ret =3D 0; + + printf(PREFIX "collapsing %p -> %p\n", primary_map, primary_map + len); + if (madvise(primary_map, len, MADV_COLLAPSE) < 0) { + perror(ERROR_PREFIX "collapse failed"); + return -1; + } + + if (verify) { + printf(PREFIX "verifying %p -> %p\n", primary_map, + primary_map + len); + ret =3D verify_contents(primary_map, len, true); + } + return ret; +} + +static void sigbus_handler(int signo, siginfo_t *info, void *context) +{ + if (!expecting_sigbus) + printf(ERROR_PREFIX "unexpected sigbus: %p\n", info->si_addr); + + got_sigbus =3D true; + was_mceerr =3D info->si_code =3D=3D BUS_MCEERR_AR; + sigbus_addr =3D info->si_addr; + + pthread_exit(NULL); +} + +static void *access_mem(void *addr) +{ + volatile char *ptr =3D addr; + + /* + * Do a write without changing memory contents, as other routines will + * need to verify that mapping contents haven't changed. + * + * We do a write so that we trigger uffd-wp SIGBUSes. To test that we + * get HWPOISON SIGBUSes, we would only need to read. + */ + *ptr =3D *ptr; + return NULL; +} + +static int test_sigbus(char *addr, bool poison) +{ + int ret; + pthread_t pthread; + + sigbus_addr =3D (void *)0xBADBADBAD; + was_mceerr =3D false; + got_sigbus =3D false; + expecting_sigbus =3D true; + ret =3D pthread_create(&pthread, NULL, &access_mem, addr); + if (ret) { + printf(ERROR_PREFIX "failed to create thread: %s\n", + strerror(ret)); + goto out; + } + + pthread_join(pthread, NULL); + + ret =3D -1; + if (!got_sigbus) + printf(ERROR_PREFIX "didn't get a SIGBUS: %p\n", addr); + else if (sigbus_addr !=3D addr) + printf(ERROR_PREFIX "got incorrect sigbus address: %p vs %p\n", + sigbus_addr, addr); + else if (poison && !was_mceerr) + printf(ERROR_PREFIX "didn't get an MCEERR?\n"); + else + ret =3D 0; +out: + expecting_sigbus =3D false; + return ret; +} + +static void *read_from_uffd_thd(void *arg) +{ + int uffd =3D *(int *)arg; + struct uffd_msg msg; + /* opened without O_NONBLOCK */ + if (read(uffd, &msg, sizeof(msg)) !=3D sizeof(msg)) + printf(ERROR_PREFIX "reading uffd failed\n"); + + return NULL; +} + +static int read_event_from_uffd(int *uffd, pthread_t *pthread) +{ + int ret =3D 0; + + ret =3D pthread_create(pthread, NULL, &read_from_uffd_thd, (void *)uffd); + if (ret) { + printf(ERROR_PREFIX "failed to create thread: %s\n", + strerror(ret)); + return ret; + } + return 0; +} + +static int test_sigbus_range(char *primary_map, size_t len, bool hwpoison) +{ + const unsigned long pagesize =3D getpagesize(); + const int num_checks =3D 512; + unsigned long bytes_per_check =3D len/num_checks; + int i; + + printf(PREFIX "checking that we can't access " + "(%d addresses within %p -> %p)\n", + num_checks, primary_map, primary_map + len); + + if (pagesize > bytes_per_check) + bytes_per_check =3D pagesize; + + for (i =3D 0; i < len; i +=3D bytes_per_check) + if (test_sigbus(primary_map + i, hwpoison) < 0) + return 1; + /* check very 
last byte, because we left it unmapped */ + if (test_sigbus(primary_map + len - 1, hwpoison)) + return 1; + + return 0; +} + +static enum test_status test_hwpoison(char *primary_map, size_t len) +{ + printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len); + if (madvise(primary_map, len, MADV_HWPOISON) < 0) { + perror(ERROR_PREFIX "MADV_HWPOISON failed"); + return TEST_SKIPPED; + } + + return test_sigbus_range(primary_map, len, true) + ? TEST_FAILED : TEST_PASSED; +} + +static int test_fork(int uffd, char *primary_map, size_t len) +{ + int status; + int ret =3D 0; + pid_t pid; + pthread_t uffd_thd; + + /* + * UFFD_FEATURE_EVENT_FORK will put fork event on the userfaultfd, + * which we must read, otherwise we block fork(). Setup a thread to + * read that event now. + * + * Page fault events should result in a SIGBUS, so we expect only a + * single event from the uffd (the fork event). + */ + if (read_event_from_uffd(&uffd, &uffd_thd)) + return -1; + + pid =3D fork(); + + if (!pid) { + /* + * Because we have UFFDIO_REGISTER_MODE_WP and + * UFFD_FEATURE_EVENT_FORK, the page tables should be copied + * exactly. + * + * Check that everything except that last 4K has correct + * contents, and then check that the last 4K gets a SIGBUS. + */ + printf(PREFIX "child validating...\n"); + ret =3D verify_contents(primary_map, len, false) || + test_sigbus(primary_map + len - 1, false); + ret =3D 0; + exit(ret ? 1 : 0); + } else { + /* wait for the child to finish. */ + waitpid(pid, &status, 0); + ret =3D WEXITSTATUS(status); + if (!ret) { + printf(PREFIX "parent validating...\n"); + /* Same check as the child. */ + ret =3D verify_contents(primary_map, len, false) || + test_sigbus(primary_map + len - 1, false); + ret =3D 0; + } + } + + pthread_join(uffd_thd, NULL); + return ret; + +} + +static int uffd_register(int uffd, char *primary_map, unsigned long len, + int mode) +{ + struct uffdio_register reg; + + reg.range.start =3D (unsigned long)primary_map; + reg.range.len =3D len; + reg.mode =3D mode; + + reg.ioctls =3D 0; + return ioctl(uffd, UFFDIO_REGISTER, ®); +} + +enum test_type { + TEST_DEFAULT, + TEST_UFFDWP, + TEST_HWPOISON +}; + +static enum test_status +test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type) +{ + int uffd; + char *primary_map, *secondary_map; + struct uffdio_api api; + struct sigaction new, old; + enum test_status status =3D TEST_SKIPPED; + bool hwpoison =3D type =3D=3D TEST_HWPOISON; + bool uffd_wp =3D type =3D=3D TEST_UFFDWP; + bool verify =3D type =3D=3D TEST_DEFAULT; + int register_args; + + if (ftruncate(fd, len) < 0) { + perror(ERROR_PREFIX "ftruncate failed"); + return status; + } + + uffd =3D userfaultfd(O_CLOEXEC); + if (uffd < 0) { + perror(ERROR_PREFIX "uffd not created"); + return status; + } + + primary_map =3D mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0= ); + if (primary_map =3D=3D MAP_FAILED) { + perror(ERROR_PREFIX "mmap for primary mapping failed"); + goto close_uffd; + } + secondary_map =3D mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd,= 0); + if (secondary_map =3D=3D MAP_FAILED) { + perror(ERROR_PREFIX "mmap for secondary mapping failed"); + goto unmap_primary; + } + + printf(PREFIX "primary mapping: %p\n", primary_map); + printf(PREFIX "secondary mapping: %p\n", secondary_map); + + api.api =3D UFFD_API; + api.features =3D UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS | + UFFD_FEATURE_EVENT_FORK; + if (ioctl(uffd, UFFDIO_API, &api) =3D=3D -1) { + perror(ERROR_PREFIX "UFFDIO_API failed"); + goto out; + } + + 
if (madvise(primary_map, len, MADV_SPLIT)) { + perror(ERROR_PREFIX "MADV_SPLIT failed"); + goto out; + } + + /* + * Register with UFFDIO_REGISTER_MODE_WP to force fork() to copy page + * tables (also need UFFD_FEATURE_EVENT_FORK, which we have). + */ + register_args =3D UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP; + if (!uffd_wp) + /* + * If we're testing UFFDIO_WRITEPROTECT, then we don't want + * minor faults. With minor faults enabled, we'll get SIGBUSes + * for any minor fault, whereas without minor faults enabled, + * writes will verify that uffd-wp PTE markers were installed + * properly. + */ + register_args |=3D UFFDIO_REGISTER_MODE_MINOR; + + if (uffd_register(uffd, primary_map, len, register_args)) { + perror(ERROR_PREFIX "UFFDIO_REGISTER failed"); + goto out; + } + + + new.sa_sigaction =3D &sigbus_handler; + new.sa_flags =3D SA_SIGINFO; + if (sigaction(SIGBUS, &new, &old) < 0) { + perror(ERROR_PREFIX "could not setup SIGBUS handler"); + goto out; + } + + status =3D TEST_FAILED; + + if (uffd_wp) { + /* + * Install uffd-wp PTE markers now. They should be preserved + * as we split the mappings with UFFDIO_CONTINUE later. + */ + if (userfaultfd_writeprotect(uffd, primary_map, len, true)) + goto done; + /* Verify that we really are write-protected. */ + if (test_sigbus(primary_map, false)) + goto done; + } + + /* + * Main piece of the test: map primary_map at all the possible + * page sizes. Starting at the hugepage size and going down to + * PAGE_SIZE. This leaves the final PAGE_SIZE piece of the mapping + * unmapped. + */ + if (test_continues(uffd, primary_map, secondary_map, len, verify)) + goto done; + + /* + * Verify that MADV_HWPOISON is able to properly poison the entire + * mapping. + */ + if (hwpoison) { + enum test_status new_status =3D test_hwpoison(primary_map, len); + + if (new_status !=3D TEST_PASSED) { + status =3D new_status; + goto done; + } + } + + if (uffd_wp) { + /* + * Check that the uffd-wp marker we installed initially still + * exists in the unmapped 4K piece at the end of the mapping. + * + * test_sigbus() will do a write. When this happens: + * 1. The page fault handler will find the uffd-wp marker and + * create a read-only PTE. + * 2. The memory access is retried, and the page fault handler + * will find that a write was attempted in a UFFD_WP VMA + * where a RO mapping exists, so SIGBUS + * (we have UFFD_FEATURE_SIGBUS). + * + * We only check the final page because UFFDIO_CONTINUE will + * have cleared the write-protection on all the other pieces + * of the mapping. + */ + printf(PREFIX "verifying that we can't write to final page\n"); + if (test_sigbus(primary_map + len - 1, false)) + goto done; + } + + if (!hwpoison) + /* + * test_fork() will verify memory contents. We can't do + * that if memory has been poisoned. + */ + if (test_fork(uffd, primary_map, len)) + goto done; + + /* + * Check that MADV_COLLAPSE functions properly. That is: + * - the PAGE_SIZE hole we had is no longer unmapped. + * - poisoned regions are still poisoned. + * + * Verify the data is correct if we haven't poisoned. + */ + if (test_collapse(primary_map, len, !hwpoison)) + goto done; + /* + * Verify that memory is still poisoned. 
+ */ + if (hwpoison && test_sigbus_range(primary_map, len, true)) + goto done; + + status =3D TEST_PASSED; + +done: + if (ftruncate(fd, 0) < 0) { + perror(ERROR_PREFIX "ftruncate back to 0 failed"); + status =3D TEST_FAILED; + } + +out: + munmap(secondary_map, len); +unmap_primary: + munmap(primary_map, len); +close_uffd: + close(uffd); + return status; +} + +int main(void) +{ + int fd; + struct statfs file_stat; + size_t hugepagesize; + size_t len; + enum test_status status; + int ret =3D 0; + + fd =3D memfd_create("hugetlb_tmp", MFD_HUGETLB); + if (fd < 0) { + perror(ERROR_PREFIX "could not open hugetlbfs file"); + return -1; + } + + memset(&file_stat, 0, sizeof(file_stat)); + if (fstatfs(fd, &file_stat)) { + perror(ERROR_PREFIX "fstatfs failed"); + goto close; + } + if (file_stat.f_type !=3D HUGETLBFS_MAGIC) { + printf(ERROR_PREFIX "not hugetlbfs file\n"); + goto close; + } + + hugepagesize =3D file_stat.f_bsize; + len =3D 2 * hugepagesize; + + printf("HGM regular test...\n"); + status =3D test_hgm(fd, hugepagesize, len, TEST_DEFAULT); + printf("HGM regular test: %s\n", status_to_str(status)); + if (status =3D=3D TEST_FAILED) + ret =3D -1; + + printf("HGM uffd-wp test...\n"); + status =3D test_hgm(fd, hugepagesize, len, TEST_UFFDWP); + printf("HGM uffd-wp test: %s\n", status_to_str(status)); + if (status =3D=3D TEST_FAILED) + ret =3D -1; + + printf("HGM hwpoison test...\n"); + status =3D test_hgm(fd, hugepagesize, len, TEST_HWPOISON); + printf("HGM hwpoison test: %s\n", status_to_str(status)); + if (status =3D=3D TEST_FAILED) + ret =3D -1; +close: + close(fd); + + return ret; +} --=20 2.39.2.637.g21b0678d19-goog