Date: Wed, 14 May 2025 16:42:14 -0700
Message-ID: <2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com>
Subject: [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for splitting and merging pages
From: Ackerley Tng
To: kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 x86@kernel.org, linux-fsdevel@vger.kernel.org
Cc: ackerleytng@google.com, aik@amd.com, ajones@ventanamicro.com,
 akpm@linux-foundation.org, amoorthy@google.com, anthony.yznaga@oracle.com,
 anup@brainfault.org, aou@eecs.berkeley.edu, bfoster@redhat.com,
 binbin.wu@linux.intel.com, brauner@kernel.org, catalin.marinas@arm.com,
 chao.p.peng@intel.com, chenhuacai@kernel.org, dave.hansen@intel.com,
 david@redhat.com, dmatlack@google.com, dwmw@amazon.co.uk,
 erdemaktas@google.com, fan.du@intel.com, fvdl@google.com, graf@amazon.com,
 haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
 ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
 james.morse@arm.com, jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com,
 jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
 jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
 kent.overstreet@linux.dev, kirill.shutemov@intel.com,
 liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
 mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
 michael.roth@amd.com, mpe@ellerman.id.au, muchun.song@linux.dev,
 nikunj@amd.com, nsaenz@amazon.es, oliver.upton@linux.dev,
 palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com,
 pbonzini@redhat.com, pdurrant@amazon.co.uk, peterx@redhat.com,
 pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
 quic_tsoni@quicinc.com, richard.weiyang@gmail.com,
 rick.p.edgecombe@intel.com, rientjes@google.com, roypat@amazon.co.uk,
 rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
 steven.sistare@oracle.com, suzuki.poulose@arm.com, tabba@google.com,
 thomas.lendacky@amd.com, usama.arif@bytedance.com, vannapurve@google.com,
 vbabka@suse.cz, viro@zeniv.linux.org.uk, vkuznets@redhat.com,
 wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
 xiaoyao.li@intel.com, yan.y.zhao@intel.com, yilun.xu@intel.com,
 yuzenghui@huawei.com, zhiquan1.li@intel.com

These functions allow guest_memfd to split and merge HugeTLB pages, and
clean them up on freeing the page.

For merging and splitting pages on conversion, guestmem_hugetlb expects the
refcount on the pages to already be 0. The caller must ensure that.

For conversions, guest_memfd ensures that the refcounts are already 0 by
checking that there are no unexpected refcounts, and then freezing the
expected refcounts away. On unexpected refcounts, guest_memfd will return
an error to userspace.

For truncation, on unexpected refcounts, guest_memfd will return an error
to userspace.

For truncation on closing, guest_memfd will just remove its own refcounts
(the filemap refcounts) and mark split pages with PGTY_guestmem_hugetlb.

The presence of PGTY_guestmem_hugetlb will trigger the folio_put() callback
to handle further cleanup. This cleanup process will merge pages (with
refcount 0, since cleanup is triggered from folio_put()) before returning
the pages to HugeTLB.

Because merging takes a long time and folio_put() can be called from atomic
context, the merge is deferred to a worker thread.

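For illustration only (not part of the patch): a minimal userspace sketch of
the deferral described above, assuming C11 atomics in place of the kernel's
llist and workqueue. Every name in it (struct node, list_push(),
defer_cleanup(), worker_drain(), nr_work_queued) is a hypothetical stand-in.
It shows why the worker only needs to be queued when the list goes from empty
to non-empty: a single drain picks up everything pushed in the meantime.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; this is not the kernel's llist API. */
struct node {
	struct node *next;
	int id;
};

struct list_head_model {
	_Atomic(struct node *) first;
};

/* Push a node; return true if the list was empty before the push. */
static bool list_push(struct list_head_model *head, struct node *n)
{
	struct node *old = atomic_load(&head->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&head->first, &old, n));

	return old == NULL;
}

/* Detach the whole list in one step, as the worker would. */
static struct node *list_take_all(struct list_head_model *head)
{
	return atomic_exchange(&head->first, NULL);
}

/* Counts how often the (hypothetical) worker was queued. */
static int nr_work_queued;

/* Model of the deferral: no sleeping, no locks, just a push. */
static void defer_cleanup(struct list_head_model *head, struct node *n)
{
	if (list_push(head, n))
		nr_work_queued++;	/* stands in for queue_work() */
}

/* Model of the worker: drains everything queued so far. */
static int worker_drain(struct list_head_model *head)
{
	struct node *n = list_take_all(head);
	int drained = 0;

	while (n) {
		struct node *next = n->next;

		n->next = NULL;
		drained++;
		n = next;
	}

	return drained;
}
```

In this model, two pushes before a drain queue the worker once, and a push
after the drain queues it again.
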
Change-Id: Ib04a3236f1e7250fd9af827630c334d40fb09d40
Signed-off-by: Ackerley Tng
Co-developed-by: Vishal Annapurve
Signed-off-by: Vishal Annapurve
---
 include/linux/guestmem.h |   3 +
 mm/guestmem_hugetlb.c    | 349 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 347 insertions(+), 5 deletions(-)

diff --git a/include/linux/guestmem.h b/include/linux/guestmem.h
index 4b2d820274d9..3ee816d1dd34 100644
--- a/include/linux/guestmem.h
+++ b/include/linux/guestmem.h
@@ -8,6 +8,9 @@ struct guestmem_allocator_operations {
 	void *(*inode_setup)(size_t size, u64 flags);
 	void (*inode_teardown)(void *private, size_t inode_size);
 	struct folio *(*alloc_folio)(void *private);
+	int (*split_folio)(struct folio *folio);
+	void (*merge_folio)(struct folio *folio);
+	void (*free_folio)(struct folio *folio);
 	/*
 	 * Returns the number of PAGE_SIZE pages in a page that this guestmem
 	 * allocator provides.
diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
index ec5a188ca2a7..8727598cf18e 100644
--- a/mm/guestmem_hugetlb.c
+++ b/mm/guestmem_hugetlb.c
@@ -11,15 +11,12 @@
 #include
 #include
 #include
+#include
 
 #include
 
 #include "guestmem_hugetlb.h"
-
-void guestmem_hugetlb_handle_folio_put(struct folio *folio)
-{
-	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
-}
+#include "hugetlb_vmemmap.h"
 
 struct guestmem_hugetlb_private {
 	struct hstate *h;
@@ -34,6 +31,339 @@ static size_t guestmem_hugetlb_nr_pages_in_folio(void *priv)
 	return pages_per_huge_page(private->h);
 }
 
+static DEFINE_XARRAY(guestmem_hugetlb_stash);
+
+struct guestmem_hugetlb_metadata {
+	void *_hugetlb_subpool;
+	void *_hugetlb_cgroup;
+	void *_hugetlb_hwpoison;
+	void *private;
+};
+
+struct guestmem_hugetlb_stash_item {
+	struct guestmem_hugetlb_metadata hugetlb_metadata;
+	/* hstate tracks the original size of this folio. */
+	struct hstate *h;
+	/* Count of split pages, individually freed, waiting to be merged. */
+	atomic_t nr_pages_waiting_to_be_merged;
+};
+
+struct workqueue_struct *guestmem_hugetlb_wq __ro_after_init;
+static struct work_struct guestmem_hugetlb_cleanup_work;
+static LLIST_HEAD(guestmem_hugetlb_cleanup_list);
+
+static inline void guestmem_hugetlb_register_folio_put_callback(struct folio *folio)
+{
+	__folio_set_guestmem_hugetlb(folio);
+}
+
+static inline void guestmem_hugetlb_unregister_folio_put_callback(struct folio *folio)
+{
+	__folio_clear_guestmem_hugetlb(folio);
+}
+
+static inline void guestmem_hugetlb_defer_cleanup(struct folio *folio)
+{
+	struct llist_node *node;
+
+	/*
+	 * Reuse the folio->mapping pointer as a struct llist_node, since
+	 * folio->mapping is NULL at this point.
+	 */
+	BUILD_BUG_ON(sizeof(folio->mapping) != sizeof(struct llist_node));
+	node = (struct llist_node *)&folio->mapping;
+
+	/*
+	 * Only queue work if the list was previously empty. Otherwise,
+	 * queue_work() has already been called but the workfn hasn't retrieved
+	 * the list yet.
+	 */
+	if (llist_add(node, &guestmem_hugetlb_cleanup_list))
+		queue_work(guestmem_hugetlb_wq, &guestmem_hugetlb_cleanup_work);
+}
+
+void guestmem_hugetlb_handle_folio_put(struct folio *folio)
+{
+	guestmem_hugetlb_unregister_folio_put_callback(folio);
+
+	/*
+	 * folio_put() can be called in interrupt context, hence do the work
+	 * outside of interrupt context.
+	 */
+	guestmem_hugetlb_defer_cleanup(folio);
+}
+
+/*
+ * Stash existing hugetlb metadata. Use this function just before splitting a
+ * hugetlb page.
+ */
+static inline void
+__guestmem_hugetlb_stash_metadata(struct guestmem_hugetlb_metadata *metadata,
+				  struct folio *folio)
+{
+	/*
+	 * (folio->page + 1) doesn't have to be stashed since those fields are
+	 * known on split/reconstruct and will be reinitialized anyway.
+	 */
+
+	/*
+	 * subpool is created for every guest_memfd inode, but the folios will
+	 * outlive the inode, hence we store the subpool here.
+	 */
+	metadata->_hugetlb_subpool = folio->_hugetlb_subpool;
+	/*
+	 * _hugetlb_cgroup has to be stored for freeing
+	 * later. _hugetlb_cgroup_rsvd does not, since it is NULL for
+	 * guest_memfd folios anyway. guest_memfd reservations are handled in
+	 * the inode.
+	 */
+	metadata->_hugetlb_cgroup = folio->_hugetlb_cgroup;
+	metadata->_hugetlb_hwpoison = folio->_hugetlb_hwpoison;
+
+	/*
+	 * HugeTLB flags are stored in folio->private. Stash them so that
+	 * ->private can be used by core-mm.
+	 */
+	metadata->private = folio->private;
+}
+
+static int guestmem_hugetlb_stash_metadata(struct folio *folio)
+{
+	XA_STATE(xas, &guestmem_hugetlb_stash, 0);
+	struct guestmem_hugetlb_stash_item *stash;
+	void *entry;
+
+	stash = kzalloc(sizeof(*stash), GFP_KERNEL);
+	if (!stash)
+		return -ENOMEM;
+
+	stash->h = folio_hstate(folio);
+	__guestmem_hugetlb_stash_metadata(&stash->hugetlb_metadata, folio);
+
+	xas_set_order(&xas, folio_pfn(folio), folio_order(folio));
+
+	xas_lock(&xas);
+	entry = xas_store(&xas, stash);
+	xas_unlock(&xas);
+
+	if (xa_is_err(entry)) {
+		kfree(stash);
+		return xa_err(entry);
+	}
+
+	return 0;
+}
+
+static inline void
+__guestmem_hugetlb_unstash_metadata(struct guestmem_hugetlb_metadata *metadata,
+				    struct folio *folio)
+{
+	folio->_hugetlb_subpool = metadata->_hugetlb_subpool;
+	folio->_hugetlb_cgroup = metadata->_hugetlb_cgroup;
+	folio->_hugetlb_cgroup_rsvd = NULL;
+	folio->_hugetlb_hwpoison = metadata->_hugetlb_hwpoison;
+
+	folio_change_private(folio, metadata->private);
+}
+
+static int guestmem_hugetlb_unstash_free_metadata(struct folio *folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	unsigned long pfn;
+
+	pfn = folio_pfn(folio);
+
+	stash = xa_erase(&guestmem_hugetlb_stash, pfn);
+	__guestmem_hugetlb_unstash_metadata(&stash->hugetlb_metadata, folio);
+
+	kfree(stash);
+
+	return 0;
+}
+
+/**
+ * guestmem_hugetlb_split_folio() - Split a HugeTLB @folio to PAGE_SIZE pages.
+ *
+ * @folio: The folio to be split.
+ *
+ * Context: Before splitting, the folio must have a refcount of 0. After
+ *          splitting, each split folio has a refcount of 0.
+ * Return: 0 on success and negative error otherwise.
+ */
+static int guestmem_hugetlb_split_folio(struct folio *folio)
+{
+	long orig_nr_pages;
+	int ret;
+	int i;
+
+	if (folio_size(folio) == PAGE_SIZE)
+		return 0;
+
+	orig_nr_pages = folio_nr_pages(folio);
+	ret = guestmem_hugetlb_stash_metadata(folio);
+	if (ret)
+		return ret;
+
+	/*
+	 * hugetlb_vmemmap_restore_folio() has to be called ahead of the rest
+	 * because it checks the page type. This doesn't actually split the
+	 * folio, so the first few struct pages are still intact.
+	 */
+	ret = hugetlb_vmemmap_restore_folio(folio_hstate(folio), folio);
+	if (ret)
+		goto err;
+
+	/*
+	 * Can clear without lock because this will not race with the folio
+	 * being mapped. folio's page type is overlaid with mapcount and so in
+	 * other cases it's necessary to take hugetlb_lock to prevent races with
+	 * mapcount increasing.
+	 */
+	__folio_clear_hugetlb(folio);
+
+	/*
+	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * longer a HugeTLB page. The other split pages should not be on any
+	 * lists.
+	 */
+	hugetlb_folio_list_del(folio);
+
+	/* Actually split page by undoing prep_compound_page() */
+	__folio_clear_head(folio);
+
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	/*
+	 * Zero out _nr_pages, otherwise this overlaps with memcg_data,
+	 * resulting in lookups on false memcg_data. _nr_pages doesn't have to
+	 * be set to 1 because folio_nr_pages() relies on the presence of the
+	 * head flag to return 1 for nr_pages.
+	 */
+	folio->_nr_pages = 0;
+#endif
+
+	for (i = 1; i < orig_nr_pages; ++i) {
+		struct page *p = folio_page(folio, i);
+
+		/* Copy flags from the first page to split pages. */
+		p->flags = folio->flags;
+
+		p->mapping = NULL;
+		clear_compound_head(p);
+	}
+
+	return 0;
+
+err:
+	guestmem_hugetlb_unstash_free_metadata(folio);
+
+	return ret;
+}
+
+/**
+ * guestmem_hugetlb_merge_folio() - Merge a HugeTLB folio from the folio
+ * beginning @first_folio.
+ *
+ * @first_folio: the first folio in a contiguous block of folios to be merged.
+ *
+ * The size of the contiguous block is tracked in guestmem_hugetlb_stash.
+ *
+ * Context: The first folio is checked to have a refcount of 0 before
+ *          reconstruction. After reconstruction, the reconstructed folio has a
+ *          refcount of 0.
+ */
+static void guestmem_hugetlb_merge_folio(struct folio *first_folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	struct hstate *h;
+
+	stash = xa_load(&guestmem_hugetlb_stash, folio_pfn(first_folio));
+	h = stash->h;
+
+	/*
+	 * This is the step that does the merge. prep_compound_page() will write
+	 * to pages 1 and 2 as well, so guestmem_hugetlb_unstash_free_metadata()
+	 * has to come after this.
+	 */
+	prep_compound_page(&first_folio->page, huge_page_order(h));
+
+	WARN_ON(guestmem_hugetlb_unstash_free_metadata(first_folio));
+
+	/*
+	 * prep_compound_page() will set up mapping on tail pages. For
+	 * completeness, clear mapping on head page.
+	 */
+	first_folio->mapping = NULL;
+
+	__folio_set_hugetlb(first_folio);
+
+	hugetlb_folio_list_add(first_folio, &h->hugepage_activelist);
+
+	hugetlb_vmemmap_optimize_folio(h, first_folio);
+}
+
+static struct folio *guestmem_hugetlb_maybe_merge_folio(struct folio *folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	unsigned long first_folio_pfn;
+	struct folio *first_folio;
+	unsigned long pfn;
+	size_t nr_pages;
+
+	pfn = folio_pfn(folio);
+
+	stash = xa_load(&guestmem_hugetlb_stash, pfn);
+	nr_pages = pages_per_huge_page(stash->h);
+	if (atomic_inc_return(&stash->nr_pages_waiting_to_be_merged) < nr_pages)
+		return NULL;
+
+	first_folio_pfn = round_down(pfn, nr_pages);
+	first_folio = pfn_folio(first_folio_pfn);
+
+	guestmem_hugetlb_merge_folio(first_folio);
+
+	return first_folio;
+}
+
+static void guestmem_hugetlb_cleanup_folio(struct folio *folio)
+{
+	struct folio *merged_folio;
+
+	merged_folio = guestmem_hugetlb_maybe_merge_folio(folio);
+	if (merged_folio)
+		__folio_put(merged_folio);
+}
+
+static void guestmem_hugetlb_cleanup_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+
+	node = llist_del_all(&guestmem_hugetlb_cleanup_list);
+	while (node) {
+		struct folio *folio;
+
+		folio = container_of((struct address_space **)node,
+				     struct folio, mapping);
+
+		node = node->next;
+		folio->mapping = NULL;
+
+		guestmem_hugetlb_cleanup_folio(folio);
+	}
+}
+
+static int __init guestmem_hugetlb_init(void)
+{
+	INIT_WORK(&guestmem_hugetlb_cleanup_work, guestmem_hugetlb_cleanup_workfn);
+
+	guestmem_hugetlb_wq = alloc_workqueue("guestmem_hugetlb",
+					      WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!guestmem_hugetlb_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+subsys_initcall(guestmem_hugetlb_init);
+
 static void *guestmem_hugetlb_setup(size_t size, u64 flags)
 
 {
@@ -164,10 +494,19 @@ static struct folio *guestmem_hugetlb_alloc_folio(void *priv)
 	return ERR_PTR(-ENOMEM);
 }
 
+static void guestmem_hugetlb_free_folio(struct folio *folio)
+{
+	if (xa_load(&guestmem_hugetlb_stash, folio_pfn(folio)))
+		guestmem_hugetlb_register_folio_put_callback(folio);
+}
+
 const struct guestmem_allocator_operations guestmem_hugetlb_ops = {
 	.inode_setup = guestmem_hugetlb_setup,
 	.inode_teardown = guestmem_hugetlb_teardown,
 	.alloc_folio = guestmem_hugetlb_alloc_folio,
+	.split_folio = guestmem_hugetlb_split_folio,
+	.merge_folio = guestmem_hugetlb_merge_folio,
+	.free_folio = guestmem_hugetlb_free_folio,
 	.nr_pages_in_folio = guestmem_hugetlb_nr_pages_in_folio,
 };
 EXPORT_SYMBOL_GPL(guestmem_hugetlb_ops);
-- 
2.49.0.1045.g170613ef41-goog
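
Illustration only, not part of the patch: a minimal userspace sketch of the
merge trigger used by guestmem_hugetlb_maybe_merge_folio(), assuming C11
atomics in place of atomic_t and a power-of-two page count. The names
struct stash_model, first_pfn_of() and page_freed() are hypothetical. Each
freed split page bumps a shared counter; only the caller that brings the
counter up to the number of pages in the block goes on to merge, starting
from the block's first pfn.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-huge-page stash entry. */
struct stash_model {
	size_t nr_pages;	/* base pages per huge page, power of two */
	atomic_int nr_waiting;	/* split pages freed so far */
};

/* Round pfn down to a multiple of nr_pages (power of two assumed). */
static unsigned long first_pfn_of(unsigned long pfn, size_t nr_pages)
{
	return pfn & ~((unsigned long)nr_pages - 1);
}

/*
 * Record that the page at pfn was freed. Returns 1 and stores the first pfn
 * of the block once every page in the block has been freed, else returns 0.
 */
static int page_freed(struct stash_model *stash, unsigned long pfn,
		      unsigned long *first_pfn)
{
	/* fetch_add returns the old value, so add 1 for the new count. */
	int freed = atomic_fetch_add(&stash->nr_waiting, 1) + 1;

	if ((size_t)freed < stash->nr_pages)
		return 0;

	*first_pfn = first_pfn_of(pfn, stash->nr_pages);
	return 1;
}
```

With nr_pages set to 4, the first three calls return 0 whatever order the
pages are freed in, and the fourth returns 1 with the block's first pfn.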