From nobody Sat Nov 30 10:28:51 2024
Date: Tue, 10 Sep 2024 23:44:10 +0000
Message-ID: <38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com>
Subject: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
From: Ackerley Tng
To: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com, fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev, mike.kravetz@oracle.com
Cc: erdemaktas@google.com, vannapurve@google.com, ackerleytng@google.com, qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org

From: Vishal Annapurve

The faultability of a page is used to determine whether to split or
reconstruct a HugeTLB page.

If any page in a folio is faultable, split the folio. If none of the
pages in a folio are faultable, reconstruct the folio.

On truncation, always reconstruct and free regardless of faultability
(as long as a HugeTLB page's worth of pages is truncated).

Co-developed-by: Vishal Annapurve
Signed-off-by: Vishal Annapurve
Co-developed-by: Ackerley Tng
Signed-off-by: Ackerley Tng
---
 virt/kvm/guest_memfd.c | 678 +++++++++++++++++++++++++++--------------
 1 file changed, 456 insertions(+), 222 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fb292e542381..0afc111099c0 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -99,6 +99,23 @@ static bool kvm_gmem_is_faultable(struct inode *inode, pgoff_t index)
 	return xa_to_value(xa_load(faultability, index)) == KVM_GMEM_FAULTABILITY_VALUE;
 }
 
+/**
+ * Return true if any of the @nr_pages beginning at @index is allowed to be
+ * faulted in.
+ */
+static bool kvm_gmem_is_any_faultable(struct inode *inode, pgoff_t index,
+				      int nr_pages)
+{
+	pgoff_t i;
+
+	for (i = index; i < index + nr_pages; ++i) {
+		if (kvm_gmem_is_faultable(inode, i))
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -312,6 +329,40 @@ static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
 	return 0;
 }
 
+static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
+{
+	folio_lock(folio);
+
+	folio_clear_dirty(folio);
+	folio_clear_uptodate(folio);
+	filemap_remove_folio(folio);
+
+	folio_unlock(folio);
+}
+
+/*
+ * Locks a block of nr_pages (1 << huge_page_order(h)) pages within @mapping
+ * beginning at @index. Take either this or filemap_invalidate_lock() whenever
+ * the filemap is accessed.
+ */
+static u32 hugetlb_fault_mutex_lock(struct address_space *mapping, pgoff_t index)
+{
+	pgoff_t hindex;
+	u32 hash;
+
+	hindex = index >> huge_page_order(kvm_gmem_hgmem(mapping->host)->h);
+	hash = hugetlb_fault_mutex_hash(mapping, hindex);
+
+	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+	return hash;
+}
+
+static void hugetlb_fault_mutex_unlock(u32 hash)
+{
+	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+}
+
 struct kvm_gmem_split_stash {
 	struct {
 		unsigned long _flags_2;
@@ -394,15 +445,136 @@ static int kvm_gmem_hugetlb_reconstruct_folio(struct hstate *h, struct folio *fo
 	}
 
 	__folio_set_hugetlb(folio);
-
-	folio_set_count(folio, 1);
+	hugetlb_folio_list_add(folio, &h->hugepage_activelist);
 
 	hugetlb_vmemmap_optimize_folio(h, folio);
 
+	folio_set_count(folio, 1);
+
 	return 0;
 }
 
-/* Basically folio_set_order(folio, 1) without the checks. */
+/**
+ * Reconstruct a HugeTLB folio out of folio_nr_pages(@first_folio) pages. Will
+ * clean up subfolios from filemap and add back the reconstructed folio. Folios
+ * to be reconstructed must not be locked, and reconstructed folio will not be
+ * locked. Return 0 on success or negative error otherwise.
+ *
+ * hugetlb_fault_mutex_lock() has to be held when calling this function.
+ *
+ * Expects that before this call, the filemap's refcounts are the only refcounts
+ * for the folios in the filemap. After this function returns, the filemap's
+ * refcount will be the only refcount on the reconstructed folio.
+ */
+static int kvm_gmem_reconstruct_folio_in_filemap(struct hstate *h,
+						 struct folio *first_folio)
+{
+	struct address_space *mapping;
+	struct folio_batch fbatch;
+	unsigned long end;
+	pgoff_t index;
+	pgoff_t next;
+	int ret;
+	int i;
+
+	if (folio_order(first_folio) == huge_page_order(h))
+		return 0;
+
+	index = first_folio->index;
+	mapping = first_folio->mapping;
+
+	next = index;
+	end = index + (1UL << huge_page_order(h));
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio;
+
+			folio = fbatch.folios[i];
+
+			/*
+			 * Before removing from filemap, take a reference so
+			 * sub-folios don't get freed when removing from
+			 * filemap.
+			 */
+			folio_get(folio);
+
+			kvm_gmem_hugetlb_filemap_remove_folio(folio);
+		}
+		folio_batch_release(&fbatch);
+	}
+
+	ret = kvm_gmem_hugetlb_reconstruct_folio(h, first_folio);
+	if (ret) {
+		/* TODO: handle cleanup properly. */
+		WARN_ON(ret);
+		return ret;
+	}
+
+	kvm_gmem_hugetlb_filemap_add_folio(mapping, first_folio, index,
+					   htlb_alloc_mask(h));
+
+	folio_unlock(first_folio);
+	folio_put(first_folio);
+
+	return ret;
+}
+
+/**
+ * Reconstruct any HugeTLB folios in range [@start, @end), if all the subfolios
+ * are not faultable. Return 0 on success or negative error otherwise.
+ *
+ * Will skip any folios that are already reconstructed.
+ */
+static int kvm_gmem_try_reconstruct_folios_range(struct inode *inode,
+						 pgoff_t start, pgoff_t end)
+{
+	unsigned int nr_pages;
+	pgoff_t aligned_start;
+	pgoff_t aligned_end;
+	struct hstate *h;
+	pgoff_t index;
+	int ret;
+
+	if (!is_kvm_gmem_hugetlb(inode))
+		return 0;
+
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
+
+	aligned_start = round_up(start, nr_pages);
+	aligned_end = round_down(end, nr_pages);
+
+	ret = 0;
+	for (index = aligned_start; !ret && index < aligned_end; index += nr_pages) {
+		struct folio *folio;
+		u32 hash;
+
+		hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
+
+		folio = filemap_get_folio(inode->i_mapping, index);
+		if (!IS_ERR(folio)) {
+			/*
+			 * Drop refcount because reconstruction expects an equal number
+			 * of refcounts for all subfolios - just keep the refcount taken
+			 * by the filemap.
+			 */
+			folio_put(folio);
+
+			/* Merge only when the entire block of nr_pages is not faultable. */
+			if (!kvm_gmem_is_any_faultable(inode, index, nr_pages)) {
+				ret = kvm_gmem_reconstruct_folio_in_filemap(h, folio);
+				WARN_ON(ret);
+			}
+		}
+
+		hugetlb_fault_mutex_unlock(hash);
+	}
+
+	return ret;
+}
+
+/* Basically folio_set_order() without the checks. */
 static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int order)
 {
 	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
@@ -414,8 +586,8 @@ static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int or
 /**
  * Split a HugeTLB @folio of size huge_page_size(@h).
  *
- * After splitting, each split folio has a refcount of 1. There are no checks on
- * refcounts before splitting.
+ * Folio must have refcount of 1 when this function is called. After splitting,
+ * each split folio has a refcount of 1.
  *
  * Return 0 on success and negative error otherwise.
  */
@@ -423,14 +595,18 @@ static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
 {
 	int ret;
 
+	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio) != 1, folio);
+
+	folio_set_count(folio, 0);
+
 	ret = hugetlb_vmemmap_restore_folio(h, folio);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = kvm_gmem_hugetlb_stash_metadata(folio);
 	if (ret) {
 		hugetlb_vmemmap_optimize_folio(h, folio);
-		return ret;
+		goto out;
 	}
 
 	kvm_gmem_folio_set_order(folio, 0);
@@ -439,109 +615,183 @@ static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
 	__folio_clear_hugetlb(folio);
 
 	/*
-	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * Remove the original folio from h->hugepage_activelist since it is no
 	 * longer a HugeTLB page. The other split pages should not be on any
 	 * lists.
 	 */
 	hugetlb_folio_list_del(folio);
 
-	return 0;
+	ret = 0;
+out:
+	folio_set_count(folio, 1);
+	return ret;
 }
 
-static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
-							    pgoff_t index)
+/**
+ * Split a HugeTLB folio into folio_nr_pages(@folio) pages. Will clean up folio
+ * from filemap and add back the split folios. @folio must not be locked, and
+ * all split folios will not be locked. Return 0 on success or negative error
+ * otherwise.
+ *
+ * hugetlb_fault_mutex_lock() has to be held when calling this function.
+ *
+ * Expects that before this call, the filemap's refcounts are the only refcounts
+ * for the folio. After this function returns, the filemap's refcounts will be
+ * the only refcounts on the split folios.
+ */
+static int kvm_gmem_split_folio_in_filemap(struct hstate *h, struct folio *folio)
 {
-	struct folio *allocated_hugetlb_folio;
-	pgoff_t hugetlb_first_subpage_index;
-	struct page *hugetlb_first_subpage;
-	struct kvm_gmem_hugetlb *hgmem;
-	struct page *requested_page;
+	struct address_space *mapping;
+	struct page *first_subpage;
+	pgoff_t index;
 	int ret;
 	int i;
 
-	hgmem = kvm_gmem_hgmem(inode);
-	allocated_hugetlb_folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
-	if (IS_ERR(allocated_hugetlb_folio))
-		return allocated_hugetlb_folio;
+	if (folio_order(folio) == 0)
+		return 0;
 
-	requested_page = folio_file_page(allocated_hugetlb_folio, index);
-	hugetlb_first_subpage = folio_file_page(allocated_hugetlb_folio, 0);
-	hugetlb_first_subpage_index = index & (huge_page_mask(hgmem->h) >> PAGE_SHIFT);
+	index = folio->index;
+	mapping = folio->mapping;
 
-	ret = kvm_gmem_hugetlb_split_folio(hgmem->h, allocated_hugetlb_folio);
+	first_subpage = folio_page(folio, 0);
+
+	/*
+	 * Take reference so that folio will not be released when removed from
+	 * filemap.
+	 */
+	folio_get(folio);
+
+	kvm_gmem_hugetlb_filemap_remove_folio(folio);
+
+	ret = kvm_gmem_hugetlb_split_folio(h, folio);
 	if (ret) {
-		folio_put(allocated_hugetlb_folio);
-		return ERR_PTR(ret);
+		WARN_ON(ret);
+		kvm_gmem_hugetlb_filemap_add_folio(mapping, folio, index,
+						   htlb_alloc_mask(h));
+		folio_put(folio);
+		return ret;
 	}
 
-	for (i = 0; i < pages_per_huge_page(hgmem->h); ++i) {
-		struct folio *folio = page_folio(nth_page(hugetlb_first_subpage, i));
+	for (i = 0; i < pages_per_huge_page(h); ++i) {
+		struct folio *folio = page_folio(nth_page(first_subpage, i));
 
-		ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping,
-							 folio,
-							 hugetlb_first_subpage_index + i,
-							 htlb_alloc_mask(hgmem->h));
+		ret = kvm_gmem_hugetlb_filemap_add_folio(mapping, folio,
							 index + i,
							 htlb_alloc_mask(h));
 		if (ret) {
 			/* TODO: handle cleanup properly. */
-			pr_err("Handle cleanup properly index=%lx, ret=%d\n",
-			       hugetlb_first_subpage_index + i, ret);
-			dump_page(nth_page(hugetlb_first_subpage, i), "check");
-			return ERR_PTR(ret);
+			WARN_ON(ret);
+			return ret;
 		}
 
+		folio_unlock(folio);
+
 		/*
-		 * Skip unlocking for the requested index since
-		 * kvm_gmem_get_folio() returns a locked folio.
-		 *
-		 * Do folio_put() to drop the refcount that came with the folio,
-		 * from splitting the folio. Splitting the folio has a refcount
-		 * to be in line with hugetlb_alloc_folio(), which returns a
-		 * folio with refcount 1.
-		 *
-		 * Skip folio_put() for requested index since
-		 * kvm_gmem_get_folio() returns a folio with refcount 1.
+		 * Drop reference so that the only remaining reference is the
+		 * one held by the filemap.
 		 */
-		if (hugetlb_first_subpage_index + i != index) {
-			folio_unlock(folio);
-			folio_put(folio);
-		}
+		folio_put(folio);
 	}
 
+	return ret;
+}
+
+/*
+ * Allocates and then caches a folio in the filemap. Returns a folio with
+ * refcount of 2: 1 after allocation, and 1 taken by the filemap.
+ */
+static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
+							    pgoff_t index)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	pgoff_t aligned_index;
+	struct folio *folio;
+	int nr_pages;
+	int ret;
+
+	hgmem = kvm_gmem_hgmem(inode);
+	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+	if (IS_ERR(folio))
+		return folio;
+
+	nr_pages = 1UL << huge_page_order(hgmem->h);
+	aligned_index = round_down(index, nr_pages);
+
+	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
+						 aligned_index,
+						 htlb_alloc_mask(hgmem->h));
+	WARN_ON(ret);
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks += blocks_per_huge_page(hgmem->h);
 	spin_unlock(&inode->i_lock);
 
-	return page_folio(requested_page);
+	return folio;
+}
+
+/**
+ * Split @folio if any of the subfolios are faultable. Returns the split
+ * (locked, refcount=2) folio at @index.
+ *
+ * Expects a locked folio with 1 refcount in addition to filemap's refcounts.
+ *
+ * After splitting, the subfolios in the filemap will be unlocked and have
+ * refcount 1 (other than the returned folio, which will be locked and have
+ * refcount 2).
+ */
+static struct folio *kvm_gmem_maybe_split_folio(struct folio *folio, pgoff_t index)
+{
+	pgoff_t aligned_index;
+	struct inode *inode;
+	struct hstate *h;
+	int nr_pages;
+	int ret;
+
+	inode = folio->mapping->host;
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
+	aligned_index = round_down(index, nr_pages);
+
+	if (!kvm_gmem_is_any_faultable(inode, aligned_index, nr_pages))
+		return folio;
+
+	/* Drop lock and refcount in preparation for splitting. */
+	folio_unlock(folio);
+	folio_put(folio);
+
+	ret = kvm_gmem_split_folio_in_filemap(h, folio);
+	if (ret) {
+		kvm_gmem_hugetlb_filemap_remove_folio(folio);
+		return ERR_PTR(ret);
+	}
+
+	/*
+	 * At this point, the filemap has the only reference on the folio. Take
+	 * lock and refcount on folio to align with kvm_gmem_get_folio().
+	 */
+	return filemap_lock_folio(inode->i_mapping, index);
 }
 
 static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
 						pgoff_t index)
 {
-	struct address_space *mapping;
 	struct folio *folio;
-	struct hstate *h;
-	pgoff_t hindex;
 	u32 hash;
 
-	h = kvm_gmem_hgmem(inode)->h;
-	hindex = index >> huge_page_order(h);
-	mapping = inode->i_mapping;
-
-	/* To lock, we calculate the hash using the hindex and not index. */
-	hash = hugetlb_fault_mutex_hash(mapping, hindex);
-	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+	hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
 
 	/*
-	 * The filemap is indexed with index and not hindex. Taking lock on
-	 * folio to align with kvm_gmem_get_regular_folio()
+	 * The filemap is indexed with index and not hindex. Take lock on folio
Take lock on folio + * to align with kvm_gmem_get_regular_folio() */ - folio =3D filemap_lock_folio(mapping, index); + folio =3D filemap_lock_folio(inode->i_mapping, index); + if (IS_ERR(folio)) + folio =3D kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index); + if (!IS_ERR(folio)) - goto out; + folio =3D kvm_gmem_maybe_split_folio(folio, index); =20 - folio =3D kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index); -out: - mutex_unlock(&hugetlb_fault_mutex_table[hash]); + hugetlb_fault_mutex_unlock(hash); =20 return folio; } @@ -610,17 +860,6 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *g= mem, pgoff_t start, } } =20 -static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *fol= io) -{ - folio_lock(folio); - - folio_clear_dirty(folio); - folio_clear_uptodate(folio); - filemap_remove_folio(folio); - - folio_unlock(folio); -} - /** * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapp= ing), * returning the number of HugeTLB pages freed. @@ -631,61 +870,30 @@ static int kvm_gmem_hugetlb_filemap_remove_folios(str= uct address_space *mapping, struct hstate *h, loff_t lstart, loff_t lend) { - const pgoff_t end =3D lend >> PAGE_SHIFT; - pgoff_t next =3D lstart >> PAGE_SHIFT; - LIST_HEAD(folios_to_reconstruct); - struct folio_batch fbatch; - struct folio *folio, *tmp; - int num_freed =3D 0; - int i; - - /* - * TODO: Iterate over huge_page_size(h) blocks to avoid taking and - * releasing hugetlb_fault_mutex_table[hash] lock so often. When - * truncating, lstart and lend should be clipped to the size of this - * guest_memfd file, otherwise there would be too many iterations. - */ - folio_batch_init(&fbatch); - while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) { - for (i =3D 0; i < folio_batch_count(&fbatch); ++i) { - struct folio *folio; - pgoff_t hindex; - u32 hash; - - folio =3D fbatch.folios[i]; + loff_t offset; + int num_freed; =20 - hindex =3D folio->index >> huge_page_order(h); - hash =3D hugetlb_fault_mutex_hash(mapping, hindex); - mutex_lock(&hugetlb_fault_mutex_table[hash]); + num_freed =3D 0; + for (offset =3D lstart; offset < lend; offset +=3D huge_page_size(h)) { + struct folio *folio; + pgoff_t index; + u32 hash; =20 - /* - * Collect first pages of HugeTLB folios for - * reconstruction later. - */ - if ((folio->index & ~(huge_page_mask(h) >> PAGE_SHIFT)) =3D=3D 0) - list_add(&folio->lru, &folios_to_reconstruct); + index =3D offset >> PAGE_SHIFT; + hash =3D hugetlb_fault_mutex_lock(mapping, index); =20 - /* - * Before removing from filemap, take a reference so - * sub-folios don't get freed. Don't free the sub-folios - * until after reconstruction. - */ - folio_get(folio); + folio =3D filemap_get_folio(mapping, index); + if (!IS_ERR(folio)) { + /* Drop refcount so that filemap holds only reference. */ + folio_put(folio); =20 + kvm_gmem_reconstruct_folio_in_filemap(h, folio); kvm_gmem_hugetlb_filemap_remove_folio(folio); =20 - mutex_unlock(&hugetlb_fault_mutex_table[hash]); + num_freed++; } - folio_batch_release(&fbatch); - cond_resched(); - } - - list_for_each_entry_safe(folio, tmp, &folios_to_reconstruct, lru) { - kvm_gmem_hugetlb_reconstruct_folio(h, folio); - hugetlb_folio_list_move(folio, &h->hugepage_activelist); =20 - folio_put(folio); - num_freed++; + hugetlb_fault_mutex_unlock(hash); } =20 return num_freed; @@ -705,6 +913,10 @@ static void kvm_gmem_hugetlb_truncate_folios_range(str= uct inode *inode, int gbl_reserve; int num_freed; =20 + /* No point truncating more than inode size. 
+	lstart = min(lstart, inode->i_size);
+	lend = min(lend, inode->i_size);
+
 	hgmem = kvm_gmem_hgmem(inode);
 	h = hgmem->h;
 
@@ -1042,13 +1254,27 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	bool is_prepared;
 
 	inode = file_inode(vmf->vma->vm_file);
-	if (!kvm_gmem_is_faultable(inode, vmf->pgoff))
+
+	/*
+	 * Use filemap_invalidate_lock_shared() to make sure
+	 * kvm_gmem_get_folio() doesn't race with faultability updates.
+	 */
+	filemap_invalidate_lock_shared(inode->i_mapping);
+
+	if (!kvm_gmem_is_faultable(inode, vmf->pgoff)) {
+		filemap_invalidate_unlock_shared(inode->i_mapping);
 		return VM_FAULT_SIGBUS;
+	}
 
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
 	if (!folio)
 		return VM_FAULT_SIGBUS;
 
+	WARN(folio_test_hugetlb(folio), "should not be faulting in hugetlb folio=%p\n", folio);
+
 	is_prepared = folio_test_uptodate(folio);
 	if (!is_prepared) {
 		unsigned long nr_pages;
@@ -1731,8 +1957,6 @@ static bool kvm_gmem_no_mappings_range(struct inode *inode, pgoff_t start, pgoff
 	pgoff_t index;
 	bool checked_indices_unmapped;
 
-	filemap_invalidate_lock_shared(inode->i_mapping);
-
 	/* TODO: replace iteration with filemap_get_folios() for efficiency. */
 	checked_indices_unmapped = true;
 	for (index = start; checked_indices_unmapped && index < end;) {
@@ -1754,98 +1978,130 @@ static bool kvm_gmem_no_mappings_range(struct inode *inode, pgoff_t start, pgoff
 		folio_put(folio);
 	}
 
-	filemap_invalidate_unlock_shared(inode->i_mapping);
 	return checked_indices_unmapped;
 }
 
 /**
- * Returns true if pages in range [@start, @end) in memslot @slot have no
- * userspace mappings.
+ * Split any HugeTLB folios in range [@start, @end), if any of the offsets in
+ * the folio are faultable. Return 0 on success or negative error otherwise.
+ *
+ * Will skip any folios that are already split.
  */
-static bool kvm_gmem_no_mappings_slot(struct kvm_memory_slot *slot,
-				      gfn_t start, gfn_t end)
+static int kvm_gmem_try_split_folios_range(struct inode *inode,
+					   pgoff_t start, pgoff_t end)
 {
-	pgoff_t offset_start;
-	pgoff_t offset_end;
-	struct file *file;
-	bool ret;
-
-	offset_start = start - slot->base_gfn + slot->gmem.pgoff;
-	offset_end = end - slot->base_gfn + slot->gmem.pgoff;
-
-	file = kvm_gmem_get_file(slot);
-	if (!file)
-		return false;
-
-	ret = kvm_gmem_no_mappings_range(file_inode(file), offset_start, offset_end);
+	unsigned int nr_pages;
+	pgoff_t aligned_start;
+	pgoff_t aligned_end;
+	struct hstate *h;
+	pgoff_t index;
+	int ret;
 
-	fput(file);
+	if (!is_kvm_gmem_hugetlb(inode))
+		return 0;
 
-	return ret;
-}
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
 
-/**
- * Returns true if pages in range [@start, @end) have no host userspace mappings.
- */
-static bool kvm_gmem_no_mappings(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	int i;
+	aligned_start = round_down(start, nr_pages);
+	aligned_end = round_up(end, nr_pages);
 
-	lockdep_assert_held(&kvm->slots_lock);
+	ret = 0;
+	for (index = aligned_start; !ret && index < aligned_end; index += nr_pages) {
+		struct folio *folio;
+		u32 hash;
 
-	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
-		struct kvm_memslot_iter iter;
-		struct kvm_memslots *slots;
+		hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
 
-		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
-			struct kvm_memory_slot *slot;
-			gfn_t gfn_start;
-			gfn_t gfn_end;
-
-			slot = iter.slot;
-			gfn_start = max(start, slot->base_gfn);
-			gfn_end = min(end, slot->base_gfn + slot->npages);
+		folio = filemap_get_folio(inode->i_mapping, index);
+		if (!IS_ERR(folio)) {
+			/*
+			 * Drop refcount so that the only references held are refcounts
+			 * from the filemap.
+			 */
+			folio_put(folio);
 
-			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD &&
-			    !kvm_gmem_no_mappings_slot(iter.slot, gfn_start, gfn_end))
-				return false;
+			if (kvm_gmem_is_any_faultable(inode, index, nr_pages)) {
+				ret = kvm_gmem_split_folio_in_filemap(h, folio);
+				if (ret) {
+					/* TODO cleanup properly. */
+					WARN_ON(ret);
+				}
+			}
 		}
+
+		hugetlb_fault_mutex_unlock(hash);
 	}
 
-	return true;
+	return ret;
 }
 
 /**
- * Set faultability of given range of gfns [@start, @end) in memslot @slot to
- * @faultable.
+ * Returns 0 if guest_memfd permits setting range [@start, @end) with
+ * faultability @faultable within memslot @slot, or negative error otherwise.
+ *
+ * If a request was made to set the memory to PRIVATE (not faultable), the pages
+ * in the range must not be pinned or mapped for the request to be permitted.
+ *
+ * Because this may allow pages to be faulted in to userspace when requested to
+ * set attributes to shared, this must only be called after the pages have been
+ * invalidated from guest page tables.
  */
-static void kvm_gmem_set_faultable_slot(struct kvm_memory_slot *slot, gfn_t start,
-					gfn_t end, bool faultable)
+static int kvm_gmem_try_set_faultable_slot(struct kvm_memory_slot *slot,
+					   gfn_t start, gfn_t end,
+					   bool faultable)
 {
 	pgoff_t start_offset;
+	struct inode *inode;
 	pgoff_t end_offset;
 	struct file *file;
+	int ret;
 
 	file = kvm_gmem_get_file(slot);
 	if (!file)
-		return;
+		return 0;
 
 	start_offset = start - slot->base_gfn + slot->gmem.pgoff;
 	end_offset = end - slot->base_gfn + slot->gmem.pgoff;
 
-	WARN_ON(kvm_gmem_set_faultable(file_inode(file), start_offset, end_offset,
-				       faultable));
+	inode = file_inode(file);
+
+	/*
+	 * Use filemap_invalidate_lock_shared() to make sure
+	 * splitting/reconstruction doesn't race with faultability updates.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+
+	kvm_gmem_set_faultable(inode, start_offset, end_offset, faultable);
+
+	if (faultable) {
+		ret = kvm_gmem_try_split_folios_range(inode, start_offset,
+						      end_offset);
+	} else {
+		if (kvm_gmem_no_mappings_range(inode, start_offset, end_offset)) {
+			ret = kvm_gmem_try_reconstruct_folios_range(inode,
+								    start_offset,
								    end_offset);
+		} else {
+			ret = -EINVAL;
+		}
+	}
+
+	filemap_invalidate_unlock(inode->i_mapping);
 
 	fput(file);
+
+	return ret;
 }
 
 /**
- * Set faultability of given range of gfns [@start, @end) in memslot @slot to
- * @faultable.
+ * Returns 0 if guest_memfd permits setting range [@start, @end) with
+ * faultability @faultable within VM @kvm, or negative error otherwise.
+ *
+ * See kvm_gmem_try_set_faultable_slot() for details.
  */
-static void kvm_gmem_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
-				      bool faultable)
+static int kvm_gmem_try_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
+					 bool faultable)
 {
 	int i;
 
@@ -1866,43 +2122,15 @@ static void kvm_gmem_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
 			gfn_end = min(end, slot->base_gfn + slot->npages);
 
 			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD) {
-				kvm_gmem_set_faultable_slot(slot, gfn_start,
-							    gfn_end, faultable);
+				int ret;
+
+				ret = kvm_gmem_try_set_faultable_slot(slot, gfn_start,
+								      gfn_end, faultable);
+				if (ret)
+					return ret;
 			}
 		}
 	}
-}
-
-/**
- * Returns true if guest_memfd permits setting range [@start, @end) to PRIVATE.
- *
- * If memory is faulted in to host userspace and a request was made to set the
- * memory to PRIVATE, the faulted in pages must not be pinned for the request to
- * be permitted.
- */
-static int kvm_gmem_should_set_attributes_private(struct kvm *kvm, gfn_t start,
-						  gfn_t end)
-{
-	kvm_gmem_set_faultable_vm(kvm, start, end, false);
-
-	if (kvm_gmem_no_mappings(kvm, start, end))
-		return 0;
-
-	kvm_gmem_set_faultable_vm(kvm, start, end, true);
-	return -EINVAL;
-}
-
-/**
- * Returns true if guest_memfd permits setting range [@start, @end) to SHARED.
- *
- * Because this allows pages to be faulted in to userspace, this must only be
- * called after the pages have been invalidated from guest page tables.
- */
-static int kvm_gmem_should_set_attributes_shared(struct kvm *kvm, gfn_t start,
-						 gfn_t end)
-{
-	/* Always okay to set shared, hence set range faultable here. */
-	kvm_gmem_set_faultable_vm(kvm, start, end, true);
 
 	return 0;
 }
@@ -1922,10 +2150,16 @@ static int kvm_gmem_should_set_attributes_shared(struct kvm *kvm, gfn_t start,
 int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				   unsigned long attrs)
 {
-	if (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE)
-		return kvm_gmem_should_set_attributes_private(kvm, start, end);
-	else
-		return kvm_gmem_should_set_attributes_shared(kvm, start, end);
+	bool faultable;
+	int ret;
+
+	faultable = !(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	ret = kvm_gmem_try_set_faultable_vm(kvm, start, end, faultable);
+	if (ret)
+		WARN_ON(kvm_gmem_try_set_faultable_vm(kvm, start, end, !faultable));
+
+	return ret;
}

#endif
-- 
2.46.0.598.g6f2099f65c-goog
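
P.S. (illustration only, not part of the patch to be applied): the policy above
reduces to "split a HugeTLB folio as soon as any of its subpages is faultable by
host userspace; reconstruct it only once none of them are". Below is a minimal,
kernel-independent C sketch of that decision rule; the helper name
any_faultable() and the 512-subpage constant are assumptions made for the
example, not kernel APIs.

#include <stdbool.h>
#include <stdio.h>

#define SUBPAGES_PER_HUGEPAGE 512	/* e.g. a 2M huge page of 4K base pages */

/* Stand-in for the per-index faultability lookups the patch does via an xarray. */
static bool any_faultable(const bool *faultable, int nr_pages)
{
	for (int i = 0; i < nr_pages; i++) {
		if (faultable[i])
			return true;
	}
	return false;
}

int main(void)
{
	bool faultable[SUBPAGES_PER_HUGEPAGE] = { false };

	/* No subpage is faultable: keep (or reconstruct) the huge folio. */
	printf("split? %d\n", any_faultable(faultable, SUBPAGES_PER_HUGEPAGE));

	/* One subpage becomes faultable (shared): the folio must be split. */
	faultable[7] = true;
	printf("split? %d\n", any_faultable(faultable, SUBPAGES_PER_HUGEPAGE));

	return 0;
}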