From: Ackerley Tng
Date: Tue, 10 Sep 2024 23:43:45 +0000
Subject: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
Message-ID: <3fec11d8a007505405eadcf2b3e10ec9051cf6bf.1726009989.git.ackerleytng@google.com>
To: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk,
	jgg@nvidia.com, peterx@redhat.com, david@redhat.com,
	rientjes@google.com, fvdl@google.com, jthoughton@google.com,
	seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com,
	fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com,
	muchun.song@linux.dev, mike.kravetz@oracle.com
Cc: erdemaktas@google.com, vannapurve@google.com, ackerleytng@google.com,
	qperret@google.com, jhubbard@nvidia.com, willy@infradead.org,
	shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com,
	kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org,
	richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com,
	ajones@ventanamicro.com, vkuznets@redhat.com,
	maciej.wieczor-retman@intel.com, pgonda@google.com,
	oliver.upton@linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org

First stage of hugetlb support: add initialization and cleanup
routines.

After guest_mem was massaged to use guest_mem inodes instead of
anonymous inodes in an earlier patch, the .evict_inode handler can now
be overridden to do hugetlb metadata cleanup.

Signed-off-by: Ackerley Tng
---
 include/uapi/linux/kvm.h |  26 ++++++
 virt/kvm/guest_memfd.c   | 177 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 197 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..77de7c4432f6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 
 #define KVM_API_VERSION 12
 
@@ -1558,6 +1559,31 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
 
+#define KVM_GUEST_MEMFD_HUGETLB (1ULL << 1)
+
+/*
+ * Huge page size encoding when KVM_GUEST_MEMFD_HUGETLB is specified, and a huge
+ * page size other than the default is desired.  See hugetlb_encode.h.  All
+ * known huge page size encodings are provided here.  It is the responsibility
+ * of the application to know which sizes are supported on the running system.
+ * See mmap(2) man page for details.
+ */
+#define KVM_GUEST_MEMFD_HUGE_SHIFT	HUGETLB_FLAG_ENCODE_SHIFT
+#define KVM_GUEST_MEMFD_HUGE_MASK	HUGETLB_FLAG_ENCODE_MASK
+
+#define KVM_GUEST_MEMFD_HUGE_64KB	HUGETLB_FLAG_ENCODE_64KB
+#define KVM_GUEST_MEMFD_HUGE_512KB	HUGETLB_FLAG_ENCODE_512KB
+#define KVM_GUEST_MEMFD_HUGE_1MB	HUGETLB_FLAG_ENCODE_1MB
+#define KVM_GUEST_MEMFD_HUGE_2MB	HUGETLB_FLAG_ENCODE_2MB
+#define KVM_GUEST_MEMFD_HUGE_8MB	HUGETLB_FLAG_ENCODE_8MB
+#define KVM_GUEST_MEMFD_HUGE_16MB	HUGETLB_FLAG_ENCODE_16MB
+#define KVM_GUEST_MEMFD_HUGE_32MB	HUGETLB_FLAG_ENCODE_32MB
+#define KVM_GUEST_MEMFD_HUGE_256MB	HUGETLB_FLAG_ENCODE_256MB
+#define KVM_GUEST_MEMFD_HUGE_512MB	HUGETLB_FLAG_ENCODE_512MB
+#define KVM_GUEST_MEMFD_HUGE_1GB	HUGETLB_FLAG_ENCODE_1GB
+#define KVM_GUEST_MEMFD_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
+#define KVM_GUEST_MEMFD_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
+
 struct kvm_create_guest_memfd {
 	__u64 size;
 	__u64 flags;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5d7fd1f708a6..31e1115273e1 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -18,6 +19,16 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_hugetlb {
+	struct hstate *h;
+	struct hugepage_subpool *spool;
+};
+
+static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -154,6 +165,82 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 	}
 }
 
+static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
+{
+	folio_lock(folio);
+
+	folio_clear_dirty(folio);
+	folio_clear_uptodate(folio);
+	filemap_remove_folio(folio);
+
+	folio_unlock(folio);
+}
+
+/**
+ * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapping),
+ * returning the number of pages freed.
+ */
+static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
+						  struct hstate *h,
+						  loff_t lstart, loff_t lend)
+{
+	const pgoff_t end = lend >> PAGE_SHIFT;
+	pgoff_t next = lstart >> PAGE_SHIFT;
+	struct folio_batch fbatch;
+	int num_freed = 0;
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		int i;
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio;
+			pgoff_t hindex;
+			u32 hash;
+
+			folio = fbatch.folios[i];
+			hindex = folio->index >> huge_page_order(h);
+			hash = hugetlb_fault_mutex_hash(mapping, hindex);
+
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			kvm_gmem_hugetlb_filemap_remove_folio(folio);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+			num_freed++;
+		}
+		folio_batch_release(&fbatch);
+		cond_resched();
+	}
+
+	return num_freed;
+}
+
+/**
+ * Removes folios in range [@lstart, @lend) from page cache of inode, updates
+ * inode metadata and hugetlb reservations.
+ */
+static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
+						   loff_t lstart, loff_t lend)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	struct hstate *h;
+	int gbl_reserve;
+	int num_freed;
+
+	hgmem = kvm_gmem_hgmem(inode);
+	h = hgmem->h;
+
+	num_freed = kvm_gmem_hugetlb_filemap_remove_folios(inode->i_mapping,
+							   h, lstart, lend);
+
+	gbl_reserve = hugepage_subpool_put_pages(hgmem->spool, num_freed);
+	hugetlb_acct_memory(h, -gbl_reserve);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks -= blocks_per_huge_page(h) * num_freed;
+	spin_unlock(&inode->i_lock);
+}
+
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
@@ -307,8 +394,33 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static void kvm_gmem_hugetlb_teardown(struct inode *inode)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+
+	truncate_inode_pages_final_prepare(inode->i_mapping);
+	kvm_gmem_hugetlb_truncate_folios_range(inode, 0, LLONG_MAX);
+
+	hgmem = kvm_gmem_hgmem(inode);
+	hugepage_put_subpool(hgmem->spool);
+	kfree(hgmem);
+}
+
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	u64 flags = (u64)inode->i_private;
+
+	if (flags & KVM_GUEST_MEMFD_HUGETLB)
+		kvm_gmem_hugetlb_teardown(inode);
+	else
+		truncate_inode_pages_final(inode->i_mapping);
+
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
 	.statfs		= simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -431,6 +543,42 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	struct hugepage_subpool *spool;
+	int page_size_log;
+	struct hstate *h;
+	long hpages;
+
+	page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK;
+	h = hstate_sizelog(page_size_log);
+
+	/* Round up to accommodate size requests that don't align with huge pages */
+	hpages = round_up(size, huge_page_size(h)) >> huge_page_shift(h);
+
+	spool = hugepage_new_subpool(h, hpages, hpages, false);
+	if (!spool)
+		goto err;
+
+	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
+	if (!hgmem)
+		goto err_subpool;
+
+	inode->i_blkbits = huge_page_shift(h);
+
+	hgmem->h = h;
+	hgmem->spool = spool;
+	inode->i_mapping->i_private_data = hgmem;
+
+	return 0;
+
+err_subpool:
+	kfree(spool);
+err:
+	return -ENOMEM;
+}
+
 static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
@@ -443,9 +591,13 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 		return inode;
 
 	err = security_inode_init_security_anon(inode, &qname, NULL);
-	if (err) {
-		iput(inode);
-		return ERR_PTR(err);
+	if (err)
+		goto out;
+
+	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
+		err = kvm_gmem_hugetlb_setup(inode, size, flags);
+		if (err)
+			goto out;
 	}
 
 	inode->i_private = (void *)(unsigned long)flags;
@@ -459,6 +611,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
@@ -526,14 +683,22 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	return err;
 }
 
+#define KVM_GUEST_MEMFD_ALL_FLAGS KVM_GUEST_MEMFD_HUGETLB
+
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
 
-	if (flags & ~valid_flags)
-		return -EINVAL;
+	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
+		/* Allow huge page size encoding in flags */
+		if (flags & ~(KVM_GUEST_MEMFD_ALL_FLAGS |
+			      (KVM_GUEST_MEMFD_HUGE_MASK << KVM_GUEST_MEMFD_HUGE_SHIFT)))
+			return -EINVAL;
+	} else {
+		if (flags & ~KVM_GUEST_MEMFD_ALL_FLAGS)
+			return -EINVAL;
+	}
 
 	if (size <= 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
-- 
2.46.0.598.g6f2099f65c-goog
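
Illustration only, not part of the patch: a minimal userspace sketch of
how the flags word appears to be interpreted, assuming the encoding
constants carry the values found in
include/uapi/asm-generic/hugetlb_encode.h (shift 26, mask 0x3f, 2MB
encoded as 21, 1GB as 30). Every EX_* macro and ex_* function is a
hypothetical stand-in. The first helper mirrors the flags check in
kvm_gmem_create(), the second the page_size_log extraction, and the
third the round-up in kvm_gmem_hugetlb_setup().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed to match include/uapi/asm-generic/hugetlb_encode.h. */
#define EX_HUGE_SHIFT	26
#define EX_HUGE_MASK	0x3fULL
#define EX_HUGE_2MB	(21ULL << EX_HUGE_SHIFT)
#define EX_HUGE_1GB	(30ULL << EX_HUGE_SHIFT)

/* Stand-ins for KVM_GUEST_MEMFD_HUGETLB and KVM_GUEST_MEMFD_ALL_FLAGS. */
#define EX_HUGETLB	(1ULL << 1)
#define EX_ALL_FLAGS	EX_HUGETLB

/*
 * Same acceptance rule as the check in kvm_gmem_create(): the size
 * encoding bits are only tolerated when the hugetlb flag is set.
 */
static int ex_check_flags(uint64_t flags)
{
	uint64_t valid = EX_ALL_FLAGS;

	if (flags & EX_HUGETLB)
		valid |= EX_HUGE_MASK << EX_HUGE_SHIFT;

	return (flags & ~valid) ? -EINVAL : 0;
}

/* log2 of the requested huge page size; 0 requests the default size. */
static unsigned int ex_page_size_log(uint64_t flags)
{
	return (unsigned int)((flags >> EX_HUGE_SHIFT) & EX_HUGE_MASK);
}

/* Huge pages needed to back size bytes, rounding a partial page up. */
static uint64_t ex_hpages(uint64_t size, unsigned int shift)
{
	uint64_t hsize = 1ULL << shift;

	return (size + hsize - 1) >> shift;
}
```

Under these assumptions, a 5 MiB request with EX_HUGETLB | EX_HUGE_2MB
passes validation, decodes to a page shift of 21, and reserves three
2 MiB pages, while EX_HUGE_2MB without EX_HUGETLB is rejected.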