Date: Mon, 2 Feb 2026 14:29:39 -0800
Message-ID: <6963888973817a40a5430ef369f587572eae726a.1770071243.git.ackerleytng@google.com>
Subject: [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng <ackerleytng@google.com>
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-trace-kernel@vger.kernel.org, x86@kernel.org
Cc: aik@amd.com, andrew.jones@linux.dev, binbin.wu@linux.intel.com, bp@alien8.de,
 brauner@kernel.org, chao.p.peng@intel.com, chao.p.peng@linux.intel.com,
 chenhuacai@kernel.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org,
 hpa@zytor.com, ira.weiny@intel.com, jgg@nvidia.com, jmattson@google.com,
 jroedel@suse.de, jthoughton@google.com, maobibo@loongson.cn,
 mathieu.desnoyers@efficios.com, maz@kernel.org, mhiramat@kernel.org,
 michael.roth@amd.com, mingo@redhat.com, mlevitsk@redhat.com, oupton@kernel.org,
 pankaj.gupta@amd.com, pbonzini@redhat.com, prsampat@amd.com, qperret@google.com,
 ricarkol@google.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 rostedt@goodmis.org, seanjc@google.com, shivankg@amd.com, shuah@kernel.org,
 steven.price@arm.com, tabba@google.com, tglx@linutronix.de, vannapurve@google.com,
 vbabka@suse.cz, willy@infradead.org, wyihan@google.com, yan.y.zhao@intel.com,
 Ackerley Tng <ackerleytng@google.com>

From: Sean Christopherson <seanjc@google.com>

Start plumbing in guest_memfd support for in-place private<=>shared
conversions by tracking attributes via a maple tree.

KVM currently tracks private vs. shared attributes on a per-VM basis,
which made sense when a guest_memfd _only_ supported private memory,
but tracking attributes per-VM simply can't work for in-place
conversions: the shareability of a given page needs to be tracked
per-gmem_inode, not per-VM.

Use the filemap invalidation lock to protect the maple tree. Taking
the lock for read when faulting in memory (for userspace or the
guest) isn't expected to result in meaningful contention, and using a
separate lock would add significant complexity (avoiding deadlock is
quite difficult).
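The tracking scheme amounts to storing attribute bits as xarray-style
value entries over pgoff ranges, with lookups and stores serialized by
a lock that is external to the tree. A minimal sketch of that pattern
(illustrative only, not taken from this patch; the demo_* names are
made up):

  #include <linux/maple_tree.h>
  #include <linux/xarray.h>

  /*
   * Store attribute bits covering [first, last] as a single value
   * entry. The tree's external lock (in this patch, the filemap
   * invalidate_lock registered via mt_set_external_lock()) must be
   * held by the caller.
   */
  static int demo_set_attrs(struct maple_tree *mt, pgoff_t first,
                            pgoff_t last, u64 attrs)
  {
          MA_STATE(mas, mt, first, last);

          return mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
  }

  /* Look up the attributes covering @index; 0 if never stored. */
  static u64 demo_get_attrs(struct maple_tree *mt, pgoff_t index)
  {
          void *entry = mtree_load(mt, index);

          return entry ? xa_to_value(entry) : 0;
  }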
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 134 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 118 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fdaea3422c30..7a970e5fab67 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include
 #include
 #include
+#include <linux/maple_tree.h>
 #include
 #include
 #include
@@ -32,6 +33,7 @@ struct gmem_inode {
 	struct inode vfs_inode;
 
 	u64 flags;
+	struct maple_tree attributes;
 };
 
 static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
@@ -59,6 +61,31 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
+{
+	struct maple_tree *mt = &GMEM_I(inode)->attributes;
+	void *entry = mtree_load(mt, index);
+
+	/*
+	 * The lock _must_ be held for lookups, as some maple tree operations,
+	 * e.g. append, are unsafe (return inaccurate information) with respect
+	 * to concurrent RCU-protected lookups.
+	 */
+	lockdep_assert(mt_lock_is_held(mt));
+
+	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
+}
+
+static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
+{
+	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+
+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return !kvm_gmem_is_private_mem(inode, index);
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -402,10 +429,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
-		return VM_FAULT_SIGBUS;
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	else
+		folio = ERR_PTR(-EACCES);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		if (PTR_ERR(folio) == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -561,6 +591,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 	return true;
 }
 
+static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
+{
+	struct gmem_inode *gi = GMEM_I(inode);
+	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
+	u64 attrs;
+	int r;
+
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+
+	/*
+	 * guest_memfd memory is neither migratable nor swappable: set
+	 * inaccessible to gate off both.
+	 */
+	mapping_set_inaccessible(inode->i_mapping);
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	gi->flags = flags;
+
+	mt_set_external_lock(&gi->attributes,
+			     &inode->i_mapping->invalidate_lock);
+
+	/*
+	 * Store default attributes for the entire gmem instance. Ensuring every
+	 * index is represented in the maple tree at all times simplifies the
+	 * conversion and merging logic.
+	 */
+	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	/*
+	 * Acquire the invalidation lock purely to make lockdep happy. The
+	 * maple tree library expects all stores to be protected via the lock,
+	 * and the library can't know when the tree is reachable only by the
+	 * caller, as is the case here.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	static const char *name = "[kvm-gmem]";
@@ -591,16 +666,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fops;
 	}
 
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
-	GMEM_I(inode)->flags = flags;
+	err = kvm_gmem_init_inode(inode, size, flags);
+	if (err)
+		goto err_inode;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
 	if (IS_ERR(file)) {
@@ -804,9 +872,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
 
 	if (!is_prepared)
 		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
@@ -818,6 +890,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 
+out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	return r;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
@@ -931,13 +1005,41 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 	mpol_shared_policy_init(&gi->policy, NULL);
 
+	/*
+	 * Memory attributes are protected by the filemap invalidation lock, but
+	 * the lock structure isn't available at this time. Immediately mark the
+	 * maple tree as using external locking so that accessing the tree
+	 * before it's fully initialized results in NULL pointer dereferences
+	 * and not more subtle bugs.
+	 */
+	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN);
+
 	gi->flags = 0;
 	return &gi->vfs_inode;
 }
 
 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
-	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+	struct gmem_inode *gi = GMEM_I(inode);
+
+	mpol_free_shared_policy(&gi->policy);
+
+	/*
+	 * Note! Checking for an empty tree is functionally necessary
+	 * to avoid explosions if the tree hasn't been fully
+	 * initialized: if the inode is being destroyed before
+	 * guest_memfd can set the external lock, lockdep would find
+	 * that the tree's internal ma_lock was not held.
+	 */
+	if (!mtree_empty(&gi->attributes)) {
+		/*
+		 * Acquire the invalidation lock purely to make lockdep happy;
+		 * the inode is unreachable at this point.
+		 */
+		filemap_invalidate_lock(inode->i_mapping);
+		__mt_destroy(&gi->attributes);
+		filemap_invalidate_unlock(inode->i_mapping);
+	}
 }
 
 static void kvm_gmem_free_inode(struct inode *inode)
-- 
2.53.0.rc1.225.gd81095ad13-goog
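
From userspace, the net effect is that faulting a guest_memfd mapping
is now gated per-index rather than by a file-wide flag. A hypothetical
sketch of the observable behavior (not from this series' selftests;
assumes a guest_memfd fd created elsewhere with
GUEST_MEMFD_FLAG_INIT_SHARED, so all indices start shared):

  #include <stddef.h>
  #include <sys/mman.h>

  /*
   * Touch one byte of a guest_memfd mapping. The write faults in
   * memory while the backing index is shared; once an index has been
   * converted to private, the same access would SIGBUS (the fault
   * handler sees -EACCES from the attribute check).
   */
  static int demo_touch(int gmem_fd, size_t size)
  {
          char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, gmem_fd, 0);

          if (p == MAP_FAILED)
                  return -1;

          p[0] = 1;
          return munmap(p, size);
  }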