Date: Sun, 16 Nov 2025 01:32:23 +0000
In-Reply-To: <20251116013223.1557158-1-jiaqiyan@google.com>
Message-ID: <20251116013223.1557158-4-jiaqiyan@google.com>
Subject: [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, william.roche@oracle.com,
    harry.yoo@oracle.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
    jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
    rientjes@google.com, duenwen@google.com, jthoughton@google.com,
    jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
    sidhartha.kumar@oracle.com, ziy@nvidia.com, david@redhat.com,
    dave.hansen@linux.intel.com, muchun.song@linux.dev, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Jiaqi Yan

Document its motivation, userspace API, behaviors, and limitations.

Signed-off-by: Jiaqi Yan
Reviewed-by: Jane Chu
---
 Documentation/userspace-api/index.rst          |  1 +
 .../userspace-api/mfd_mfr_policy.rst           | 60 +++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index b8c73be4fb112..d8c6977d9e67a 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -67,6 +67,7 @@ Everything else
    futex2
    perf_ring_buffer
    ntsync
+   mfd_mfr_policy
 
 .. only:: subproject and html
 
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..c5a25df39791a
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+    Jiaqi Yan
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory errors (UE) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible is beneficial to the owning process for a couple of reasons:
+
+- The memory page affected by a UE can be large, for example a 1G
+  hugepage, while the actually corrupted part of the page is only a few
+  cachelines. Losing the entire hugepage of data is unacceptable to
+  the application.
+
+- In addition to keeping the data accessible, the application still wants
+  to access it with a large page size for the fastest virtual-to-physical
+  translations.
+
+Memory failure recovery for 1G or larger HugeTLB is a good example. With
+memfd, a userspace process can control whether the kernel hard offlines
+the hugepages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+
+    When the ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+    in the kernel does not hard offline memory due to UE until the
+    returned ``memfd`` is released. In other words, the HWPoison-ed memory
+    remains accessible via the returned ``memfd`` or the memory mapping
+    created with the returned ``memfd``. Note that the affected memory is
+    immediately isolated and prevented from future use once the memfd
+    is closed. By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and the
+    kernel hard offlines memory having UEs.
+
+Notes about the behavior and limitations:
+
+- Even if the page affected by UE is kept, a portion of the (huge)page is
+  already lost due to hardware corruption, and the size of the portion
+  is the smallest page size that the kernel uses to manage memory on the
+  architecture, i.e. PAGESIZE. Accessing a virtual address within any of
+  these portions results in a SIGBUS; virtual addresses outside these
+  portions remain accessible until corrupted by a new memory error.
+
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
+  Otherwise ``memfd_create`` fails with EINVAL.
-- 
2.52.0.rc1.455.g30608eb744-goog
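
For illustration only, here is a minimal userspace sketch of the flag rule documented above: MFD_MF_KEEP_UE_MAPPED is only accepted together with MFD_HUGETLB. The helper names check_keep_ue_flags() and keep_ue_memfd_create() are hypothetical and not part of this series. The value of MFD_MF_KEEP_UE_MAPPED is not stated in this patch, so the sketch takes it as a parameter instead of guessing it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>	/* memfd_create(), MFD_CLOEXEC, MFD_HUGETLB */

#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U	/* value from <linux/memfd.h>, for old libc headers */
#endif

/*
 * Hypothetical helper: userspace mirror of the documented rule that the
 * keep-UE-mapped bit requires MFD_HUGETLB. keep_ue_flag is ASSUMED to be
 * MFD_MF_KEEP_UE_MAPPED taken from the uapi headers of a kernel carrying
 * this series. Returns 0 if the combination is allowed, or -1 with errno
 * set to EINVAL otherwise.
 */
static int check_keep_ue_flags(unsigned int flags, unsigned int keep_ue_flag)
{
	if ((flags & keep_ue_flag) && !(flags & MFD_HUGETLB)) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}

/*
 * Hypothetical wrapper: reject the invalid combination early, then let
 * memfd_create() do the real work. Returns a file descriptor or -1.
 */
static int keep_ue_memfd_create(const char *name, unsigned int flags,
				unsigned int keep_ue_flag)
{
	assert(name != NULL);

	if (check_keep_ue_flags(flags, keep_ue_flag) < 0)
		return -1;

	return memfd_create(name, flags);
}
```

A caller built against headers from a kernel with this series would presumably pass MFD_CLOEXEC | MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED as flags and MFD_MF_KEEP_UE_MAPPED as keep_ue_flag. The early check only saves a syscall round trip; the kernel remains the authority on which flag combinations are valid.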
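
As a rough sketch of how an application might react to the SIGBUS described in the notes above, the helpers below classify a siginfo_t and compute which portion of the mapping to treat as lost. This ASSUMES the SIGBUS carries the usual Linux memory-failure si_code values (BUS_MCEERR_AR or BUS_MCEERR_AO), which this patch does not state. Both helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the signal looks like a memory-failure
 * SIGBUS, 0 otherwise. ASSUMPTION: the kernel reports such faults with
 * si_code BUS_MCEERR_AR (action required) or BUS_MCEERR_AO (action
 * optional), as it does for ordinary memory failure handling.
 */
static int is_memory_failure_sigbus(const siginfo_t *si)
{
	return si->si_signo == SIGBUS &&
	       (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO);
}

/*
 * Hypothetical helper: round a faulting address down to the start of the
 * lost portion. portion_size is expected to be the base page size, e.g.
 * sysconf(_SC_PAGESIZE), and must be a power of two.
 */
static uintptr_t lost_portion_base(const void *addr, size_t portion_size)
{
	assert(portion_size != 0 && (portion_size & (portion_size - 1)) == 0);

	return (uintptr_t)addr & ~((uintptr_t)portion_size - 1);
}
```

In a handler installed with sigaction() and SA_SIGINFO, an application could call is_memory_failure_sigbus() on the received siginfo_t and, if it matches, pass si_addr to lost_portion_base() to record the range to avoid from then on.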