From nobody Mon Feb 9 18:59:52 2026
Date: Tue, 3 Feb 2026 19:23:52 +0000
In-Reply-To: <20260203192352.2674184-1-jiaqiyan@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20260203192352.2674184-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.53.0.rc2.204.g2597b5adb4-goog
Message-ID: <20260203192352.2674184-4-jiaqiyan@google.com>
Subject: [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
From: Jiaqi Yan
To: linmiaohe@huawei.com, william.roche@oracle.com, harry.yoo@oracle.com,
 jane.chu@oracle.com
Cc: nao.horiguchi@gmail.com, tony.luck@intel.com, wangkefeng.wang@huawei.com,
 willy@infradead.org, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, ziy@nvidia.com, david@redhat.com,
 dave.hansen@linux.intel.com, muchun.song@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Jiaqi Yan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Document its motivation, userspace API, behaviors, and limitations.

Reviewed-by: Jane Chu
Signed-off-by: Jiaqi Yan
Reviewed-by: William Roche
---
 Documentation/userspace-api/index.rst         |  1 +
 .../userspace-api/mfd_mfr_policy.rst          | 60 +++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a61ac4c1bf19..6d8d94028a6cd 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -68,6 +68,7 @@ Everything else
    futex2
    perf_ring_buffer
    ntsync
+   mfd_mfr_policy
 
 .. only:: subproject and html
 
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..c5a25df39791a
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+    Jiaqi Yan
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory errors (UEs) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible is beneficial to the owning process for a couple of reasons:
+
+- The memory pages affected by UE have a large minimum granularity, for
+  example 1G hugepage, but the actual corrupted amount of the page is only
+  several cachelines. Losing the entire hugepage of data is unacceptable to
+  the application.
+
+- In addition to keeping the data accessible, the application still wants
+  to access with a large page size for the fastest virtual-to-physical
+  translations.
+
+Memory failure recovery for 1G or larger HugeTLB is a good example. With
+memfd, a userspace process can control whether the kernel hard offlines the
+hugepages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+
+    When the ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+    in the kernel does not hard offline memory due to UE until the
+    returned ``memfd`` is released. In other words, the HWPoison-ed memory
+    remains accessible via the returned ``memfd`` or any memory mapping
+    created with it. Note that the affected memory is isolated and
+    prevented from future use as soon as the memfd is closed.
+    By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and the kernel hard
+    offlines memory having UEs.
+
+Notes about the behavior and limitations
+
+- Even if the page affected by UE is kept, a portion of the (huge)page is
+  already lost due to hardware corruption, and the size of the portion
+  is the smallest page size that the kernel uses to manage memory on the
+  architecture, i.e. PAGESIZE. Accessing a virtual address within any such
+  portion results in a SIGBUS; accessing a virtual address outside them
+  is fine until it is corrupted by a new memory error.
+
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
+  Otherwise ``memfd_create`` returns EINVAL.
-- 
2.53.0.rc2.204.g2597b5adb4-goog
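
For illustration only, and not part of the patch: a minimal C sketch of how
an application might request the policy and tolerate the SIGBUS described in
the notes above. The helper names create_keep_ue_memfd() and
read_byte_guarded() are hypothetical. The sketch assumes the installed
headers define MFD_MF_KEEP_UE_MAPPED; its numeric value is not stated in
this document, so when the macro is missing the helper fails with ENOSYS
instead of guessing one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a HugeTLB-backed memfd that asks the kernel
 * to keep UE-affected hugepages mapped. MFD_HUGETLB is passed as well
 * because the documentation says the call returns EINVAL without it.
 */
int create_keep_ue_memfd(const char *name)
{
#ifdef MFD_MF_KEEP_UE_MAPPED
	return memfd_create(name,
			    MFD_CLOEXEC | MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED);
#else
	/* Assumption: headers predate the flag, so do not invent its value. */
	(void)name;
	errno = ENOSYS;
	return -1;
#endif
}

static sigjmp_buf sigbus_jmp;
static volatile sig_atomic_t sigbus_armed;

static void on_sigbus(int sig)
{
	if (sigbus_armed)
		siglongjmp(sigbus_jmp, 1);
	/* Unexpected SIGBUS: restore the default action and re-fault. */
	signal(sig, SIG_DFL);
}

/*
 * Hypothetical helper: read one byte, returning 0 on success or -1 with
 * errno set to EFAULT if the access raised SIGBUS (for example because
 * the address lies in a poisoned portion). Not thread-safe: it uses one
 * static jump buffer and a process-wide handler.
 */
int read_byte_guarded(const volatile unsigned char *addr, unsigned char *out)
{
	struct sigaction sa, old;
	volatile int ok = 0;

	assert(addr != NULL && out != NULL);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	/* savemask=1 so SIGBUS is unblocked again after the long jump. */
	if (sigsetjmp(sigbus_jmp, 1) == 0) {
		sigbus_armed = 1;
		*out = *addr;
		ok = 1;
	}
	sigbus_armed = 0;
	sigaction(SIGBUS, &old, NULL);
	if (!ok)
		errno = EFAULT;
	return ok ? 0 : -1;
}
```

A caller would size the returned fd with ftruncate(), mmap() it, and use
read_byte_guarded() on addresses it suspects are affected. Jumping out of
the handler is only one possible recovery strategy; an application could
equally install an SA_SIGINFO handler and record the faulting address.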