Soft offlining a HugeTLB page reduces the HugeTLB page pool.
Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
introduced the following sysctl interface to control soft offline:
/proc/sys/vm/enable_soft_offline
The interface does not distinguish between page types:
0 - Soft offline is disabled
1 - Soft offline is enabled
Convert enable_soft_offline to a bitmask and support disabling soft
offline for HugeTLB pages:
Bits:
0 - Enable soft offline
1 - Disable soft offline for HugeTLB pages
Supported values:
0 - Soft offline is disabled
1 - Soft offline is enabled
3 - Soft offline is enabled (disabled for HugeTLB pages)
Existing behavior is preserved.
Update documentation and HugeTLB soft offline self tests.
Reported-by: Shawn Fan <shawn.fan@intel.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
---
Tony's patch:
* https://lore.kernel.org/all/20250904155720.22149-1-tony.luck@intel.com
v1:
* https://lore.kernel.org/all/aMGkAI3zKlVsO0S2@hpe.com
v1 -> v2:
* Make the interface extensible, as suggested by David.
* Preserve existing behavior, as suggested by Jiaqi and David.
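For reference, the resulting policy can be sketched in userspace C. This is only an illustrative model of the check added to soft_offline_page(), not kernel code: soft_offline_policy() and its is_hugetlb parameter are hypothetical names, and the macros mirror the ones added by this patch.

```c
#include <errno.h>
#include <stdbool.h>

#define SOFT_OFFLINE_ENABLED		(1 << 0)
#define SOFT_OFFLINE_SKIP_HUGETLB	(1 << 1)

/*
 * Hypothetical model of the policy: returns 0 if a soft offline request
 * would proceed, or -EOPNOTSUPP if the sysctl value rejects it.
 */
static int soft_offline_policy(int sysctl_value, bool is_hugetlb)
{
	/* Bit 0 clear: soft offline is disabled for every page type. */
	if (!(sysctl_value & SOFT_OFFLINE_ENABLED))
		return -EOPNOTSUPP;

	/* Bit 1 set: HugeTLB pages are skipped, other pages proceed. */
	if ((sysctl_value & SOFT_OFFLINE_SKIP_HUGETLB) && is_hugetlb)
		return -EOPNOTSUPP;

	return 0;
}
```

In this model, values 0 and 1 behave as before and 3 rejects only HugeTLB pages. Value 2 is accepted by the 0-3 range check but is not a listed value; it behaves like 0 because bit 0 is clear.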
Why clear errno in self tests?
madvise() does not modify errno on success, so the errno set by the failing
madvise() in test_soft_offline_common(3) is still present during
test_soft_offline_common(1), causing it to fail:
# Test soft-offline when enabled_soft_offline=1
# Hugepagesize is 1048576kB
# enable_soft_offline => 1
# Before MADV_SOFT_OFFLINE nr_hugepages=7
# Allocated 0x80000000 bytes of hugetlb pages
# MADV_SOFT_OFFLINE 0x7fd600000000 ret=0, errno=95
# MADV_SOFT_OFFLINE should ret 0
# After MADV_SOFT_OFFLINE nr_hugepages=6
not ok 2 Test soft-offline when enabled_soft_offline=1
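The stale errno in the log above can be reproduced with a small model. This is a sketch, not the selftest code: fake_madvise() is a hypothetical stand-in that, like madvise() as described above, sets errno only on failure, and logged_errno() is a hypothetical helper returning the value a test would print.

```c
#include <errno.h>

/* Hypothetical stand-in for madvise(): sets errno only on failure. */
static int fake_madvise(int should_fail)
{
	if (should_fail) {
		errno = EOPNOTSUPP;
		return -1;
	}
	return 0;	/* success: errno is left untouched */
}

/* Returns the errno value a test would log right after the call. */
static int logged_errno(int should_fail, int clear_first)
{
	if (clear_first)
		errno = 0;	/* the fix: reset errno before the call */
	(void)fake_madvise(should_fail);
	return errno;
}
```

Calling logged_errno(1, 0) and then logged_errno(0, 0) reports EOPNOTSUPP both times, even though the second call succeeded. With clear_first set, the successful call reports 0.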
---
.../ABI/testing/sysfs-memory-page-offline | 3 ++
Documentation/admin-guide/sysctl/vm.rst | 28 ++++++++++++++++---
mm/memory-failure.c | 17 +++++++++--
.../selftests/mm/hugetlb-soft-offline.c | 19 ++++++++++---
4 files changed, 56 insertions(+), 11 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-memory-page-offline b/Documentation/ABI/testing/sysfs-memory-page-offline
index 00f4e35f916f..d3f05ed6605e 100644
--- a/Documentation/ABI/testing/sysfs-memory-page-offline
+++ b/Documentation/ABI/testing/sysfs-memory-page-offline
@@ -20,6 +20,9 @@ Description:
number, or a error when the offlining failed. Reading
the file is not allowed.
+ Soft-offline can be controlled via sysctl, see:
+ Documentation/admin-guide/sysctl/vm.rst
+
What: /sys/devices/system/memory/hard_offline_page
Date: Sep 2009
KernelVersion: 2.6.33
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4d71211fdad8..ace73480eb9d 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -309,19 +309,39 @@ physical memory) vs performance / capacity implications in transparent and
HugeTLB cases.
For all architectures, enable_soft_offline controls whether to soft offline
-memory pages. When set to 1, kernel attempts to soft offline the pages
-whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
-the request to soft offline the pages. Its default value is 1.
+memory pages.
+
+enable_soft_offline is a bitmask:
+
+Bits::
+
+ 0 - Enable soft offline
+ 1 - Disable soft offline for HugeTLB pages
+
+Supported values::
+
+ 0 - Soft offline is disabled
+ 1 - Soft offline is enabled
+ 3 - Soft offline is enabled (disabled for HugeTLB pages)
+
+The default value is 1.
+
+If soft offline is disabled for the requested page type, EOPNOTSUPP is returned.
It is worth mentioning that after setting enable_soft_offline to 0, the
following requests to soft offline pages will not be performed:
+- Request to soft offline from sysfs (soft_offline_page).
+
- Request to soft offline pages from RAS Correctable Errors Collector.
-- On ARM, the request to soft offline pages from GHES driver.
+- On ARM and X86, the request to soft offline pages from GHES driver.
- On PARISC, the request to soft offline pages from Page Deallocation Table.
+Note:
+ Soft offlining a HugeTLB page reduces the HugeTLB page pool.
+
extfrag_threshold
=================
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc30ca4804bf..0ad9ae11d9e8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -64,11 +64,14 @@
#include "internal.h"
#include "ras/ras_event.h"
+#define SOFT_OFFLINE_ENABLED BIT(0)
+#define SOFT_OFFLINE_SKIP_HUGETLB BIT(1)
+
static int sysctl_memory_failure_early_kill __read_mostly;
static int sysctl_memory_failure_recovery __read_mostly = 1;
-static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
@@ -150,7 +153,7 @@ static const struct ctl_table memory_failure_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE,
+ .extra2 = SYSCTL_THREE,
}
};
@@ -2799,12 +2802,20 @@ int soft_offline_page(unsigned long pfn, int flags)
return -EIO;
}
- if (!sysctl_enable_soft_offline) {
+ if (!(sysctl_enable_soft_offline & SOFT_OFFLINE_ENABLED)) {
pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
put_ref_page(pfn, flags);
return -EOPNOTSUPP;
}
+ if (sysctl_enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB) {
+ if (folio_test_hugetlb(pfn_folio(pfn))) {
+ pr_info_once("disabled for HugeTLB pages by /proc/sys/vm/enable_soft_offline\n");
+ put_ref_page(pfn, flags);
+ return -EOPNOTSUPP;
+ }
+ }
+
mutex_lock(&mf_mutex);
if (PageHWPoison(page)) {
diff --git a/tools/testing/selftests/mm/hugetlb-soft-offline.c b/tools/testing/selftests/mm/hugetlb-soft-offline.c
index f086f0e04756..b87c8778cadf 100644
--- a/tools/testing/selftests/mm/hugetlb-soft-offline.c
+++ b/tools/testing/selftests/mm/hugetlb-soft-offline.c
@@ -5,6 +5,8 @@
* offlining failed with EOPNOTSUPP.
* - if enable_soft_offline = 1, a hugepage should be dissolved and
* nr_hugepages/free_hugepages should be reduced by 1.
+ * - if enable_soft_offline = 3, hugepages should stay intact and soft
+ * offlining failed with EOPNOTSUPP.
*
* Before running, make sure more than 2 hugepages of default_hugepagesz
* are allocated. For example, if /proc/meminfo/Hugepagesize is 2048kB:
@@ -32,6 +34,9 @@
#define EPREFIX " !!! "
+#define SOFT_OFFLINE_ENABLED (1 << 0)
+#define SOFT_OFFLINE_SKIP_HUGETLB (1 << 1)
+
static int do_soft_offline(int fd, size_t len, int expect_errno)
{
char *filemap = NULL;
@@ -56,6 +61,7 @@ static int do_soft_offline(int fd, size_t len, int expect_errno)
ksft_print_msg("Allocated %#lx bytes of hugetlb pages\n", len);
hwp_addr = filemap + len / 2;
+ errno = 0;
ret = madvise(hwp_addr, pagesize, MADV_SOFT_OFFLINE);
ksft_print_msg("MADV_SOFT_OFFLINE %p ret=%d, errno=%d\n",
hwp_addr, ret, errno);
@@ -83,7 +89,7 @@ static int set_enable_soft_offline(int value)
char cmd[256] = {0};
FILE *cmdfile = NULL;
- if (value != 0 && value != 1)
+ if (value < 0 || value > 3)
return -EINVAL;
sprintf(cmd, "echo %d > /proc/sys/vm/enable_soft_offline", value);
@@ -155,13 +161,17 @@ static int create_hugetlbfs_file(struct statfs *file_stat)
static void test_soft_offline_common(int enable_soft_offline)
{
int fd;
- int expect_errno = enable_soft_offline ? 0 : EOPNOTSUPP;
+ int expect_errno = 0;
struct statfs file_stat;
unsigned long hugepagesize_kb = 0;
unsigned long nr_hugepages_before = 0;
unsigned long nr_hugepages_after = 0;
int ret;
+ if (!(enable_soft_offline & SOFT_OFFLINE_ENABLED) ||
+ (enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB))
+ expect_errno = EOPNOTSUPP;
+
ksft_print_msg("Test soft-offline when enabled_soft_offline=%d\n",
enable_soft_offline);
@@ -198,7 +208,7 @@ static void test_soft_offline_common(int enable_soft_offline)
// No need for the hugetlbfs file from now on.
close(fd);
- if (enable_soft_offline) {
+ if (expect_errno == 0) {
if (nr_hugepages_before != nr_hugepages_after + 1) {
ksft_test_result_fail("MADV_SOFT_OFFLINE should reduced 1 hugepage\n");
return;
@@ -219,8 +229,9 @@ static void test_soft_offline_common(int enable_soft_offline)
int main(int argc, char **argv)
{
ksft_print_header();
- ksft_set_plan(2);
+ ksft_set_plan(3);
+ test_soft_offline_common(3);
test_soft_offline_common(1);
test_soft_offline_common(0);
--
2.51.0
On 16/09/25 5:57 AM, Kyle Meyer wrote:
> Soft offlining a HugeTLB page reduces the HugeTLB page pool.
>
> Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
> introduced the following sysctl interface to control soft offline:
>
> /proc/sys/vm/enable_soft_offline
>
> The interface does not distinguish between page types:
>
> 0 - Soft offline is disabled
> 1 - Soft offline is enabled
>
> Convert enable_soft_offline to a bitmask and support disabling soft
> offline for HugeTLB pages:
>
> Bits:
>
> 0 - Enable soft offline
> 1 - Disable soft offline for HugeTLB pages
>
> Supported values:
>
> 0 - Soft offline is disabled
> 1 - Soft offline is enabled
> 3 - Soft offline is enabled (disabled for HugeTLB pages)
>
> Existing behavior is preserved.
>
> Update documentation and HugeTLB soft offline self tests.
>
> Reported-by: Shawn Fan <shawn.fan@intel.com>
> Suggested-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> ---
>
> Tony's patch:
> * https://lore.kernel.org/all/20250904155720.22149-1-tony.luck@intel.com
>
> v1:
> * https://lore.kernel.org/all/aMGkAI3zKlVsO0S2@hpe.com
>
> v1 -> v2:
> * Make the interface extensible, as suggested by David.
> * Preserve existing behavior, as suggested by Jiaqi and David.
>
> Why clear errno in self tests?
>
> madvise() does not set errno when it's successful and errno is set by madvise()
> during test_soft_offline_common(3) causing test_soft_offline_common(1) to fail:
>
> # Test soft-offline when enabled_soft_offline=1
> # Hugepagesize is 1048576kB
> # enable_soft_offline => 1
> # Before MADV_SOFT_OFFLINE nr_hugepages=7
> # Allocated 0x80000000 bytes of hugetlb pages
> # MADV_SOFT_OFFLINE 0x7fd600000000 ret=0, errno=95
> # MADV_SOFT_OFFLINE should ret 0
> # After MADV_SOFT_OFFLINE nr_hugepages=6
> not ok 2 Test soft-offline when enabled_soft_offline=1
>
> ---
> .../ABI/testing/sysfs-memory-page-offline | 3 ++
> Documentation/admin-guide/sysctl/vm.rst | 28 ++++++++++++++++---
> mm/memory-failure.c | 17 +++++++++--
> .../selftests/mm/hugetlb-soft-offline.c | 19 ++++++++++---
> 4 files changed, 56 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-memory-page-offline b/Documentation/ABI/testing/sysfs-memory-page-offline
> index 00f4e35f916f..d3f05ed6605e 100644
> --- a/Documentation/ABI/testing/sysfs-memory-page-offline
> +++ b/Documentation/ABI/testing/sysfs-memory-page-offline
> @@ -20,6 +20,9 @@ Description:
> number, or a error when the offlining failed. Reading
> the file is not allowed.
>
> + Soft-offline can be controlled via sysctl, see:
> + Documentation/admin-guide/sysctl/vm.rst
> +

This update is applicable right away without other changes proposed.
Probably can be moved into a separate patch in itself ?

> What: /sys/devices/system/memory/hard_offline_page
> Date: Sep 2009
> KernelVersion: 2.6.33
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4d71211fdad8..ace73480eb9d 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -309,19 +309,39 @@ physical memory) vs performance / capacity implications in transparent and
> HugeTLB cases.
>
> For all architectures, enable_soft_offline controls whether to soft offline
> -memory pages. When set to 1, kernel attempts to soft offline the pages
> -whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
> -the request to soft offline the pages. Its default value is 1.
> +memory pages.
> +
> +enable_soft_offline is a bitmask:
> +
> +Bits::
> +
> + 0 - Enable soft offline
> + 1 - Disable soft offline for HugeTLB pages
> +
> +Supported values::
> +
> + 0 - Soft offline is disabled
> + 1 - Soft offline is enabled
> + 3 - Soft offline is enabled (disabled for HugeTLB pages)

This looks very adhoc even though existing behavior is preserved.

- Are HugeTLB pages the only page types to be considered ?
- How the remaining bits here are going to be used later ?

Also without a bit-wise usage roadmap, is not changing a procfs
interface (ABI) bit problematic ?

> +
> +The default value is 1.
> +
> +If soft offline is disabled for the requested page type, EOPNOTSUPP is returned.
>
> It is worth mentioning that after setting enable_soft_offline to 0, the
> following requests to soft offline pages will not be performed:
>
> +- Request to soft offline from sysfs (soft_offline_page).
> +
> - Request to soft offline pages from RAS Correctable Errors Collector.
>
> -- On ARM, the request to soft offline pages from GHES driver.
> +- On ARM and X86, the request to soft offline pages from GHES driver.
>
> - On PARISC, the request to soft offline pages from Page Deallocation Table.
>
> +Note:
> + Soft offlining a HugeTLB page reduces the HugeTLB page pool.
> +
> extfrag_threshold
> =================
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc30ca4804bf..0ad9ae11d9e8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -64,11 +64,14 @@
> #include "internal.h"
> #include "ras/ras_event.h"
>
> +#define SOFT_OFFLINE_ENABLED BIT(0)
> +#define SOFT_OFFLINE_SKIP_HUGETLB BIT(1)
> +
> static int sysctl_memory_failure_early_kill __read_mostly;
>
> static int sysctl_memory_failure_recovery __read_mostly = 1;
>
> -static int sysctl_enable_soft_offline __read_mostly = 1;
> +static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED;
>
> atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> @@ -150,7 +153,7 @@ static const struct ctl_table memory_failure_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> - .extra2 = SYSCTL_ONE,
> + .extra2 = SYSCTL_THREE,
> }
> };
>
> @@ -2799,12 +2802,20 @@ int soft_offline_page(unsigned long pfn, int flags)
> return -EIO;
> }
>
> - if (!sysctl_enable_soft_offline) {
> + if (!(sysctl_enable_soft_offline & SOFT_OFFLINE_ENABLED)) {
> pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
> put_ref_page(pfn, flags);
> return -EOPNOTSUPP;
> }
>
> + if (sysctl_enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB) {
> + if (folio_test_hugetlb(pfn_folio(pfn))) {
> + pr_info_once("disabled for HugeTLB pages by /proc/sys/vm/enable_soft_offline\n");
> + put_ref_page(pfn, flags);
> + return -EOPNOTSUPP;
> + }
> + }
> +
> mutex_lock(&mf_mutex);
>
> if (PageHWPoison(page)) {
> diff --git a/tools/testing/selftests/mm/hugetlb-soft-offline.c b/tools/testing/selftests/mm/hugetlb-soft-offline.c
> index f086f0e04756..b87c8778cadf 100644
> --- a/tools/testing/selftests/mm/hugetlb-soft-offline.c
> +++ b/tools/testing/selftests/mm/hugetlb-soft-offline.c
> @@ -5,6 +5,8 @@
> * offlining failed with EOPNOTSUPP.
> * - if enable_soft_offline = 1, a hugepage should be dissolved and
> * nr_hugepages/free_hugepages should be reduced by 1.
> + * - if enable_soft_offline = 3, hugepages should stay intact and soft
> + * offlining failed with EOPNOTSUPP.
> *
> * Before running, make sure more than 2 hugepages of default_hugepagesz
> * are allocated. For example, if /proc/meminfo/Hugepagesize is 2048kB:
> @@ -32,6 +34,9 @@
>
> #define EPREFIX " !!! "
>
> +#define SOFT_OFFLINE_ENABLED (1 << 0)
> +#define SOFT_OFFLINE_SKIP_HUGETLB (1 << 1)
> +
> static int do_soft_offline(int fd, size_t len, int expect_errno)
> {
> char *filemap = NULL;
> @@ -56,6 +61,7 @@ static int do_soft_offline(int fd, size_t len, int expect_errno)
> ksft_print_msg("Allocated %#lx bytes of hugetlb pages\n", len);
>
> hwp_addr = filemap + len / 2;
> + errno = 0;
> ret = madvise(hwp_addr, pagesize, MADV_SOFT_OFFLINE);
> ksft_print_msg("MADV_SOFT_OFFLINE %p ret=%d, errno=%d\n",
> hwp_addr, ret, errno);
> @@ -83,7 +89,7 @@ static int set_enable_soft_offline(int value)
> char cmd[256] = {0};
> FILE *cmdfile = NULL;
>
> - if (value != 0 && value != 1)
> + if (value < 0 || value > 3)
> return -EINVAL;
>
> sprintf(cmd, "echo %d > /proc/sys/vm/enable_soft_offline", value);
> @@ -155,13 +161,17 @@ static int create_hugetlbfs_file(struct statfs *file_stat)
> static void test_soft_offline_common(int enable_soft_offline)
> {
> int fd;
> - int expect_errno = enable_soft_offline ? 0 : EOPNOTSUPP;
> + int expect_errno = 0;
> struct statfs file_stat;
> unsigned long hugepagesize_kb = 0;
> unsigned long nr_hugepages_before = 0;
> unsigned long nr_hugepages_after = 0;
> int ret;
>
> + if (!(enable_soft_offline & SOFT_OFFLINE_ENABLED) ||
> + (enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB))
> + expect_errno = EOPNOTSUPP;
> +
> ksft_print_msg("Test soft-offline when enabled_soft_offline=%d\n",
> enable_soft_offline);
>
> @@ -198,7 +208,7 @@ static void test_soft_offline_common(int enable_soft_offline)
> // No need for the hugetlbfs file from now on.
> close(fd);
>
> - if (enable_soft_offline) {
> + if (expect_errno == 0) {
> if (nr_hugepages_before != nr_hugepages_after + 1) {
> ksft_test_result_fail("MADV_SOFT_OFFLINE should reduced 1 hugepage\n");
> return;
> @@ -219,8 +229,9 @@ static void test_soft_offline_common(int enable_soft_offline)
> int main(int argc, char **argv)
> {
> ksft_print_header();
> - ksft_set_plan(2);
> + ksft_set_plan(3);
>
> + test_soft_offline_common(3);
> test_soft_offline_common(1);
> test_soft_offline_common(0);
>
>> +
>> + 0 - Enable soft offline
>> + 1 - Disable soft offline for HugeTLB pages
>> +
>> +Supported values::
>> +
>> + 0 - Soft offline is disabled
>> + 1 - Soft offline is enabled
>> + 3 - Soft offline is enabled (disabled for HugeTLB pages)
>
> This looks very adhoc even though existing behavior is preserved.
>
> - Are HugeTLB pages the only page types to be considered ?
> - How the remaining bits here are going to be used later ?
>

What I proposed (that could be better documented here) is that all other
bits except the first one will be a disable mask when bit 0 is set.

2 - ... but yet disabled for hugetlb
4 - ... but yet disabled for $WHATEVER
8 - ... but yet disabled for $WHATEVERELSE

> Also without a bit-wise usage roadmap, is not changing a procfs
> interface (ABI) bit problematic ?

For now we failed setting it to values that are neither 0 or 1, IIUC
set_enable_soft_offline() correctly?

So there should not be any problem, or which scenario do you have in mind?

--
Cheers

David / dhildenb
On Wed, Sep 17, 2025 at 09:02:55AM +0200, David Hildenbrand wrote:
>
> > > +
> > > + 0 - Enable soft offline
> > > + 1 - Disable soft offline for HugeTLB pages
> > > +
> > > +Supported values::
> > > +
> > > + 0 - Soft offline is disabled
> > > + 1 - Soft offline is enabled
> > > + 3 - Soft offline is enabled (disabled for HugeTLB pages)
> >
> > This looks very adhoc even though existing behavior is preserved.
> >
> > - Are HugeTLB pages the only page types to be considered ?
> > - How the remaining bits here are going to be used later ?
> >
>
> What I proposed (that could be better documented here) is that all other
> bits except the first one will be a disable mask when bit 0 is set.
>
> 2 - ... but yet disabled for hugetlb
> 4 - ... but yet disabled for $WHATEVER
> 8 - ... but yet disabled for $WHATEVERELSE
>
> > Also without a bit-wise usage roadmap, is not changing a procfs
> > interface (ABI) bit problematic ?
>
> For now we failed setting it to values that are neither 0 or 1, IIUC
> set_enable_soft_offline() correctly?

Yes, -EINVAL will be returned.

> So there should not be any problem, or which scenario do you have in mind?

Here's an alternative approach.

Do not modify the existing sysctl parameter:

/proc/sys/vm/enable_soft_offline

0 - Soft offline is disabled
1 - Soft offline is enabled

Instead, introduce a new sysctl parameter:

/proc/sys/vm/enable_soft_offline_hugetlb

0 - Soft offline is disabled for HugeTLB pages
1 - Soft offline is enabled for HugeTLB pages

and note in documentation that this setting only takes effect if
enable_soft_offline is enabled.

Anshuman (and David), would you prefer this?

Thanks,
Kyle Meyer
On 17.09.25 20:51, Kyle Meyer wrote:
> On Wed, Sep 17, 2025 at 09:02:55AM +0200, David Hildenbrand wrote:
>>
>>>> +
>>>> + 0 - Enable soft offline
>>>> + 1 - Disable soft offline for HugeTLB pages
>>>> +
>>>> +Supported values::
>>>> +
>>>> + 0 - Soft offline is disabled
>>>> + 1 - Soft offline is enabled
>>>> + 3 - Soft offline is enabled (disabled for HugeTLB pages)
>>>
>>> This looks very adhoc even though existing behavior is preserved.
>>>
>>> - Are HugeTLB pages the only page types to be considered ?
>>> - How the remaining bits here are going to be used later ?
>>>
>>
>> What I proposed (that could be better documented here) is that all other
>> bits except the first one will be a disable mask when bit 0 is set.
>>
>> 2 - ... but yet disabled for hugetlb
>> 4 - ... but yet disabled for $WHATEVER
>> 8 - ... but yet disabled for $WHATEVERELSE
>>
>>> Also without a bit-wise usage roadmap, is not changing a procfs
>>> interface (ABI) bit problematic ?
>>
>> For now we failed setting it to values that are neither 0 or 1, IIUC
>> set_enable_soft_offline() correctly?
>
> Yes, -EINVAL will be returned.
>
>> So there should not be any problem, or which scenario do you have in mind?
>
> Here's an alternative approach.
>
> Do not modify the existing sysctl parameter:
>
> /proc/sys/vm/enable_soft_offline
>
> 0 - Soft offline is disabled
> 1 - Soft offline is enabled
>
> Instead, introduce a new sysctl parameter:
>
> /proc/sys/vm/enable_soft_offline_hugetlb
>
> 0 - Soft offline is disabled for HugeTLB pages
> 1 - Soft offline is enabled for HugeTLB pages
>
> and note in documentation that this setting only takes effect if
> enable_soft_offline is enabled.
>
> Anshuman (and David), would you prefer this?

Hmm, at least I don't particularly like that. For each new exception we
would create a new file, and the file has weird semantics such that it
has no meaning when enable_soft_offline=0.

--
Cheers

David / dhildenb
On 18/09/25 12:35 AM, David Hildenbrand wrote:
> On 17.09.25 20:51, Kyle Meyer wrote:
>> On Wed, Sep 17, 2025 at 09:02:55AM +0200, David Hildenbrand wrote:
>>>
>>>>> +
>>>>> + 0 - Enable soft offline
>>>>> + 1 - Disable soft offline for HugeTLB pages
>>>>> +
>>>>> +Supported values::
>>>>> +
>>>>> + 0 - Soft offline is disabled
>>>>> + 1 - Soft offline is enabled
>>>>> + 3 - Soft offline is enabled (disabled for HugeTLB pages)
>>>>
>>>> This looks very adhoc even though existing behavior is preserved.
>>>>
>>>> - Are HugeTLB pages the only page types to be considered ?
>>>> - How the remaining bits here are going to be used later ?
>>>>
>>>
>>> What I proposed (that could be better documented here) is that all other
>>> bits except the first one will be a disable mask when bit 0 is set.
>>>
>>> 2 - ... but yet disabled for hugetlb
>>> 4 - ... but yet disabled for $WHATEVER
>>> 8 - ... but yet disabled for $WHATEVERELSE
>>>
>>>> Also without a bit-wise usage roadmap, is not changing a procfs
>>>> interface (ABI) bit problematic ?
>>>
>>> For now we failed setting it to values that are neither 0 or 1, IIUC
>>> set_enable_soft_offline() correctly?
>>
>> Yes, -EINVAL will be returned.
>>
>>> So there should not be any problem, or which scenario do you have in mind?
>>
>> Here's an alternative approach.
>>
>> Do not modify the existing sysctl parameter:
>>
>> /proc/sys/vm/enable_soft_offline
>>
>> 0 - Soft offline is disabled
>> 1 - Soft offline is enabled
>>
>> Instead, introduce a new sysctl parameter:
>>
>> /proc/sys/vm/enable_soft_offline_hugetlb
>>
>> 0 - Soft offline is disabled for HugeTLB pages
>> 1 - Soft offline is enabled for HugeTLB pages
>>
>> and note in documentation that this setting only takes effect if
>> enable_soft_offline is enabled.
>>
>> Anshuman (and David), would you prefer this?
>
> Hmm, at least I don't particularly like that. For each new exception we would create a new file, and the file has weird semantics such that it has no meaning when enable_soft_offline=0.

Agree with David here. Adding a new procfs file for a particular page type's
soft offline disable scenario does not really make sense. This will extend
the ABI unnecessarily without adding much benefit.
On Wed, Sep 17, 2025 at 12:05 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.09.25 20:51, Kyle Meyer wrote:
> > On Wed, Sep 17, 2025 at 09:02:55AM +0200, David Hildenbrand wrote:
> >>
> >>>> +
> >>>> + 0 - Enable soft offline
> >>>> + 1 - Disable soft offline for HugeTLB pages
> >>>> +
> >>>> +Supported values::
> >>>> +
> >>>> + 0 - Soft offline is disabled
> >>>> + 1 - Soft offline is enabled
> >>>> + 3 - Soft offline is enabled (disabled for HugeTLB pages)
> >>>
> >>> This looks very adhoc even though existing behavior is preserved.
> >>>
> >>> - Are HugeTLB pages the only page types to be considered ?
> >>> - How the remaining bits here are going to be used later ?
> >>>
> >>
> >> What I proposed (that could be better documented here) is that all other
> >> bits except the first one will be a disable mask when bit 0 is set.
> >>
> >> 2 - ... but yet disabled for hugetlb
> >> 4 - ... but yet disabled for $WHATEVER
> >> 8 - ... but yet disabled for $WHATEVERELSE
> >>
> >>> Also without a bit-wise usage roadmap, is not changing a procfs
> >>> interface (ABI) bit problematic ?
> >>
> >> For now we failed setting it to values that are neither 0 or 1, IIUC
> >> set_enable_soft_offline() correctly?
> >
> > Yes, -EINVAL will be returned.
> >
> >> So there should not be any problem, or which scenario do you have in mind?
> >
> > Here's an alternative approach.
> >
> > Do not modify the existing sysctl parameter:
> >
> > /proc/sys/vm/enable_soft_offline
> >
> > 0 - Soft offline is disabled
> > 1 - Soft offline is enabled
> >
> > Instead, introduce a new sysctl parameter:
> >
> > /proc/sys/vm/enable_soft_offline_hugetlb
> >
> > 0 - Soft offline is disabled for HugeTLB pages
> > 1 - Soft offline is enabled for HugeTLB pages
> >
> > and note in documentation that this setting only takes effect if
> > enable_soft_offline is enabled.
> >
> > Anshuman (and David), would you prefer this?
>
> Hmm, at least I don't particularly like that. For each new exception we

+1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
prefer a compact userspace API.

> would create a new file, and the file has weird semantics such that it
> has no meaning when enable_soft_offline=0.
>
> --
> Cheers
>
> David / dhildenb
>
On Wed, Sep 17, 2025 at 12:32:47PM -0700, Jiaqi Yan wrote:
> +1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
> prefer a compact userspace API.
>
> > would create a new file, and the file has weird semantics such that it
> > has no meaning when enable_soft_offline=0.

So the expand the bitmask idea from earlier in this thread?

Bit0   0 = soft offline disabled. 1 = Enabled (but see other bits)
Bit1   0 = allow offline of 4K pages, 1 = suppress 4K offline
Bit2   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
Bit3   0 = allow breakup of transparent huge pages to just offline 4K, 1 = suppress transparent breakup
Bit4+  Reserved for suppressing other page types we invent in the future

Values 0 and 1 keep their original meaning.

Value 5 means: offline 4K, keep hugetlb, breakup transparent huge pages.

-Tony
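A minimal sketch of how the layout proposed above could be decoded, to make the "value 5" example concrete. This models the proposal only, not the applied patch (which defines just bits 0 and 1); the macro names, struct so_policy, and decode_soft_offline() are all hypothetical.

```c
#include <stdbool.h>

/* Bit positions from the proposal above; the macro names are hypothetical. */
#define SO_ENABLED		(1u << 0)
#define SO_SUPPRESS_4K		(1u << 1)
#define SO_SUPPRESS_HUGETLB	(1u << 2)
#define SO_SUPPRESS_THP_SPLIT	(1u << 3)

/* What soft offline would be allowed to do under a given value. */
struct so_policy {
	bool offline_4k;
	bool offline_hugetlb;
	bool split_thp;
};

static struct so_policy decode_soft_offline(unsigned int value)
{
	/* Bit 0 gates everything; the other bits only suppress. */
	bool enabled = (value & SO_ENABLED) != 0;
	struct so_policy p = {
		.offline_4k	 = enabled && !(value & SO_SUPPRESS_4K),
		.offline_hugetlb = enabled && !(value & SO_SUPPRESS_HUGETLB),
		.split_thp	 = enabled && !(value & SO_SUPPRESS_THP_SPLIT),
	};

	return p;
}
```

Decoding 5 (bits 0 and 2) yields offline_4k and split_thp set with offline_hugetlb clear. Decoding 0 clears everything and decoding 1 allows everything, so both keep their original meaning.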
On 18/09/25 1:24 AM, Luck, Tony wrote:
> On Wed, Sep 17, 2025 at 12:32:47PM -0700, Jiaqi Yan wrote:
>> +1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
>> prefer a compact userspace API.
>>
>>> would create a new file, and the file has weird semantics such that it
>>> has no meaning when enable_soft_offline=0.
>
> So the expand the bitmask idea from earlier in this thread?
>
> Bit0   0 = soft offline disabled. 1 = Enabled (but see other bits)
> Bit1   0 = allow offline of 4K pages, 1 = suppress 4K offline
> Bit2   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
> Bit3   0 = allow breakup of transparent huge pages to just offline 4K, 1 = suppress transparent breakup
> Bit4+  Reserved for suppressing other page types we invent in the future
>
> Values 0 and 1 keep their original meaning.
>
> Value 5 means: offline 4K, keep hugetlb, breakup transparent huge pages.

This disable bitmask (but when generally enabled via bit[0] = 1) method
seems much better. But I am not sure about page size being a valid page
type classification though. Just to start with, defining first two bits
in this bitmask should be good enough, which will atleast help document
and validate this new interface properly.

Bit1   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
Bit2   0 = allow breakup of transparent huge pages to just offline base pages, 1 = suppress transparent breakup
Bit3+  Reserved for suppressing other page types we invent in the future
On Sun, Sep 21, 2025 at 05:06:31PM +0530, Anshuman Khandual wrote:
>
>
> On 18/09/25 1:24 AM, Luck, Tony wrote:
> > On Wed, Sep 17, 2025 at 12:32:47PM -0700, Jiaqi Yan wrote:
> >> +1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
> >> prefer a compact userspace API.
> >>
> >>> would create a new file, and the file has weird semantics such that it
> >>> has no meaning when enable_soft_offline=0.
> >
> > So the expand the bitmask idea from earlier in this thread?
> >
> > Bit0   0 = soft offline disabled. 1 = Enabled (but see other bits)
> > Bit1   0 = allow offline of 4K pages, 1 = suppress 4K offline
> > Bit2   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
> > Bit3   0 = allow breakup of transparent huge pages to just offline 4K, 1 = suppress transparent breakup
> > Bit4+  Reserved for suppressing other page types we invent in the future
> >
> > Values 0 and 1 keep their original meaning.
> >
> > Value 5 means: offline 4K, keep hugetlb, breakup transparent huge pages.
>
> This disable bitmask (but when generally enabled via bit[0] = 1) method
> seems much better. But I am not sure about page size being a valid page
> type classification though. Just to start with, defining first two bits
> in this bitmask should be good enough, which will atleast help document
> and validate this new interface properly.
>
> Bit1   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
> Bit2   0 = allow breakup of transparent huge pages to just offline base pages, 1 = suppress transparent breakup
> Bit3+  Reserved for suppressing other page types we invent in the future

The current patch is already applied to mm-git and supports the following
bits:

0 - Enable soft offline
1 - Disable soft offline for HugeTLB pages

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=9ae6eefa4b6bd3c3e7ef417a6507dce4b55101b4

Are any immediate changes needed?

Support for additional page types, such as transparent huge pages, can be
added later as needed.

Thanks,
Kyle Meyer
On Wed, Sep 17, 2025 at 12:54:09PM -0700, Luck, Tony wrote:
> On Wed, Sep 17, 2025 at 12:32:47PM -0700, Jiaqi Yan wrote:
> > +1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
> > prefer a compact userspace API.
> >
> > > would create a new file, and the file has weird semantics such that it
> > > has no meaning when enable_soft_offline=0.
>
> So the expand the bitmask idea from earlier in this thread?
>
> Bit0   0 = soft offline disabled. 1 = Enabled (but see other bits)
> Bit1   0 = allow offline of 4K pages, 1 = suppress 4K offline
> Bit2   0 = allow offline of hugetlb, 1 = suppress hugetlb offline
> Bit3   0 = allow breakup of transparent huge pages to just offline 4K, 1 = suppress transparent breakup
> Bit4+  Reserved for suppressing other page types we invent in the future
>
> Values 0 and 1 keep their original meaning.
>
> Value 5 means: offline 4K, keep hugetlb, breakup transparent huge pages.

Do you happen to have any use cases or reasoning for why someone might want
to disable soft offline for 4K pages or transparent huge pages? I'd like to
understand the motivation for adding the extra bits.

Thanks,
Kyle Meyer
On Wed, Sep 17, 2025 at 2:39 PM Kyle Meyer <kyle.meyer@hpe.com> wrote:
>
> On Wed, Sep 17, 2025 at 12:54:09PM -0700, Luck, Tony wrote:
> > On Wed, Sep 17, 2025 at 12:32:47PM -0700, Jiaqi Yan wrote:
> > > +1. Given /proc/sys/vm/enable_soft_offline is extensible, I would
> > > prefer a compact userspace API.
> > >
> > > > would create a new file, and the file has weird semantics such that it
> > > > has no meaning when enable_soft_offline=0.
> >
> > So the expand the bitmask idea from earlier in this thread?
> >
> > Bit0	0 = soft offline disabled. 1 = Enabled (but see other bits)
> > Bit1	0 = allow offline of 4K pages, 1 = suppress 4K offline
> > Bit2	0 = allow offline of hugetlb, 1 = suppress hugetlb offline
> > Bit3	0 = allow breakup of transparent huge pages to just offline 4K, 1 = suppress transparent breakup
> > Bit4+	Reserved for suppressing other page types we invent in the future
> >
> > Values 0 and 1 keep their original meaning.
> >
> > Value 5 means: offline 4K, keep hugetlb, breakup transparent huge pages.
>
> Do you happen to have any use cases or reasoning for why someone might want
> to disable soft offline for 4K pages or transparent huge pages? I'd like to
> understand the motivation for adding the extra bits.

Not sure if this makes sense, but here is something I can think of: one may
really not want the performance impact, since THPs will be split and both
THP and 4K pages will be migrated, and may even be willing to defragment
with 4K pages that have corrected errors?

>
> Thanks,
> Kyle Meyer
On Mon, 15 Sep 2025 19:27:41 -0500 Kyle Meyer <kyle.meyer@hpe.com> wrote:

> Soft offlining a HugeTLB page reduces the HugeTLB page pool.
>
> Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
> introduced the following sysctl interface to control soft offline:
>
> /proc/sys/vm/enable_soft_offline
>
> The interface does not distinguish between page types:
>
> 0 - Soft offline is disabled
> 1 - Soft offline is enabled
>
> Convert enable_soft_offline to a bitmask and support disabling soft
> offline for HugeTLB pages:
>
> Bits:
>
> 0 - Enable soft offline
> 1 - Disable soft offline for HugeTLB pages
>
> Supported values:
>
> 0 - Soft offline is disabled
> 1 - Soft offline is enabled
> 3 - Soft offline is enabled (disabled for HugeTLB pages)
>
> Existing behavior is preserved.

um, why? What benefit does this patch provide to our users?
Use-cases, before-and-after scenarios, etc?

> Update documentation and HugeTLB soft offline self tests.
>
> Reported-by: Shawn Fan <shawn.fan@intel.com>

Interesting. What did Shawn report? (Closes:!).

> Suggested-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
>
> ...
>
>  .../ABI/testing/sysfs-memory-page-offline     |  3 ++
>  Documentation/admin-guide/sysctl/vm.rst       | 28 ++++++++++++++++---
>  mm/memory-failure.c                           | 17 +++++++++--
>  .../selftests/mm/hugetlb-soft-offline.c       | 19 ++++++++++---
>  4 files changed, 56 insertions(+), 11 deletions(-)

I'll add it because testing, but please do explain why I added it?
On Mon, Sep 15, 2025 at 08:16:18PM -0700, Andrew Morton wrote:
> On Mon, 15 Sep 2025 19:27:41 -0500 Kyle Meyer <kyle.meyer@hpe.com> wrote:
>
> > Soft offlining a HugeTLB page reduces the HugeTLB page pool.
> >
> > Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
> > introduced the following sysctl interface to control soft offline:
> >
> > /proc/sys/vm/enable_soft_offline
> >
> > The interface does not distinguish between page types:
> >
> > 0 - Soft offline is disabled
> > 1 - Soft offline is enabled
> >
> > Convert enable_soft_offline to a bitmask and support disabling soft
> > offline for HugeTLB pages:
> >
> > Bits:
> >
> > 0 - Enable soft offline
> > 1 - Disable soft offline for HugeTLB pages
> >
> > Supported values:
> >
> > 0 - Soft offline is disabled
> > 1 - Soft offline is enabled
> > 3 - Soft offline is enabled (disabled for HugeTLB pages)
> >
> > Existing behavior is preserved.
>
> um, why? What benefit does this patch provide to our users?
> Use-cases, before-and-after scenarios, etc?

Thank you for the feedback.

Some BIOS suppress ("cloak") corrected memory errors until a threshold is
reached. Once that threshold is reached, BIOS reports a CPER with the "error
threshold exceeded" bit set via GHES and the corresponding page is soft
offlined.

BIOS does not know the page type of the corresponding page. If the
corresponding page happens to be a HugeTLB page, it will be dissolved,
permanently reducing the HugeTLB page pool. This can be problematic for
workloads that depend on a fixed number of HugeTLB pages.

Currently, soft offline must be disabled to prevent HugeTLB pages from being
soft offlined.

This patch provides a middle ground. Soft offline can be disabled for HugeTLB
pages while remaining enabled for non-HugeTLB pages, preserving the benefits
of soft offline without the risk of BIOS soft offlining HugeTLB pages.

> > Update documentation and HugeTLB soft offline self tests.
> >
> > Reported-by: Shawn Fan <shawn.fan@intel.com>
>
> Interesting. What did Shawn report? (Closes:!).

Tony or Shawn, could you please point me to the original report? Thanks!

> > Suggested-by: Tony Luck <tony.luck@intel.com>
> > Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> >
> > ...
> >
> >  .../ABI/testing/sysfs-memory-page-offline     |  3 ++
> >  Documentation/admin-guide/sysctl/vm.rst       | 28 ++++++++++++++++---
> >  mm/memory-failure.c                           | 17 +++++++++--
> >  .../selftests/mm/hugetlb-soft-offline.c       | 19 ++++++++++---
> >  4 files changed, 56 insertions(+), 11 deletions(-)
>
> I'll add it because testing, but please do explain why I added it?

Thanks,
Kyle Meyer
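
Editorial illustration (not part of the thread): the "middle ground" described above would be selected from userspace by writing the value 3 to /proc/sys/vm/enable_soft_offline. A minimal sketch in C follows; write_sysctl_value() is a hypothetical helper name, and the path is passed in as a parameter so the function can be exercised against an ordinary file.

```c
#include <stdio.h>

/*
 * Write a decimal value to a sysctl-style file.
 * Returns 0 on success, -1 on any failure.
 * write_sysctl_value() is a hypothetical helper, not a kernel or libc API.
 */
static int write_sysctl_value(const char *path, unsigned int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	if (fprintf(f, "%u\n", value) < 0) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes the buffer, so a failed write can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A call such as write_sysctl_value("/proc/sys/vm/enable_soft_offline", 3) would need sufficient privileges and a kernel carrying this patch; on other kernels the write may be rejected.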
>> > Reported-by: Shawn Fan <shawn.fan@intel.com>
>>
>> Interesting. What did Shawn report? (Closes:!).
>
> Tony or Shawn, could you please point me to the original report? Thanks!

Original report is internal to Intel, so no useful link for the community (but
I still wanted to give credit).

Recap of original problem is that some BIOS keep track of error threshold
per-rank and use this GHES mechanism to report threshold exceeded on
the rank.

Systems that stay up a long time can accumulate enough soft errors
to trigger this threshold. But the action of taking a page offline isn't
going to help. For a 4K page this is merely annoying. For 1G page
it can mess things up badly.

My original patch for this just skipped the GHES->offline process
for huge pages. But I wasn't aware of the sysctl control. That provides
a better solution.

-Tony
On Tue, Sep 16, 2025 at 03:20:49PM +0000, Luck, Tony wrote:
> >> > Reported-by: Shawn Fan <shawn.fan@intel.com>
> >>
> >> Interesting. What did Shawn report? (Closes:!).
> >
> > Tony or Shawn, could you please point me to the original report? Thanks!
>
> Original report is internal to Intel, so no useful link for the community (but
> I still wanted to give credit).
>
> Recap of original problem is that some BIOS keep track of error threshold
> per-rank and use this GHES mechanism to report threshold exceeded on
> the rank.
>
> Systems that stay up a long time can accumulate enough soft errors
> to trigger this threshold. But the action of taking a page offline isn't
> going to help. For a 4K page this is merely annoying. For 1G page
> it can mess things up badly.
>
> My original patch for this just skipped the GHES->offline process
> for huge pages. But I wasn't aware of the sysctl control. That provides
> a better solution.

Tony, does that mean you're OK with using the existing sysctl interface? If
so, I'll just send a separate patch to update the sysfs-memory-page-offline
documentation and drop the rest.

Thanks,
Kyle Meyer
>> My original patch for this just skipped the GHES->offline process
>> for huge pages. But I wasn't aware of the sysctl control. That provides
>> a better solution.
>
> Tony, does that mean you're OK with using the existing sysctl interface? If
> so, I'll just send a separate patch to update the sysfs-memory-page-offline
> documentation and drop the rest.

Kyle,

It depends on which camp the external customer that reported this
falls into:

1) "I'm OK disabling all soft offline requests".

or the:

2) "I'd like 4K pages to still go offline if the BIOS asks, just not any huge pages".

Shawn: Can you please find out?

-Tony
>> My original patch for this just skipped the GHES->offline process
>> for huge pages. But I wasn't aware of the sysctl control. That provides
>> a better solution.
>
> Tony, does that mean you're OK with using the existing sysctl interface? If
> so, I'll just send a separate patch to update the sysfs-memory-page-offline
> documentation and drop the rest.

Kyle,

It depends on which camp the external customer that reported this
falls into:

1) "I'm OK disabling all soft offline requests".

or the:

2) "I'd like 4K pages to still go offline if the BIOS asks, just not any huge pages".

Shawn: Can you please find out?


-> Prefer the 2nd option, "4K pages still go offline if the BIOS asks, just not any huge pages."

Shawn
On Wed, Sep 17, 2025 at 06:35:14AM +0000, Fan, Shawn wrote:
> >> My original patch for this just skipped the GHES->offline process
> >> for huge pages. But I wasn't aware of the sysctl control. That provides
> >> a better solution.
> >
> > Tony, does that mean you're OK with using the existing sysctl interface? If
> > so, I'll just send a separate patch to update the sysfs-memory-page-offline
> > documentation and drop the rest.
>
> Kyle,
>
> It depends on which camp the external customer that reported this
> falls into:
>
> 1) "I'm OK disabling all soft offline requests".
>
> or the:
>
> 2) "I'd like 4K pages to still go offline if the BIOS asks, just not any huge pages".
>
> Shawn: Can you please find out?
>
>
> -> Prefer the 2nd option, "4K pages still go offline if the BIOS asks, just not any huge pages."

OK, thank you.

Does that mean they want to avoid offlining transparent huge pages as well?

Thanks,
Kyle Meyer
On 2025/9/18 02:59, Kyle Meyer wrote:
> On Wed, Sep 17, 2025 at 06:35:14AM +0000, Fan, Shawn wrote:
>>>> My original patch for this just skipped the GHES->offline process
>>>> for huge pages. But I wasn't aware of the sysctl control. That provides
>>>> a better solution.
>>>
>>> Tony, does that mean you're OK with using the existing sysctl interface? If
>>> so, I'll just send a separate patch to update the sysfs-memory-page-offline
>>> documentation and drop the rest.
>>
>> Kyle,
>>
>> It depends on which camp the external customer that reported this
>> falls into:
>>
>> 1) "I'm OK disabling all soft offline requests".
>>
>> or the:
>>
>> 2) "I'd like 4K pages to still go offline if the BIOS asks, just not any huge pages".
>>
>> Shawn: Can you please find out?
>>
>>
>> -> Prefer the 2nd option, "4K pages still go offline if the BIOS asks, just not any huge pages."
>
> OK, thank you.
>
> Does that mean they want to avoid offlining transparent huge pages as well?
>
> Thanks,
> Kyle Meyer

Hi, Shawn,

Memory access is typically interleaved between channels. When the
per-rank threshold is exceeded, soft-offlining the last accessed address
seems unreasonable - regardless of whether it's a 4KB page or a huge
page. The error accumulation happens at the rank level, but the action
is taken on a specific page that happened to trigger the threshold,
which doesn't address the underlying issue.

I prefer the first option: disabling all soft offline requests from the
GHES driver.

Thanks.
Shuai
On Thu, Sep 18, 2025 at 1:34 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/18 02:59, Kyle Meyer wrote:
> > On Wed, Sep 17, 2025 at 06:35:14AM +0000, Fan, Shawn wrote:
> >>>> My original patch for this just skipped the GHES->offline process
> >>>> for huge pages. But I wasn't aware of the sysctl control. That provides
> >>>> a better solution.
> >>>
> >>> Tony, does that mean you're OK with using the existing sysctl interface? If
> >>> so, I'll just send a separate patch to update the sysfs-memory-page-offline
> >>> documentation and drop the rest.
> >>
> >> Kyle,
> >>
> >> It depends on which camp the external customer that reported this
> >> falls into:
> >>
> >> 1) "I'm OK disabling all soft offline requests".
> >>
> >> or the:
> >>
> >> 2) "I'd like 4K pages to still go offline if the BIOS asks, just not any huge pages".
> >>
> >> Shawn: Can you please find out?
> >>
> >>
> >> -> Prefer the 2nd option, "4K pages still go offline if the BIOS asks, just not any huge pages."
> >
> > OK, thank you.
> >
> > Does that mean they want to avoid offlining transparent huge pages as well?
> >
> > Thanks,
> > Kyle Meyer
>
>
> Hi, Shawn,
>
> As memory access is typically interleaved between channels. When the
> per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable - regardless of whether it's a 4KB page or a huge
> page. The error accumulation happens at the rank level, but the action
> is taken on a specific page that happened to trigger the threshold,
> which doesn't address the underlying issue.

Does that mean the soft offline action taken by the kernel is almost useless
from the hardware's point of view? Or are the current signals/info about
corrected errors that the kernel gets from firmware insufficient for the
kernel to do anything meaningful?

>
> I prefer the first option that disabling all soft offline requests from
> GHES driver.
>
> Thanks.
> Shuai