Large block size (LBS) folios cannot be split to order-0 folios, only
down to min_order_for_split(). The current code simply fails the split,
which is not optimal. Split the folio to min_order_for_split() instead,
so that after the split only the folio containing the poisoned page
becomes unusable.

For soft offline, do not split the large folio if it cannot be split to
order-0, since the folio is still accessible from userspace and a
premature split might cause a performance loss.
Suggested-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/memory-failure.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f698df156bf8..443df9581c24 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
* there is still more to do, hence the page refcount we took earlier
* is still needed.
*/
-static int try_to_split_thp_page(struct page *page, bool release)
+static int try_to_split_thp_page(struct page *page, unsigned int new_order,
+ bool release)
 {
 	int ret;
 
 	lock_page(page);
-	ret = split_huge_page(page);
+	ret = split_huge_page_to_list_to_order(page, NULL, new_order);
 	unlock_page(page);
 
 	if (ret && release)
@@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
 	folio_unlock(folio);
 
 	if (folio_test_large(folio)) {
+		int new_order = min_order_for_split(folio);
/*
* The flag must be set after the refcount is bumped
* otherwise it may race with THP split.
@@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
* page is a valid handlable page.
*/
folio_set_has_hwpoisoned(folio);
- if (try_to_split_thp_page(p, false) < 0) {
+ /*
+ * If the folio cannot be split to order-0, kill the process,
+ * but split the folio anyway to minimize the amount of unusable
+ * pages.
+ */
+ if (try_to_split_thp_page(p, new_order, false) || new_order) {
+ /* get folio again in case the original one is split */
+ folio = page_folio(p);
res = -EHWPOISON;
kill_procs_now(p, pfn, flags, folio);
put_page(p);
@@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
};
if (!huge && folio_test_large(folio)) {
- if (try_to_split_thp_page(page, true)) {
+ int new_order = min_order_for_split(folio);
+
+ /*
+ * If the folio cannot be split to order-0, do not split it at
+ * all to retain the still accessible large folio.
+	 * NOTE: if getting free memory is preferred, split it like it
+ * is done in memory_failure().
+ */
+ if (new_order || try_to_split_thp_page(page, new_order, true)) {
pr_info("%#lx: thp split failed\n", pfn);
return -EBUSY;
}
--
2.51.0
Hi Zi,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.17 next-20251010]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-huge_memory-do-not-change-split_huge_page-target-order-silently/20251011-014145
base: linus/master
patch link: https://lore.kernel.org/r/20251010173906.3128789-3-ziy%40nvidia.com
patch subject: [PATCH 2/2] mm/memory-failure: improve large block size folio handling.
config: parisc-allmodconfig (https://download.01.org/0day-ci/archive/20251011/202510111805.rg0AewVk-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251011/202510111805.rg0AewVk-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510111805.rg0AewVk-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/memory-failure.c: In function 'memory_failure':
>> mm/memory-failure.c:2278:33: error: implicit declaration of function 'min_order_for_split' [-Wimplicit-function-declaration]
2278 | int new_order = min_order_for_split(folio);
| ^~~~~~~~~~~~~~~~~~~
vim +/min_order_for_split +2278 mm/memory-failure.c
2147
2148 /**
2149 * memory_failure - Handle memory failure of a page.
2150 * @pfn: Page Number of the corrupted page
2151 * @flags: fine tune action taken
2152 *
2153 * This function is called by the low level machine check code
2154 * of an architecture when it detects hardware memory corruption
2155 * of a page. It tries its best to recover, which includes
2156 * dropping pages, killing processes etc.
2157 *
2158 * The function is primarily of use for corruptions that
2159 * happen outside the current execution context (e.g. when
2160 * detected by a background scrubber)
2161 *
2162 * Must run in process context (e.g. a work queue) with interrupts
2163 * enabled and no spinlocks held.
2164 *
2165 * Return:
2166 * 0 - success,
2167 * -ENXIO - memory not managed by the kernel
2168 * -EOPNOTSUPP - hwpoison_filter() filtered the error event,
2169 * -EHWPOISON - the page was already poisoned, potentially
2170 * kill process,
2171 * other negative values - failure.
2172 */
2173 int memory_failure(unsigned long pfn, int flags)
2174 {
2175 struct page *p;
2176 struct folio *folio;
2177 struct dev_pagemap *pgmap;
2178 int res = 0;
2179 unsigned long page_flags;
2180 bool retry = true;
2181 int hugetlb = 0;
2182
2183 if (!sysctl_memory_failure_recovery)
2184 panic("Memory failure on page %lx", pfn);
2185
2186 mutex_lock(&mf_mutex);
2187
2188 if (!(flags & MF_SW_SIMULATED))
2189 hw_memory_failure = true;
2190
2191 p = pfn_to_online_page(pfn);
2192 if (!p) {
2193 res = arch_memory_failure(pfn, flags);
2194 if (res == 0)
2195 goto unlock_mutex;
2196
2197 if (pfn_valid(pfn)) {
2198 pgmap = get_dev_pagemap(pfn);
2199 put_ref_page(pfn, flags);
2200 if (pgmap) {
2201 res = memory_failure_dev_pagemap(pfn, flags,
2202 pgmap);
2203 goto unlock_mutex;
2204 }
2205 }
2206 pr_err("%#lx: memory outside kernel control\n", pfn);
2207 res = -ENXIO;
2208 goto unlock_mutex;
2209 }
2210
2211 try_again:
2212 res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
2213 if (hugetlb)
2214 goto unlock_mutex;
2215
2216 if (TestSetPageHWPoison(p)) {
2217 res = -EHWPOISON;
2218 if (flags & MF_ACTION_REQUIRED)
2219 res = kill_accessing_process(current, pfn, flags);
2220 if (flags & MF_COUNT_INCREASED)
2221 put_page(p);
2222 action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
2223 goto unlock_mutex;
2224 }
2225
2226 /*
2227 * We need/can do nothing about count=0 pages.
2228 * 1) it's a free page, and therefore in safe hand:
2229 * check_new_page() will be the gate keeper.
2230 * 2) it's part of a non-compound high order page.
2231 * Implies some kernel user: cannot stop them from
2232 * R/W the page; let's pray that the page has been
2233 * used and will be freed some time later.
2234 * In fact it's dangerous to directly bump up page count from 0,
2235 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
2236 */
2237 if (!(flags & MF_COUNT_INCREASED)) {
2238 res = get_hwpoison_page(p, flags);
2239 if (!res) {
2240 if (is_free_buddy_page(p)) {
2241 if (take_page_off_buddy(p)) {
2242 page_ref_inc(p);
2243 res = MF_RECOVERED;
2244 } else {
2245 /* We lost the race, try again */
2246 if (retry) {
2247 ClearPageHWPoison(p);
2248 retry = false;
2249 goto try_again;
2250 }
2251 res = MF_FAILED;
2252 }
2253 res = action_result(pfn, MF_MSG_BUDDY, res);
2254 } else {
2255 res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
2256 }
2257 goto unlock_mutex;
2258 } else if (res < 0) {
2259 res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
2260 goto unlock_mutex;
2261 }
2262 }
2263
2264 folio = page_folio(p);
2265
2266 /* filter pages that are protected from hwpoison test by users */
2267 folio_lock(folio);
2268 if (hwpoison_filter(p)) {
2269 ClearPageHWPoison(p);
2270 folio_unlock(folio);
2271 folio_put(folio);
2272 res = -EOPNOTSUPP;
2273 goto unlock_mutex;
2274 }
2275 folio_unlock(folio);
2276
2277 if (folio_test_large(folio)) {
> 2278 int new_order = min_order_for_split(folio);
2279 /*
2280 * The flag must be set after the refcount is bumped
2281 * otherwise it may race with THP split.
2282 * And the flag can't be set in get_hwpoison_page() since
2283 * it is called by soft offline too and it is just called
2284 * for !MF_COUNT_INCREASED. So here seems to be the best
2285 * place.
2286 *
2287 * Don't need care about the above error handling paths for
2288 * get_hwpoison_page() since they handle either free page
2289 * or unhandlable page. The refcount is bumped iff the
2290 * page is a valid handlable page.
2291 */
2292 folio_set_has_hwpoisoned(folio);
2293 /*
2294 * If the folio cannot be split to order-0, kill the process,
2295 * but split the folio anyway to minimize the amount of unusable
2296 * pages.
2297 */
2298 if (try_to_split_thp_page(p, new_order, false) || new_order) {
2299 /* get folio again in case the original one is split */
2300 folio = page_folio(p);
2301 res = -EHWPOISON;
2302 kill_procs_now(p, pfn, flags, folio);
2303 put_page(p);
2304 action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);
2305 goto unlock_mutex;
2306 }
2307 VM_BUG_ON_PAGE(!page_count(p), p);
2308 folio = page_folio(p);
2309 }
2310
2311 /*
2312 * We ignore non-LRU pages for good reasons.
2313 * - PG_locked is only well defined for LRU pages and a few others
2314 * - to avoid races with __SetPageLocked()
2315 * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
2316 * The check (unnecessarily) ignores LRU pages being isolated and
2317 * walked by the page reclaim code, however that's not a big loss.
2318 */
2319 shake_folio(folio);
2320
2321 folio_lock(folio);
2322
2323 /*
2324 * We're only intended to deal with the non-Compound page here.
2325 * The page cannot become compound pages again as folio has been
2326 * splited and extra refcnt is held.
2327 */
2328 WARN_ON(folio_test_large(folio));
2329
2330 /*
2331 * We use page flags to determine what action should be taken, but
2332 * the flags can be modified by the error containment action. One
2333 * example is an mlocked page, where PG_mlocked is cleared by
2334 * folio_remove_rmap_*() in try_to_unmap_one(). So to determine page
2335 * status correctly, we save a copy of the page flags at this time.
2336 */
2337 page_flags = folio->flags.f;
2338
2339 /*
2340 * __munlock_folio() may clear a writeback folio's LRU flag without
2341 * the folio lock. We need to wait for writeback completion for this
2342 * folio or it may trigger a vfs BUG while evicting inode.
2343 */
2344 if (!folio_test_lru(folio) && !folio_test_writeback(folio))
2345 goto identify_page_state;
2346
2347 /*
2348 * It's very difficult to mess with pages currently under IO
2349 * and in many cases impossible, so we just avoid it here.
2350 */
2351 folio_wait_writeback(folio);
2352
2353 /*
2354 * Now take care of user space mappings.
2355 * Abort on fail: __filemap_remove_folio() assumes unmapped page.
2356 */
2357 if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
2358 res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED);
2359 goto unlock_page;
2360 }
2361
2362 /*
2363 * Torn down by someone else?
2364 */
2365 if (folio_test_lru(folio) && !folio_test_swapcache(folio) &&
2366 folio->mapping == NULL) {
2367 res = action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
2368 goto unlock_page;
2369 }
2370
2371 identify_page_state:
2372 res = identify_page_state(pfn, p, page_flags);
2373 mutex_unlock(&mf_mutex);
2374 return res;
2375 unlock_page:
2376 folio_unlock(folio);
2377 unlock_mutex:
2378 mutex_unlock(&mf_mutex);
2379 return res;
2380 }
2381 EXPORT_SYMBOL_GPL(memory_failure);
2382
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 11 Oct 2025, at 6:23, kernel test robot wrote:

> All errors (new ones prefixed by >>):
>
>    mm/memory-failure.c: In function 'memory_failure':
> >> mm/memory-failure.c:2278:33: error: implicit declaration of function 'min_order_for_split' [-Wimplicit-function-declaration]
>     2278 |                 int new_order = min_order_for_split(folio);
>          |                                 ^~~~~~~~~~~~~~~~~~~

min_order_for_split() is missing in the !CONFIG_TRANSPARENT_HUGEPAGE case.
Will add one to get rid of this error. Thanks.

--
Best Regards,
Yan, Zi
On 2025/10/11 1:39, Zi Yan wrote:
> Large block size (LBS) folios cannot be split to order-0 folios but
> min_order_for_folio(). Current split fails directly, but that is not
> optimal. Split the folio to min_order_for_folio(), so that, after split,
> only the folio containing the poisoned page becomes unusable instead.
>
> For soft offline, do not split the large folio if it cannot be split to
> order-0. Since the folio is still accessible from userspace and premature
> split might lead to potential performance loss.
Thanks for your patch.
>
> Suggested-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/memory-failure.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f698df156bf8..443df9581c24 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> * there is still more to do, hence the page refcount we took earlier
> * is still needed.
> */
> -static int try_to_split_thp_page(struct page *page, bool release)
> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> + bool release)
> {
> int ret;
>
> lock_page(page);
> - ret = split_huge_page(page);
> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
> unlock_page(page);
>
> if (ret && release)
> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> folio_unlock(folio);
>
> if (folio_test_large(folio)) {
> + int new_order = min_order_for_split(folio);
> /*
> * The flag must be set after the refcount is bumped
> * otherwise it may race with THP split.
> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> * page is a valid handlable page.
> */
> folio_set_has_hwpoisoned(folio);
> - if (try_to_split_thp_page(p, false) < 0) {
> + /*
> + * If the folio cannot be split to order-0, kill the process,
> + * but split the folio anyway to minimize the amount of unusable
> + * pages.
> + */
> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
> + /* get folio again in case the original one is split */
> + folio = page_folio(p);
If the original folio A is split and the after-split folio is B (A != B), is the
refcount held on folio A above lost? I.e. get_hwpoison_page() took the extra
refcount on folio A, but we put the refcount of folio B below. Is this a
problem, or am I missing something?
Thanks.
On Sat, Oct 11, 2025 at 12:12:12PM +0800, Miaohe Lin wrote:
> > folio_set_has_hwpoisoned(folio);
> > - if (try_to_split_thp_page(p, false) < 0) {
> > + /*
> > + * If the folio cannot be split to order-0, kill the process,
> > + * but split the folio anyway to minimize the amount of unusable
> > + * pages.
> > + */
> > + if (try_to_split_thp_page(p, new_order, false) || new_order) {
> > + /* get folio again in case the original one is split */
> > + folio = page_folio(p);
>
> If original folio A is split and the after-split new folio is B (A != B), will the
> refcnt of folio A held above be missing? I.e. get_hwpoison_page() held the extra refcnt
> of folio A, but we put the refcnt of folio B below. Is this a problem or am I miss
> something?
That's how split works.
Zi Yan, the kernel-doc for folio_split() could use some attention.
First, it's not kernel-doc; the comment opens with /* instead of /**.
Second, it says:
* After split, folio is left locked for caller.
which isn't actually true, right? The folio which contains
@split_at will be locked. Also, it will contain the additional
reference which was taken on @folio by the caller.
On 11 Oct 2025, at 1:00, Matthew Wilcox wrote:
> On Sat, Oct 11, 2025 at 12:12:12PM +0800, Miaohe Lin wrote:
>>> folio_set_has_hwpoisoned(folio);
>>> - if (try_to_split_thp_page(p, false) < 0) {
>>> + /*
>>> + * If the folio cannot be split to order-0, kill the process,
>>> + * but split the folio anyway to minimize the amount of unusable
>>> + * pages.
>>> + */
>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>> + /* get folio again in case the original one is split */
>>> + folio = page_folio(p);
>>
>> If original folio A is split and the after-split new folio is B (A != B), will the
>> refcnt of folio A held above be missing? I.e. get_hwpoison_page() held the extra refcnt
>> of folio A, but we put the refcnt of folio B below. Is this a problem or am I miss
>> something?
>
> That's how split works.
>
> Zi Yan, the kernel-doc for folio_split() could use some attention.
> First, it's not kernel-doc; the comment opens with /* instead of /**.
Got it.
> Second, it says:
>
> * After split, folio is left locked for caller.
>
> which isn't actually true, right? The folio which contains
No, the folio is indeed left locked. Currently folio_split() is
used by truncate_inode_partial_folio() via try_folio_split(),
and the folio passed into truncate_inode_partial_folio() is
already locked by the caller and is unlocked by the caller as well.
The caller does not know anything about @split_at, thus
cannot unlock the folio containing @split_at.
> @split_at will be locked. Also, it will contain the additional
> reference which was taken on @folio by the caller.
The same goes for the folio reference.
That is the reason we have @split_at and @lock_at for __folio_split().
I can see it is counter-intuitive. To change it, I might need
your help on how to change the truncate_inode_partial_folio() callers,
since all of them use @folio afterwards without holding a reference;
I am not sure their uses would still be safe.
--
Best Regards,
Yan, Zi
On 2025/10/11 13:00, Matthew Wilcox wrote:
> On Sat, Oct 11, 2025 at 12:12:12PM +0800, Miaohe Lin wrote:
>>> folio_set_has_hwpoisoned(folio);
>>> - if (try_to_split_thp_page(p, false) < 0) {
>>> + /*
>>> + * If the folio cannot be split to order-0, kill the process,
>>> + * but split the folio anyway to minimize the amount of unusable
>>> + * pages.
>>> + */
>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>> + /* get folio again in case the original one is split */
>>> + folio = page_folio(p);
>>
>> If original folio A is split and the after-split new folio is B (A != B), will the
>> refcnt of folio A held above be missing? I.e. get_hwpoison_page() held the extra refcnt
>> of folio A, but we put the refcnt of folio B below. Is this a problem or am I miss
>> something?
>
> That's how split works.
I read the code and now see how split works. Thanks for pointing this out.
>
> Zi Yan, the kernel-doc for folio_split() could use some attention.
That would be really helpful.
Thanks.
> First, it's not kernel-doc; the comment opens with /* instead of /**.
> Second, it says:
>
> * After split, folio is left locked for caller.
>
> which isn't actually true, right? The folio which contains
> @split_at will be locked. Also, it will contain the additional
> reference which was taken on @folio by the caller.
>
On Fri, Oct 10, 2025 at 01:39:06PM -0400, Zi Yan wrote:
> Large block size (LBS) folios cannot be split to order-0 folios but
> min_order_for_folio(). [...]
>
> Suggested-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis