A long-term goal is supporting frozen PageOffline pages, and later
PageOffline pages that don't have a refcount at all. Some more work for
that is needed -- in particular around non-folio page migration and
memory ballooning drivers -- but let's start by handling PageOffline pages
that can be skipped during memory offlining differently.
Let's introduce a PageOffline specific page flag (PG_offline_skippable)
that for now reuses PG_owner_2. In the memdesc future, it will be one of
a small number of per-memdesc flags stored alongside the type.
By setting PG_offline_skippable, a driver indicates that it can
restore the PageOffline state of these specific pages when re-onlining a
memory block: it knows that these pages are supposed to be PageOffline()
without the information in the vmemmap, so it can filter them out and
not expose them to the buddy -> they stay PageOffline().
While PG_offline_offlineable might be clearer, it is also super
confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
So let's use "skippable" for now.
The flag is not supposed to be used for movable PageOffline pages as
used for balloon compaction; movable PageOffline() pages can simply be
migrated during the memory offlining stage.
Let's convert the single user from our MEM_GOING_OFFLINE approach
to the new PG_offline_skippable approach: virtio-mem. Fortunately,
this simplifies the code quite a lot.
What if someone decides to grab a reference on these pages although they
really shouldn't? After all, we'll now keep the refcount at 1 (until we
can properly stop using the refcount completely).
Well, less worse things will happen than would currently: currently,
if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
we would run into the
if (WARN_ON(!page_ref_dec_and_test(page)))
dump_page(page, "fake-offline page referenced");
And once that unexpected reference would get dropped, we would end up
freeing that page to the buddy: ouch.
Now, we'll allow for offlining that memory, and when that unexpected
reference would get dropped, we would not end up freeing that page to
the buddy. Once we have frozen PageOffline() pages, it will all get a
lot cleaner.
Note that we didn't see the existing WARN_ON so far, because nobody
should ever be referencing such pages.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
drivers/virtio/virtio_mem.c | 111 +-----------------------------------
include/linux/page-flags.h | 29 +++++++---
mm/memory_hotplug.c | 12 ++--
mm/page_alloc.c | 8 +--
mm/page_isolation.c | 21 +++----
5 files changed, 42 insertions(+), 139 deletions(-)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 56d0dbe621637..77b72843c4b53 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex);
static LIST_HEAD(virtio_mem_devices);
static void virtio_mem_online_page_cb(struct page *page, unsigned int order);
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
- unsigned long nr_pages);
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
- unsigned long nr_pages);
static void virtio_mem_retry(struct virtio_mem *vm);
static int virtio_mem_create_resource(struct virtio_mem *vm);
static void virtio_mem_delete_resource(struct virtio_mem *vm);
@@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct virtio_mem *vm,
virtio_mem_sbm_set_mb_state(vm, mb_id, new_state);
}
-static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
- unsigned long mb_id)
-{
- const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
- unsigned long pfn;
- int sb_id;
-
- for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
- if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
- continue;
- pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
- sb_id * vm->sbm.sb_size);
- virtio_mem_fake_offline_going_offline(pfn, nr_pages);
- }
-}
-
-static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
- unsigned long mb_id)
-{
- const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
- unsigned long pfn;
- int sb_id;
-
- for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
- if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
- continue;
- pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
- sb_id * vm->sbm.sb_size);
- virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
- }
-}
-
-static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm,
- unsigned long bb_id,
- unsigned long pfn,
- unsigned long nr_pages)
-{
- /*
- * When marked as "fake-offline", all online memory of this device block
- * is allocated by us. Otherwise, we don't have any memory allocated.
- */
- if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
- VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
- return;
- virtio_mem_fake_offline_going_offline(pfn, nr_pages);
-}
-
-static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm,
- unsigned long bb_id,
- unsigned long pfn,
- unsigned long nr_pages)
-{
- if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
- VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
- return;
- virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
-}
-
/*
* This callback will either be called synchronously from add_memory() or
* asynchronously (e.g., triggered via user space). We have to be careful
@@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
break;
}
vm->hotplug_active = true;
- if (vm->in_sbm)
- virtio_mem_sbm_notify_going_offline(vm, id);
- else
- virtio_mem_bbm_notify_going_offline(vm, id,
- mhp->start_pfn,
- mhp->nr_pages);
break;
case MEM_GOING_ONLINE:
mutex_lock(&vm->hotplug_mutex);
@@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
case MEM_CANCEL_OFFLINE:
if (!vm->hotplug_active)
break;
- if (vm->in_sbm)
- virtio_mem_sbm_notify_cancel_offline(vm, id);
- else
- virtio_mem_bbm_notify_cancel_offline(vm, id,
- mhp->start_pfn,
- mhp->nr_pages);
vm->hotplug_active = false;
mutex_unlock(&vm->hotplug_mutex);
break;
@@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
SetPageDirty(page);
else
__SetPageOffline(page);
+ __SetPageOfflineSkippable(page);
VM_WARN_ON_ONCE(!PageOffline(page));
}
page_offline_end();
@@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);
+ __ClearPageOfflineSkippable(page);
if (!onlined)
/* generic_online_page() will clear PageOffline(). */
ClearPageDirty(page);
@@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
return -EBUSY;
}
-/*
- * Handle fake-offline pages when memory is going offline - such that the
- * pages can be skipped by mm-core when offlining.
- */
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
- unsigned long nr_pages)
-{
- struct page *page;
- unsigned long i;
-
- /* Drop our reference to the pages so the memory can get offlined. */
- for (i = 0; i < nr_pages; i++) {
- page = pfn_to_page(pfn + i);
- if (WARN_ON(!page_ref_dec_and_test(page)))
- dump_page(page, "fake-offline page referenced");
- }
-}
-
-/*
- * Handle fake-offline pages when memory offlining is canceled - to undo
- * what we did in virtio_mem_fake_offline_going_offline().
- */
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
- unsigned long nr_pages)
-{
- unsigned long i;
-
- /*
- * Get the reference again that we dropped via page_ref_dec_and_test()
- * when going offline.
- */
- for (i = 0; i < nr_pages; i++)
- page_ref_inc(pfn_to_page(pfn + i));
-}
-
static void virtio_mem_online_page(struct virtio_mem *vm,
struct page *page, unsigned int order)
{
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c1d49554c71a..581510ae8e83a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -178,6 +178,9 @@ enum pageflags {
PG_vmemmap_self_hosted = PG_owner_priv_1,
#endif
+ /* The driver allows for skipping these pages during memory offlining */
+ PG_offline_skippable = PG_owner_2,
+
/*
* Flags only valid for compound pages. Stored in first tail page's
* flags word. Cannot use the first 8 flags or any flag marked as
@@ -1029,14 +1032,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* refcount of 1 and PageOffline(). generic_online_page() will
* take care of clearing PageOffline().
*
- * If a driver wants to allow to offline unmovable PageOffline() pages without
- * putting them back to the buddy, it can do so via the memory notifier by
- * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
- * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
- * pages (now with a reference count of zero) are treated like free (unmanaged)
- * pages, allowing the containing memory block to get offlined. A driver that
- * relies on this feature is aware that re-onlining the memory block will
- * require not giving them to the buddy via generic_online_page().
+ * If a driver wants to allow for offlining unmovable PageOffline() pages
+ * when offlining a memory block without exposing these pages as "free" to
+ * the buddy, it can mark them as PG_offline_skippable.
+ *
+ * By marking these PageOffline pages PG_offline_skippable, the driver
+ * acknowledges that it
+ * (a) knows which pages are PageOffline even without the memmap.
+ * (b) implements the online_page_callback_t callback to exclude these pages
+ * from getting exposed to the buddy, so they will stay PageOffline()
+ * when onlining a memory block.
+ * (c) synchronizes against concurrent memory onlining/offlining whenever
+ * adjusting the PG_offline_skippable flag.
+ *
+ * Note that the refcount of these pages will for now be assumed to always
+ * be 1: nobody except the owner should be referencing these pages.
+ * Offlining code will complain if the refcount is not 1. In the future,
+ * these pages will always be frozen and not have a refcount.
*
* Memory offlining code will not adjust the managed page count for any
* PageOffline() pages, treating them like they were never exposed to the
@@ -1048,6 +1060,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* page_offline_freeze()/page_offline_thaw().
*/
PAGE_TYPE_OPS(Offline, offline, offline)
+__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND)
extern void page_offline_freeze(void);
extern void page_offline_thaw(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b1..0cc5537f234bb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
goto found;
/*
- * PageOffline() pages that are not marked __PageMovable() and
- * have a reference count > 0 (after MEM_GOING_OFFLINE) are
- * definitely unmovable. If their reference count would be 0,
- * they could at least be skipped when offlining memory.
+ * PageOffline() pages that are neither "movable" nor
+ * "skippable" prevent memory offlining.
*/
- if (PageOffline(page) && page_count(page))
+ if (PageOffline(page) && !PageOfflineSkippable(page))
return -EBUSY;
if (!PageHuge(page))
@@ -1807,6 +1805,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
struct page *page;
page = pfn_to_page(pfn);
+
+ if (PageOffline(page) && PageOfflineSkippable(page))
+ continue;
+
folio = page_folio(page);
if (!folio_try_get(folio))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a6fe1e9b95941..325b51c905076 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
continue;
}
/*
- * At this point all remaining PageOffline() pages have a
- * reference count of 0 and can simply be skipped.
+ * At this point all remaining PageOffline() pages must be
+ * "skippable" and have exactly one reference.
*/
if (PageOffline(page)) {
- BUG_ON(page_count(page));
- BUG_ON(PageBuddy(page));
+ WARN_ON_ONCE(!PageOfflineSkippable(page));
+ WARN_ON_ONCE(page_count(page) != 1);
already_offline++;
pfn++;
continue;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d26..debd898b794ea 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
continue;
/*
- * We treat all PageOffline() pages as movable when offlining
- * to give drivers a chance to decrement their reference count
- * in MEM_GOING_OFFLINE in order to indicate that these pages
- * can be offlined as there are no direct references anymore.
- * For actually unmovable PageOffline() where the driver does
- * not support this, we will fail later when trying to actually
- * move these pages that still have a reference count > 0.
- * (false negatives in this function only)
+ * PageOffline() pages that are marked as "skippable"
+ * are treated like movable pages for memory offlining.
*/
- if ((flags & MEMORY_OFFLINE) && PageOffline(page))
+ if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
+ PageOfflineSkippable(page))
continue;
if (__PageMovable(page) || PageLRU(page))
@@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
/* A HWPoisoned page cannot be also PageBuddy */
pfn++;
else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
- !page_count(page))
+ PageOfflineSkippable(page))
/*
- * The responsible driver agreed to skip PageOffline()
- * pages when offlining memory by dropping its
- * reference in MEM_GOING_OFFLINE.
+ * If the page is a skippable PageOffline() page,
+ * we can offline the memory block, as the driver will
+ * re-discover them when re-onlining the memory.
*/
pfn++;
else
--
2.49.0
On 14 May 2025, at 7:15, David Hildenbrand wrote:
> A long-term goal is supporting frozen PageOffline pages, and later
> PageOffline pages that don't have a refcount at all. Some more work for
> that is needed -- in particular around non-folio page migration and
> memory ballooning drivers -- but let's start by handling PageOffline pages
> that can be skipped during memory offlining differently.
>
> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
> that for now reuses PG_owner_2. In the memdesc future, it will be one of
> a small number of per-memdesc flags stored alongside the type.
>
> By setting PG_offline_skippable, a driver indicates that it can
> restore the PageOffline state of these specific pages when re-onlining a
> memory block: it knows that these pages are supposed to be PageOffline()
> without the information in the vmemmap, so it can filter them out and
> not expose them to the buddy -> they stay PageOffline().
>
> While PG_offline_offlineable might be clearer, it is also super
> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
> So let's use "skippable" for now.
>
> The flag is not supposed to be used for movable PageOffline pages as
> used for balloon compaction; movable PageOffline() pages can simply be
> migrated during the memory offlining stage.
>
> Let's convert the single user from our MEM_GOING_OFFLINE approach
> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
> this simplifies the code quite a lot.
>
> What if someone decides to grab a reference on these pages although they
> really shouldn't? After all, we'll now keep the refcount at 1 (until we
> can properly stop using the refcount completely).
>
> Well, less worse things will happen than would currently: currently,
> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
> we would run into the
> if (WARN_ON(!page_ref_dec_and_test(page)))
> dump_page(page, "fake-offline page referenced");
>
> And once that unexpected reference would get dropped, we would end up
> freeing that page to the buddy: ouch.
>
> Now, we'll allow for offlining that memory, and when that unexpected
> reference would get dropped, we would not end up freeing that page to
> the buddy. Once we have frozen PageOffline() pages, it will all get a
> lot cleaner.
>
> Note that we didn't see the existing WARN_ON so far, because nobody
> should ever be referencing such pages.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> drivers/virtio/virtio_mem.c | 111 +-----------------------------------
> include/linux/page-flags.h | 29 +++++++---
> mm/memory_hotplug.c | 12 ++--
> mm/page_alloc.c | 8 +--
> mm/page_isolation.c | 21 +++----
> 5 files changed, 42 insertions(+), 139 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 56d0dbe621637..77b72843c4b53 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex);
> static LIST_HEAD(virtio_mem_devices);
>
> static void virtio_mem_online_page_cb(struct page *page, unsigned int order);
> -static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
> - unsigned long nr_pages);
> -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
> - unsigned long nr_pages);
> static void virtio_mem_retry(struct virtio_mem *vm);
> static int virtio_mem_create_resource(struct virtio_mem *vm);
> static void virtio_mem_delete_resource(struct virtio_mem *vm);
> @@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct virtio_mem *vm,
> virtio_mem_sbm_set_mb_state(vm, mb_id, new_state);
> }
>
> -static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
> - unsigned long mb_id)
> -{
> - const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
> - unsigned long pfn;
> - int sb_id;
> -
> - for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
> - if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
> - continue;
> - pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
> - sb_id * vm->sbm.sb_size);
> - virtio_mem_fake_offline_going_offline(pfn, nr_pages);
> - }
> -}
> -
> -static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
> - unsigned long mb_id)
> -{
> - const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
> - unsigned long pfn;
> - int sb_id;
> -
> - for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
> - if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
> - continue;
> - pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
> - sb_id * vm->sbm.sb_size);
> - virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
> - }
> -}
> -
> -static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm,
> - unsigned long bb_id,
> - unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - /*
> - * When marked as "fake-offline", all online memory of this device block
> - * is allocated by us. Otherwise, we don't have any memory allocated.
> - */
> - if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
> - VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
> - return;
> - virtio_mem_fake_offline_going_offline(pfn, nr_pages);
> -}
> -
> -static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm,
> - unsigned long bb_id,
> - unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
> - VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
> - return;
> - virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
> -}
> -
> /*
> * This callback will either be called synchronously from add_memory() or
> * asynchronously (e.g., triggered via user space). We have to be careful
> @@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
> break;
> }
> vm->hotplug_active = true;
> - if (vm->in_sbm)
> - virtio_mem_sbm_notify_going_offline(vm, id);
> - else
> - virtio_mem_bbm_notify_going_offline(vm, id,
> - mhp->start_pfn,
> - mhp->nr_pages);
> break;
> case MEM_GOING_ONLINE:
> mutex_lock(&vm->hotplug_mutex);
> @@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
> case MEM_CANCEL_OFFLINE:
> if (!vm->hotplug_active)
> break;
> - if (vm->in_sbm)
> - virtio_mem_sbm_notify_cancel_offline(vm, id);
> - else
> - virtio_mem_bbm_notify_cancel_offline(vm, id,
> - mhp->start_pfn,
> - mhp->nr_pages);
> vm->hotplug_active = false;
> mutex_unlock(&vm->hotplug_mutex);
> break;
> @@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
> SetPageDirty(page);
> else
> __SetPageOffline(page);
> + __SetPageOfflineSkippable(page);
> VM_WARN_ON_ONCE(!PageOffline(page));
> }
> page_offline_end();
> @@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> + __ClearPageOfflineSkippable(page);
> if (!onlined)
> /* generic_online_page() will clear PageOffline(). */
> ClearPageDirty(page);
> @@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
> return -EBUSY;
> }
>
> -/*
> - * Handle fake-offline pages when memory is going offline - such that the
> - * pages can be skipped by mm-core when offlining.
> - */
> -static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - struct page *page;
> - unsigned long i;
> -
> - /* Drop our reference to the pages so the memory can get offlined. */
> - for (i = 0; i < nr_pages; i++) {
> - page = pfn_to_page(pfn + i);
> - if (WARN_ON(!page_ref_dec_and_test(page)))
> - dump_page(page, "fake-offline page referenced");
> - }
> -}
> -
> -/*
> - * Handle fake-offline pages when memory offlining is canceled - to undo
> - * what we did in virtio_mem_fake_offline_going_offline().
> - */
> -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - unsigned long i;
> -
> - /*
> - * Get the reference again that we dropped via page_ref_dec_and_test()
> - * when going offline.
> - */
> - for (i = 0; i < nr_pages; i++)
> - page_ref_inc(pfn_to_page(pfn + i));
> -}
> -
> static void virtio_mem_online_page(struct virtio_mem *vm,
> struct page *page, unsigned int order)
> {
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c1d49554c71a..581510ae8e83a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -178,6 +178,9 @@ enum pageflags {
> PG_vmemmap_self_hosted = PG_owner_priv_1,
> #endif
>
> + /* The driver allows for skipping these pages during memory offlining */
> + PG_offline_skippable = PG_owner_2,
> +
> /*
> * Flags only valid for compound pages. Stored in first tail page's
> * flags word. Cannot use the first 8 flags or any flag marked as
> @@ -1029,14 +1032,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
> * refcount of 1 and PageOffline(). generic_online_page() will
> * take care of clearing PageOffline().
> *
> - * If a driver wants to allow to offline unmovable PageOffline() pages without
> - * putting them back to the buddy, it can do so via the memory notifier by
> - * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
> - * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
> - * pages (now with a reference count of zero) are treated like free (unmanaged)
> - * pages, allowing the containing memory block to get offlined. A driver that
> - * relies on this feature is aware that re-onlining the memory block will
> - * require not giving them to the buddy via generic_online_page().
> + * If a driver wants to allow for offlining unmovable PageOffline() pages
> + * when offlining a memory block without exposing these pages as "free" to
> + * the buddy, it can mark them as PG_offline_skippable.
> + *
> + * By marking these PageOffline pages PG_offline_skippable, the driver
> + * acknowledges that it
> + * (a) knows which pages are PageOffline even without the memmap.
> + * (b) implements the online_page_callback_t callback to exclude these pages
> + * from getting exposed to the buddy, so they will stay PageOffline()
> + * when onlining a memory block.
> + * (c) synchronizes against concurrent memory onlining/offlining whenever
> + * adjusting the PG_offline_skippable flag.
> + *
> + * Note that the refcount of these pages will for now be assumed to always
> + * be 1: nobody except the owner should be referencing these pages.
> + * Offlining code will complain if the refcount is not 1. In the future,
> + * these pages will always be frozen and not have a refcount.
> *
> * Memory offlining code will not adjust the managed page count for any
> * PageOffline() pages, treating them like they were never exposed to the
> @@ -1048,6 +1060,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
> * page_offline_freeze()/page_offline_thaw().
> */
> PAGE_TYPE_OPS(Offline, offline, offline)
> +__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND)
>
> extern void page_offline_freeze(void);
> extern void page_offline_thaw(void);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b1caedbade5b1..0cc5537f234bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
> goto found;
>
> /*
> - * PageOffline() pages that are not marked __PageMovable() and
> - * have a reference count > 0 (after MEM_GOING_OFFLINE) are
> - * definitely unmovable. If their reference count would be 0,
> - * they could at least be skipped when offlining memory.
> + * PageOffline() pages that are neither "movable" nor
> + * "skippable" prevent memory offlining.
> */
> - if (PageOffline(page) && page_count(page))
> + if (PageOffline(page) && !PageOfflineSkippable(page))
> return -EBUSY;
>
> if (!PageHuge(page))
> @@ -1807,6 +1805,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> struct page *page;
>
> page = pfn_to_page(pfn);
> +
> + if (PageOffline(page) && PageOfflineSkippable(page))
> + continue;
> +
Some comment like "Skippable PageOffline() pages are not migratable but are
skipped during memory offline" might help understand the change.
Some thoughts after reading the related code:
offline_pages() is a little convoluted, since it has two steps to make
sure a range of memory can be offlined:
1. start_isolate_page_range() checks unmovable (maybe not migratable
is more precise) pages using has_unmovable_pages(), but leaves unmovable
PageOffline() page handling to the driver;
2. scan_movable_pages() and do_migrate_range() migrate pages and handle
different types of PageOffline() pages.
It might make the logic cleaner if start_isolate_page_range() can
have a callback to allow driver to handle PageOffline() pages.
Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
why offline_pages() takes care of them. Shouldn't virtio-mem driver
handle them? I also realize that the two steps in offline_pages()
are very similar to alloc_contig_range() and virtio-mem is using
alloc_contig_range(). I wonder if the two steps in offline_pages()
can be abstracted out and shared with virtio-mem. Yes, offline_pages()
operates at memory section granularity, different from the granularity,
pageblock size, of alloc_contig_range() used in virtio-mem, but
both are trying to check unmovable pages and migrate movable pages.
> folio = page_folio(page);
>
> if (!folio_try_get(folio))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a6fe1e9b95941..325b51c905076 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
> continue;
> }
> /*
> - * At this point all remaining PageOffline() pages have a
> - * reference count of 0 and can simply be skipped.
> + * At this point all remaining PageOffline() pages must be
> + * "skippable" and have exactly one reference.
> */
> if (PageOffline(page)) {
> - BUG_ON(page_count(page));
> - BUG_ON(PageBuddy(page));
> + WARN_ON_ONCE(!PageOfflineSkippable(page));
> + WARN_ON_ONCE(page_count(page) != 1);
> already_offline++;
> pfn++;
> continue;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b2fc5266e3d26..debd898b794ea 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> continue;
>
> /*
> - * We treat all PageOffline() pages as movable when offlining
> - * to give drivers a chance to decrement their reference count
> - * in MEM_GOING_OFFLINE in order to indicate that these pages
> - * can be offlined as there are no direct references anymore.
> - * For actually unmovable PageOffline() where the driver does
> - * not support this, we will fail later when trying to actually
> - * move these pages that still have a reference count > 0.
> - * (false negatives in this function only)
> + * PageOffline() pages that are marked as "skippable"
> + * are treated like movable pages for memory offlining.
> */
> - if ((flags & MEMORY_OFFLINE) && PageOffline(page))
> + if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> + PageOfflineSkippable(page))
> continue;
With this change, we are no longer give non-virtio-mem driver a chance
to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
doing this?
>
> if (__PageMovable(page) || PageLRU(page))
> @@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> /* A HWPoisoned page cannot be also PageBuddy */
> pfn++;
> else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> - !page_count(page))
> + PageOfflineSkippable(page))
The same question as above.
> /*
> - * The responsible driver agreed to skip PageOffline()
> - * pages when offlining memory by dropping its
> - * reference in MEM_GOING_OFFLINE.
> + * If the page is a skippable PageOffline() page,
> + * we can offline the memory block, as the driver will
> + * re-discover them when re-onlining the memory.
> */
> pfn++;
> else
> --
> 2.49.0
Otherwise, LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
Thanks a bucnh for the review!
>> +
>> + if (PageOffline(page) && PageOfflineSkippable(page))
>> + continue;
>> +
>
> Some comment like "Skippable PageOffline() pages are not migratable but are
> skipped during memory offline" might help understand the change.
I can add a comment like for the other cases.
>
> Some thoughts after reading the related code:
>
> offline_pages() is a little convoluted, since it has two steps to make
> sure a range of memory can be offlined:
> 1. start_isolate_page_range() checks unmovable (maybe not migratable
> is more precise) pages using has_unmovable_pages(), but leaves unmovable
> PageOffline() page handling to the driver;
Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks
completely.
I was wondering for a longer time that -- with the isolate flag being a
separate bit soon :) -- we could simply isolate the whole range, and
then fail if we stumble over an unmovable page during migration. That
is, get rid of has_unmovable_pages() entirely. Un-doing the isolation
would then properly preserve the migratetype -- no harm done?
Certainly worth a look. What do you think about that?
> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle
> different types of PageOffline() pages.
Right, migrate what we can, skip these special once, and bail out on any
others (at least in this patch, patch #2 restores the pre-virtio-mem
behavior).
>
> It might make the logic cleaner if start_isolate_page_range() can
> have a callback to allow driver to handle PageOffline() pages.
We have to identify them repeadetly, so start_isolate_page_range() would
not be sufficient. Also, callbacks are rather tricky for the case where
we cannot really stabilize the page.
>
> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
> why offline_pages() takes care of them. Shouldn't virtio-mem driver
> handle them?
> I also realize that the two steps in offline_pages()> are very
similar to alloc_contig_range() and virtio-mem is using
> alloc_contig_range(). I wonder if the two steps in offline_pages()
> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity,
> pageblock size, of alloc_contig_range() used in virtio-mem, but
> both are trying to check unmovable pages and migrate movable pages.
Not sure I get completely what you mean. virtio-mem really wants to use
alloc_contig_range() to allocate ranges it wants to unplug, because it
will fail fast and allows for smaller ranges.
offline_pages() is primarily also about offlining the memory section,
which virtio-mem must do in order to remove the Linux memory block.
>
>
>> folio = page_folio(page);
>>
>> if (!folio_try_get(folio))
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a6fe1e9b95941..325b51c905076 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>> continue;
>> }
>> /*
>> - * At this point all remaining PageOffline() pages have a
>> - * reference count of 0 and can simply be skipped.
>> + * At this point all remaining PageOffline() pages must be
>> + * "skippable" and have exactly one reference.
>> */
>> if (PageOffline(page)) {
>> - BUG_ON(page_count(page));
>> - BUG_ON(PageBuddy(page));
>> + WARN_ON_ONCE(!PageOfflineSkippable(page));
>> + WARN_ON_ONCE(page_count(page) != 1);
>> already_offline++;
>> pfn++;
>> continue;
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index b2fc5266e3d26..debd898b794ea 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>> continue;
>>
>> /*
>> - * We treat all PageOffline() pages as movable when offlining
>> - * to give drivers a chance to decrement their reference count
>> - * in MEM_GOING_OFFLINE in order to indicate that these pages
>> - * can be offlined as there are no direct references anymore.
>> - * For actually unmovable PageOffline() where the driver does
>> - * not support this, we will fail later when trying to actually
>> - * move these pages that still have a reference count > 0.
>> - * (false negatives in this function only)
>> + * PageOffline() pages that are marked as "skippable"
>> + * are treated like movable pages for memory offlining.
>> */
>> - if ((flags & MEMORY_OFFLINE) && PageOffline(page))
>> + if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
>> + PageOfflineSkippable(page))
>> continue;
>
> With this change, we are no longer give non-virtio-mem driver a chance
> to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
> doing this?
The only in-tree driver making use of this so far, yes.
There is one tweak I'll have to perform in the virtio-mem driver to
cover one corner case: when force-unloading the virtio-mem driver, we
have to make sure that memory blocks with fake-offline pages cannot get
offlined (re-onlining would be bad). I'll fix that up.
--
Cheers,
David / dhildenb
On 14 May 2025, at 15:51, David Hildenbrand wrote:
> Thanks a bucnh for the review!
>
>>> +
>>> + if (PageOffline(page) && PageOfflineSkippable(page))
>>> + continue;
>>> +
>>
>> Some comment like "Skippable PageOffline() pages are not migratable but are
>> skipped during memory offline" might help understand the change.
>
> I can add a comment like for the other cases.
Thanks.
>>
>> Some thoughts after reading the related code:
>>
>> offline_pages() is a little convoluted, since it has two steps to make
>> sure a range of memory can be offlined:
>> 1. start_isolate_page_range() checks unmovable (maybe not migratable
>> is more precise) pages using has_unmovable_pages(), but leaves unmovable
>> PageOffline() page handling to the driver;
>
> Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks completely.
>
> I was wondering for a longer time that -- with the isolate flag being a separate bit soon :) -- we could simply isolate the whole range, and then fail if we stumble over
Talking about that, I will need your input on my change to move_pfn_range_to_zone()
in online_pages()[1]. Making MIGRATE_ISOLATE a separate bit means if you isolate
a pageblock without a migratetype, unisolating it will give an unpredictable
migratetype.
[1] https://lore.kernel.org/linux-mm/20250509200111.3372279-3-ziy@nvidia.com/
> an unmovable page during migration. That is, get rid of has_unmovable_pages() entirely. Un-doing the isolation would then properly preserve the migratetype -- no harm done?
>
> Certainly worth a look. What do you think about that?
In principle, the method should work and simplifies the code. But it suffers more
penalty (pages are migrated) when an unmovable page is encountered after the
isolation, since before nothing will be migrated. To mitigate this,
we probably would want some retry mechanism. For example, register a callback
to each unmovable page and once the unmovable page is deallocated,
alloc_contig_range() can move forward progress.
>
>> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle
>> different types of PageOffline() pages.
>
> Right, migrate what we can, skip these special once, and bail out on any others (at least in this patch, patch #2 restores the pre-virtio-mem behavior).
>
>>
>> It might make the logic cleaner if start_isolate_page_range() can
>> have a callback to allow driver to handle PageOffline() pages.
>
> We have to identify them repeadetly, so start_isolate_page_range() would not be sufficient. Also, callbacks are rather tricky for the case where we cannot really stabilize the page.
During start_isolate_page_range(), all pageblocks are isolated, so
free pages cannot be used by anyone else, meaning no new PageOffline()
pages or any other unmovable pages, right?
>
>>
>> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
>> why offline_pages() takes care of them. Shouldn't virtio-mem driver
>> handle them?
>> I also realize that the two steps in offline_pages()> are very similar to alloc_contig_range() and virtio-mem is using
>> alloc_contig_range(). I wonder if the two steps in offline_pages()
>> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity,
>> pageblock size, of alloc_contig_range() used in virtio-mem, but
>> both are trying to check unmovable pages and migrate movable pages.
>
> Not sure I get completely what you mean. virtio-mem really wants to use alloc_contig_range() to allocate ranges it wants to unplug, because it will fail fast and allows for smaller ranges.
>
> offline_pages() is primarily also about offlining the memory section, which virtio-mem must do in order to remove the Linux memory block.
To clarify, I mean the steps of start_isolate_page_range(), scan_movable_pages(),
do_migrate_range(), dissolve_free_hugetlb_folios() and test_pages_isolated() in
offline_pages() is very similar to the steps of start_isolate_page_range(),
__alloc_contig_migrate_range(), replace_free_hugepage_folios(),
and test_pages_isolate() in alloc_contig_range(). So I wonder if a common
function routine can be shared.
>
>>
>>
>>> folio = page_folio(page);
>>>
>>> if (!folio_try_get(folio))
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index a6fe1e9b95941..325b51c905076 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>>> continue;
>>> }
>>> /*
>>> - * At this point all remaining PageOffline() pages have a
>>> - * reference count of 0 and can simply be skipped.
>>> + * At this point all remaining PageOffline() pages must be
>>> + * "skippable" and have exactly one reference.
>>> */
>>> if (PageOffline(page)) {
>>> - BUG_ON(page_count(page));
>>> - BUG_ON(PageBuddy(page));
>>> + WARN_ON_ONCE(!PageOfflineSkippable(page));
>>> + WARN_ON_ONCE(page_count(page) != 1);
>>> already_offline++;
>>> pfn++;
>>> continue;
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index b2fc5266e3d26..debd898b794ea 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>> continue;
>>>
>>> /*
>>> - * We treat all PageOffline() pages as movable when offlining
>>> - * to give drivers a chance to decrement their reference count
>>> - * in MEM_GOING_OFFLINE in order to indicate that these pages
>>> - * can be offlined as there are no direct references anymore.
>>> - * For actually unmovable PageOffline() where the driver does
>>> - * not support this, we will fail later when trying to actually
>>> - * move these pages that still have a reference count > 0.
>>> - * (false negatives in this function only)
>>> + * PageOffline() pages that are marked as "skippable"
>>> + * are treated like movable pages for memory offlining.
>>> */
>>> - if ((flags & MEMORY_OFFLINE) && PageOffline(page))
>>> + if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
>>> + PageOfflineSkippable(page))
>>> continue;
>>
>> With this change, we are no longer give non-virtio-mem driver a chance
>> to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
>> doing this?
>
> The only in-tree driver making use of this so far, yes.
Got it. Thanks.
>
>
> There is one tweak I'll have to perform in the virtio-mem driver to cover one corner case: when force-unloading the virtio-mem driver, we have to make sure that memory blocks with fake-offline pages cannot get offlined (re-onlining would be bad). I'll fix that up.
--
Best Regards,
Yan, Zi
>>> Some thoughts after reading the related code: >>> >>> offline_pages() is a little convoluted, since it has two steps to make >>> sure a range of memory can be offlined: >>> 1. start_isolate_page_range() checks unmovable (maybe not migratable >>> is more precise) pages using has_unmovable_pages(), but leaves unmovable >>> PageOffline() page handling to the driver; >> >> Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks completely. >> >> I was wondering for a longer time that -- with the isolate flag being a separate bit soon :) -- we could simply isolate the whole range, and then fail if we stumble over > > Talking about that, I will need your input on my change to move_pfn_range_to_zone() > in online_pages()[1]. Making MIGRATE_ISOLATE a separate bit means if you isolate > a pageblock without a migratetype, unisolating it will give an unpredictable > migratetype. Sorry for my late reply on that. I think, the rule should be that you initialize the migratetype once, before any isolation+un-isolation. So that should only require care when adding new pageblocks (memory hotplug). > > [1] https://lore.kernel.org/linux-mm/20250509200111.3372279-3-ziy@nvidia.com/ > >> an unmovable page during migration. That is, get rid of has_unmovable_pages() entirely. Un-doing the isolation would then properly preserve the migratetype -- no harm done? >> >> Certainly worth a look. What do you think about that? > > In principle, the method should work and simplifies the code. But it suffers more > penalty (pages are migrated) when an unmovable page is encountered after the > isolation, since before nothing will be migrated. To mitigate this, > we probably would want some retry mechanism. For example, register a callback > to each unmovable page and once the unmovable page is deallocated, > alloc_contig_range() can move forward progress. I was wondering if we could make the isolation code a bit simpler. For example, set all involved pageblocks isolated. If any is already isolated, we can easily back off. Once all are isolated, we could do the has_unmovable_pages() on the whole range, not a single pageblock. Then we could start migrating. I think the "issue" is, once we drop the zone lock, pages that are getting freed end up on the MIGRATE_ISOLATE list -- and I think we also must have moved the free pages already to the proper MIGRATE_ISOLATE list before we drop the zone lock. So the possible cleanups might be limited. > >> >>> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle >>> different types of PageOffline() pages. >> >> Right, migrate what we can, skip these special once, and bail out on any others (at least in this patch, patch #2 restores the pre-virtio-mem behavior). >> >>> >>> It might make the logic cleaner if start_isolate_page_range() can >>> have a callback to allow driver to handle PageOffline() pages. >> >> We have to identify them repeadetly, so start_isolate_page_range() would not be sufficient. Also, callbacks are rather tricky for the case where we cannot really stabilize the page. > > During start_isolate_page_range(), all pageblocks are isolated, so > free pages cannot be used by anyone else, meaning no new PageOffline() > pages or any other unmovable pages, right? Yes, that is correct. Pages (incl. PageOffline) pages could get freed back to the buddy. But we don't want to store per-page callbacks either way. What would work is a new notifier chain (protected by RCU etc), that we can ask for each PageOffline page. Not sure I prefer that of the approach here that is significantly simpler -- and we do have the spare bit for PageOffline pages in the new memdec world (for PageOffline, we'll probably need 2/3 flags in the long run). > >> >>> >>> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder >>> why offline_pages() takes care of them. Shouldn't virtio-mem driver >>> handle them? >>> I also realize that the two steps in offline_pages()> are very similar to alloc_contig_range() and virtio-mem is using >>> alloc_contig_range(). I wonder if the two steps in offline_pages() >>> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity, >>> pageblock size, of alloc_contig_range() used in virtio-mem, but >>> both are trying to check unmovable pages and migrate movable pages. >> >> Not sure I get completely what you mean. virtio-mem really wants to use alloc_contig_range() to allocate ranges it wants to unplug, because it will fail fast and allows for smaller ranges. >> >> offline_pages() is primarily also about offlining the memory section, which virtio-mem must do in order to remove the Linux memory block. > > To clarify, I mean the steps of start_isolate_page_range(), scan_movable_pages(), > do_migrate_range(), dissolve_free_hugetlb_folios() and test_pages_isolated() in > offline_pages() is very similar to the steps of start_isolate_page_range(), > __alloc_contig_migrate_range(), replace_free_hugepage_folios(), > and test_pages_isolate() in alloc_contig_range(). So I wonder if a common > function routine can be shared. Ah, yes, I had the same idea a couple of years back, but never got to it. There are subtle differences (e.g., memory offlining keeps retrying and is allowed to dissolve free hugetlb folios), but these could likely be modified using a parameter (similar to how we already special-case offlining). Thanks! -- Cheers, David / dhildenb
© 2016 - 2026 Red Hat, Inc.