Introduce a page-table-like data structure for tracking preserved
memory pages, which will replace the current xarray-based
implementation.
The primary motivation for this change is to eliminate the need for
serialization. By marking preserved pages directly in these new tables
and passing them to the next kernel, the entire serialization process
can be removed. This ultimately allows for the removal of the KHO
finalize and abort states, simplifying the overall design.
The new KHO page table is a hierarchical structure that maps physical
addresses to preservation metadata. It begins with a root
`kho_order_table` that contains an entry for each page order. Each
entry points to a separate, multi-level tree of `kho_page_table`s that
splits a physical address into indices. The traversal terminates at a
`kho_bitmap_table`, where each bit represents a single preserved page.
This commit adds the core data structures for this hierarchy:
- kho_order_table: The root table, indexed by page order.
- kho_page_table: Intermediate-level tables.
- kho_bitmap_table: The lowest-level table where individual pages
are marked.
The new functions are not yet integrated with the public
`kho_preserve_*` APIs and are marked `__maybe_unused`. The full
integration and the removal of the old xarray code will follow in a
subsequent commit.
Signed-off-by: Jason Miu <jasonmiu@google.com>
---
kernel/kexec_handover.c | 344 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 344 insertions(+)
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index ecd1ac210dbd..0daed51c8fb7 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -46,6 +46,350 @@ static int __init kho_parse_enable(char *p)
}
early_param("kho", kho_parse_enable);
+/*
+ * KHO page tables provide a page-table-like data structure for tracking
+ * preserved memory pages. It is a hierarchical structure that starts with a
+ * `struct kho_order_table`. Each entry in this table points to the root of a
+ * `struct kho_page_table` tree, which tracks the preserved memory pages for a
+ * specific page order.
+ *
+ * Each entry in a `struct kho_page_table` points to the next level page table,
+ * until level 2, which points to a `struct kho_bitmap_table`. The lowest level
+ * (level 1) is a bitmap table where each bit represents a preserved page.
+ *
+ * The table hierarchy is shown below.
+ *
+ * kho_order_table
+ * +-------------------------------+--------------------+
+ * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
+ * ++------------------------------+--------------------+
+ * |
+ * |
+ * v
+ * ++------+
+ * | Lv6 | kho_page_table
+ * ++------+
+ * |
+ * |
+ * | +-------+
+ * +-> | Lv5 | kho_page_table
+ * ++------+
+ * |
+ * |
+ * | +-------+
+ * +-> | Lv4 | kho_page_table
+ * ++------+
+ * |
+ * |
+ * | +-------+
+ * +-> | Lv3 | kho_page_table
+ * ++------+
+ * |
+ * |
+ * | +-------+
+ * +> | Lv2 | kho_page_table
+ * ++------+
+ * |
+ * |
+ * | +-------+
+ * +-> | Lv1 | kho_bitmap_table
+ * +-------+
+ *
+ * The depth of the KHO page tables depends on the system's page size and the
+ * page order. Both larger page sizes and higher page orders result in
+ * shallower KHO page tables. For example, on a system with a 4KB native
+ * page size, 0-order tables have a depth of 6 levels.
+ *
+ * The following diagram illustrates how a physical address is split into
+ * indices for the different KHO page table levels and the final bitmap.
+ *
+ *     63     62:54    53:45    44:36    35:27         26:0
+ * +--------+--------+--------+--------+--------+-----------------+
+ * |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
+ * +--------+--------+--------+--------+--------+-----------------+
+ *
+ * For higher order pages, the bit fields for each level shift to the left by
+ * the page order.
+ *
+ * Each KHO page table and bitmap table is PAGE_SIZE in size. For 0-order
+ * pages, the bitmap table contains (PAGE_SIZE * 8) bits, covering a
+ * (PAGE_SIZE * 8 * PAGE_SIZE) memory range. For example, on a system with a
+ * 4KB native page size, the bitmap table contains 32768 bits and covers a
+ * 128MB memory range.
+ *
+ * Each KHO page table contains (PAGE_SIZE / 8) entries, where each entry is a
+ * descriptor (a physical address) pointing to the next level table.
+ * For example, with a 4KB page size, each page table holds 512 entries.
+ * The level 2 KHO page table is an exception, where each entry points to a
+ * KHO bitmap table instead.
+ *
+ * An example entry of a KHO page table on a 4KB-page system is shown
+ * below.
+ *
+ * 63:12 11:0
+ * +------------------------------+--------------+
+ * | descriptor to next table | zeros |
+ * +------------------------------+--------------+
+ */
+
+#define BITMAP_TABLE_SHIFT(_order) (PAGE_SHIFT + PAGE_SHIFT + 3 + (_order))
+#define BITMAP_TABLE_MASK(_order) ((1ULL << BITMAP_TABLE_SHIFT(_order)) - 1)
+#define PRESERVED_PAGE_OFFSET_SHIFT(_order) (PAGE_SHIFT + (_order))
+#define PAGE_TABLE_SHIFT_PER_LEVEL (ilog2(PAGE_SIZE / sizeof(unsigned long)))
+#define PAGE_TABLE_LEVEL_MASK ((1ULL << PAGE_TABLE_SHIFT_PER_LEVEL) - 1)
+#define PTR_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
+
+typedef int (*kho_walk_callback_t)(phys_addr_t pa, int order);
+
+struct kho_bitmap_table {
+ unsigned long bitmaps[PAGE_SIZE / sizeof(unsigned long)];
+};
+
+struct kho_page_table {
+ unsigned long tables[PTR_PER_LEVEL];
+};
+
+struct kho_order_table {
+ unsigned long orders[HUGETLB_PAGE_ORDER + 1];
+};
+
+/*
+ * `kho_order_table` points to a page that serves as the root of the KHO page
+ * table hierarchy. This page is allocated during KHO module initialization.
+ * Its physical address is written to the FDT and passed to the next kernel
+ * during kexec.
+ */
+static struct kho_order_table *kho_order_table;
+
+static unsigned long kho_page_table_level_shift(int level, int order)
+{
+ /*
+ * Calculate the cumulative bit shift required to extract the page table
+ * index for a given physical address at a specific `level` and `order`.
+ *
+ * - Level 1 is the bitmap table, which has its own indexing logic, so
+ * the shift is 0.
+ * - Level 2 and above: The base shift is `BITMAP_TABLE_SHIFT(order)`,
+ * which corresponds to the entire address space covered by a single
+ * level 1 bitmap table.
+ * - Each subsequent level adds `PAGE_TABLE_SHIFT_PER_LEVEL` to the
+ * total shift amount.
+ */
+ return level <= 1 ? 0 :
+ BITMAP_TABLE_SHIFT(order) + PAGE_TABLE_SHIFT_PER_LEVEL * (level - 2);
+}
+
+static int kho_get_bitmap_table_index(unsigned long pa, int order)
+{
+ /* 4KB page (12 addr bits) + 64 bits per bitmap entry (6 addr bits) + order */
+ unsigned long idx = pa >> (PAGE_SHIFT + 6 + order);
+
+ return idx;
+}
+
+static int kho_get_page_table_index(unsigned long pa, int order, int level)
+{
+ unsigned long high_addr;
+ unsigned long page_table_offset;
+ unsigned long shift;
+
+ if (level == 1)
+ return kho_get_bitmap_table_index(pa, order);
+
+ shift = kho_page_table_level_shift(level, order);
+ high_addr = pa >> shift;
+
+ page_table_offset = high_addr & PAGE_TABLE_LEVEL_MASK;
+ return page_table_offset;
+}
+
+static int kho_table_level(int order)
+{
+ unsigned long bits_to_resolve;
+ int page_table_num;
+
+ /* We just need 1 bitmap table to cover all addresses */
+ if (BITMAP_TABLE_SHIFT(order) >= 64)
+ return 1;
+
+ bits_to_resolve = 64 - BITMAP_TABLE_SHIFT(order);
+
+ /*
+ * The number of page table levels needed is the bits left to resolve
+ * divided by the bits one page table can resolve, rounded up with
+ * ceil(a/b) = (a + b - 1) / b. The total depth is all page table
+ * levels plus the bottom bitmap level.
+ */
+ page_table_num = (bits_to_resolve + PAGE_TABLE_SHIFT_PER_LEVEL - 1)
+ / PAGE_TABLE_SHIFT_PER_LEVEL;
+ return page_table_num + 1;
+}
+
+static struct kho_page_table *kho_alloc_page_table(void)
+{
+ return (struct kho_page_table *)get_zeroed_page(GFP_KERNEL);
+}
+
+static void kho_set_preserved_page_bit(struct kho_bitmap_table *bitmap_table,
+ unsigned long pa, int order)
+{
+ int bitmap_table_index = kho_get_bitmap_table_index(pa, order);
+ int offset;
+
+ /* Get the bit offset in a 64bits bitmap entry */
+ offset = (pa >> PRESERVED_PAGE_OFFSET_SHIFT(order)) & 0x3f;
+
+ set_bit(offset,
+ (unsigned long *)&bitmap_table->bitmaps[bitmap_table_index]);
+}
+
+static unsigned long kho_pgt_desc(struct kho_page_table *va)
+{
+ return (unsigned long)virt_to_phys(va);
+}
+
+static struct kho_page_table *kho_page_table(unsigned long desc)
+{
+ return (struct kho_page_table *)phys_to_virt(desc);
+}
+
+static int __kho_preserve_page_table(unsigned long pa, int order)
+{
+ int num_table_level = kho_table_level(order);
+ struct kho_page_table *cur;
+ struct kho_page_table *next;
+ struct kho_bitmap_table *bitmap_table;
+ int i, page_table_index;
+ unsigned long page_table_desc;
+
+ if (!kho_order_table->orders[order]) {
+ cur = kho_alloc_page_table();
+ if (!cur)
+ return -ENOMEM;
+ page_table_desc = kho_pgt_desc(cur);
+ kho_order_table->orders[order] = page_table_desc;
+ }
+
+ cur = kho_page_table(kho_order_table->orders[order]);
+
+ /* Go from high level tables to low level tables */
+ for (i = num_table_level; i > 1; i--) {
+ page_table_index = kho_get_page_table_index(pa, order, i);
+
+ if (!cur->tables[page_table_index]) {
+ next = kho_alloc_page_table();
+ if (!next)
+ return -ENOMEM;
+ cur->tables[page_table_index] = kho_pgt_desc(next);
+ } else {
+ next = kho_page_table(cur->tables[page_table_index]);
+ }
+
+ cur = next;
+ }
+
+ /* Cur is now pointing to the level 1 bitmap table */
+ bitmap_table = (struct kho_bitmap_table *)cur;
+ kho_set_preserved_page_bit(bitmap_table,
+ pa & BITMAP_TABLE_MASK(order),
+ order);
+
+ return 0;
+}
+
+/*
+ * TODO: __maybe_unused is added to the functions:
+ * kho_preserve_page_table()
+ * kho_walk_page_tables()
+ * kho_memblock_reserve()
+ * since they are not actually being called in this change.
+ * __maybe_unused will be removed in the next patch.
+ */
+static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
+{
+ unsigned long pa = PFN_PHYS(pfn);
+
+ might_sleep();
+
+ return __kho_preserve_page_table(pa, order);
+}
+
+static int __kho_walk_bitmap_table(int order,
+ struct kho_bitmap_table *bitmap_table,
+ unsigned long pa,
+ kho_walk_callback_t cb)
+{
+ int i;
+ unsigned long offset;
+ int ret = 0;
+ int order_factor = 1 << order;
+ unsigned long *bitmap = (unsigned long *)bitmap_table;
+
+ for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
+ offset = (unsigned long)PAGE_SIZE * order_factor * i;
+ ret = cb(offset + pa, order);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __kho_walk_page_tables(int order, int level,
+ struct kho_page_table *cur, unsigned long pa,
+ kho_walk_callback_t cb)
+{
+ struct kho_page_table *next;
+ struct kho_bitmap_table *bitmap_table;
+ int i;
+ unsigned long offset;
+ int ret = 0;
+
+ if (level == 1) {
+ bitmap_table = (struct kho_bitmap_table *)cur;
+ return __kho_walk_bitmap_table(order, bitmap_table, pa, cb);
+ }
+
+ for (i = 0; i < PTR_PER_LEVEL; i++) {
+ if (cur->tables[i]) {
+ next = kho_page_table(cur->tables[i]);
+ offset = i;
+ offset <<= kho_page_table_level_shift(level, order);
+ ret = __kho_walk_page_tables(order, level - 1,
+ next, offset + pa, cb);
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static __maybe_unused int kho_walk_page_tables(struct kho_page_table *top, int order,
+ kho_walk_callback_t cb)
+{
+ int num_table_level;
+
+ if (top) {
+ num_table_level = kho_table_level(order);
+ return __kho_walk_page_tables(order, num_table_level, top, 0, cb);
+ }
+
+ return 0;
+}
+
+static __maybe_unused int kho_memblock_reserve(phys_addr_t pa, int order)
+{
+ int sz = 1 << (order + PAGE_SHIFT);
+ struct page *page = phys_to_page(pa);
+
+ memblock_reserve(pa, sz);
+ memblock_reserved_mark_noinit(pa, sz);
+ page->private = order;
+
+ return 0;
+}
+
/*
* Keep track of memory that is to be preserved across KHO.
*
--
2.51.0.384.g4c02a37b29-goog
On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:

> + * kho_order_table
> + * +-------------------------------+--------------------+
> + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> + * ++------------------------------+--------------------+
> + * |
> + * |
> + * v
> + * ++------+
> + * | Lv6 | kho_page_table
> + * ++------+

I seem to remember suggesting this could be simplified without the
special case 7th level table for order.

Encode the phys address as:

  (order << 51) | (phys >> (PAGE_SHIFT + order))

Then you don't need another table for order, the 64 bits encode
everything consistently. Order can't be > 52 so it is only 6 bits,
meaning the result fits into at most 57 bits.

> + *     63     62:54    53:45    44:36    35:27         26:0
> + * +--------+--------+--------+--------+--------+-----------------+
> + * |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
> + * +--------+--------+--------+--------+--------+-----------------+

This isn't quite right, the 11:0 bits must be zero and are not used to
index anything. Adjusting to reflect the above math, it would be like
this:

 63:60  59:51    50:42    41:33    32:24    23:15         14:0
+-----+--------+--------+--------+--------+--------+-----------------+
|  0  |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
+-----+--------+--------+--------+--------+--------+-----------------+

The order level is just folded into lv 6.

> + * For higher order pages, the bit fields for each level shift to the left by
> + * the page order.

This is probably an unnecessary complexity. The table levels cost only
64 bytes, it isn't so valuable to eliminate them.

So with the above math it shifts right not left. Level 1 is always the
bitmap and it doesn't move around. I'd label this 0 in the code.

If you also fix the sizes to be 64 bytes and 4096 bytes regardless of
PAGE_SIZE then everything is easy and fixed, while still efficient on
higher PAGE_SIZE architectures.

Further, changing the formula to this:

  (1 << (63 - PAGE_SHIFT - order)) | (phys >> (PAGE_SHIFT + order))

Will shift the overhead levels to the top of the radix tree and share
them across all orders, higher PAGE_SIZE arches will just get a single
lvl 5 and an unnecessary lvl 6 - cost 64 extra bytes who cares.

> +#define BITMAP_TABLE_SHIFT(_order) (PAGE_SHIFT + PAGE_SHIFT + 3 + (_order))
> +#define BITMAP_TABLE_MASK(_order) ((1ULL << BITMAP_TABLE_SHIFT(_order)) - 1)
> +#define PRESERVED_PAGE_OFFSET_SHIFT(_order) (PAGE_SHIFT + (_order))
> +#define PAGE_TABLE_SHIFT_PER_LEVEL (ilog2(PAGE_SIZE / sizeof(unsigned long)))
> +#define PAGE_TABLE_LEVEL_MASK ((1ULL << PAGE_TABLE_SHIFT_PER_LEVEL) - 1)
> +#define PTR_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))

please use inlines and enums :(

It looks like if you make the above algorithm changes most of this
code is deleted.

Jason
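A minimal C sketch of the second packing formulation suggested above,
for illustration only. The helper names (kho_encode_pos,
kho_decode_phys) are invented here, not from the patch or the thread,
and as discussed below the exact bit positions may still need
adjusting:

static inline u64 kho_encode_pos(phys_addr_t phys, unsigned int order)
{
	/*
	 * Set a single marker bit above the shifted PFN; per-order
	 * subtrees then share the upper radix levels.
	 */
	return (1ULL << (63 - PAGE_SHIFT - order)) |
	       ((u64)phys >> (PAGE_SHIFT + order));
}

static inline phys_addr_t kho_decode_phys(u64 pos, unsigned int order)
{
	/* Clear the marker bit and shift back to a byte address. */
	return (phys_addr_t)((pos & ~(1ULL << (63 - PAGE_SHIFT - order)))
			     << (PAGE_SHIFT + order));
}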
On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > + * kho_order_table
> > + * +-------------------------------+--------------------+
> > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > + * ++------------------------------+--------------------+
> > + * |
> > + * |
> > + * v
> > + * ++------+
> > + * | Lv6 | kho_page_table
> > + * ++------+
>
> I seem to remember suggesting this could be simplified without the
> special case 7th level table for order.
>
> Encode the phys address as:
>
>   (order << 51) | (phys >> (PAGE_SHIFT + order))

Why 51 and not 52? This limits us to a 63-bit address space, does it
not?

>
> Then you don't need another table for order, the 64 bits encode
> everything consistently. Order can't be > 52 so it is only 6 bits,
> meaning the result fits into at most 57 bits.
>

Hi Jason,

Nice packing. That's a really clever bit-packing scheme to create a
unified address space. I like the idea, but I'm trying to find the
benefits compared to the current per-order tree approach.

1. Packing adds a slight performance overhead for higher orders. With
the current approach, preserving higher order pages only requires a
3/4-level page table. With the bit-packing proposal we will always
have extra loads during preserve/unpreserve operations.

2. It also adds insignificant memory overhead, as extra levels will
have a couple extra pages.

3. It slightly complicates the logic in the new kernel. Instead of
simply iterating a known tree for a specific order, the boot-time
walker would need to reconstruct the per-order subtrees, and walk
them.

Perhaps I'm missing a key benefit of the unified tree? The current
approach might not be as elegant as having everything packed into the
same page table, but it seems OK to me, and easy to understand.

Pasha
On Wed, Sep 17, 2025 at 12:18:39PM -0400, Pasha Tatashin wrote:
> On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > > + * kho_order_table
> > > + * +-------------------------------+--------------------+
> > > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > > + * ++------------------------------+--------------------+
> > > + * |
> > > + * |
> > > + * v
> > > + * ++------+
> > > + * | Lv6 | kho_page_table
> > > + * ++------+
> >
> > I seem to remember suggesting this could be simplified without the
> > special case 7th level table for order.
> >
> > Encode the phys address as:
> >
> >   (order << 51) | (phys >> (PAGE_SHIFT + order))
>
> Why 51 and not 52? This limits us to a 63-bit address space, does it
> not?

Yeah, might have got the math off.

> I like the idea, but I'm trying to find the benefits compared to the
> current per-order tree approach.

It is probably about half the code compared to what I see here because
everything is aggressively simplified.

> 3. It slightly complicates the logic in the new kernel. Instead of
> simply iterating a known tree for a specific order, the boot-time
> walker would need to reconstruct the per-order subtrees, and walk
> them.

The core walker just runs over a range, it is easy to compute the
range.

Jason
Hi Jason,

On Wed, Sep 17, 2025 at 9:32 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 17, 2025 at 12:18:39PM -0400, Pasha Tatashin wrote:
> > On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > > > + * kho_order_table
> > > > + * +-------------------------------+--------------------+
> > > > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > > > + * ++------------------------------+--------------------+
> > > > + * |
> > > > + * |
> > > > + * v
> > > > + * ++------+
> > > > + * | Lv6 | kho_page_table
> > > > + * ++------+
> > >
> > > I seem to remember suggesting this could be simplified without the
> > > special case 7th level table for order.
> > >
> > > Encode the phys address as:
> > >
> > >   (order << 51) | (phys >> (PAGE_SHIFT + order))
> >
> > Why 51 and not 52? This limits us to a 63-bit address space, does it
> > not?
>
> Yeah, might have got the math off.
>
> > I like the idea, but I'm trying to find the benefits compared to the
> > current per-order tree approach.
>
> It is probably about half the code compared to what I see here because
> everything is aggressively simplified.

Thank you very much for providing feedback to me; I think this is a
very smart idea.

> > 3. It slightly complicates the logic in the new kernel. Instead of
> > simply iterating a known tree for a specific order, the boot-time
> > walker would need to reconstruct the per-order subtrees, and walk
> > them.
>
> The core walker just runs over a range, it is easy to compute the
> range.

I believe the "range" here refers to the specific portion of the tree
relevant to the `target_order` being restored, while the `target_order`
is the variable from 0 to MAX_PAGE_ORDER used in the tree walk in the
new kernel.

My current understanding of the walker for a given `target_order`:

1. Find the `start_level` from the `target_order` (for example,
   target_order = 10, start_level = 4).
2. The path from the root down to the level above `start_level` is
   fixed (index 0 at each of these levels).
3. At `start_level`, the index is also fixed, by
   (1 << (63 - PAGE_SHIFT - order)) in a 9-bit slice.
4. Then, for all levels *below* `start_level`, the walker iterates
   through all 512 table entries, until the bitmap level.

So the "range" is the subtrees under the `start_level` -- is my
understanding correct?

--
Jason Miu
> 1. Find the `start_level` from the `target_order` (for example,
>    target_order = 10, start_level = 4).
> 2. The path from the root down to the level above `start_level` is
>    fixed (index 0 at each of these levels).
> 3. At `start_level`, the index is also fixed, by
>    (1 << (63 - PAGE_SHIFT - order)) in a 9-bit slice.
> 4. Then, for all levels *below* `start_level`, the walker iterates
>    through all 512 table entries, until the bitmap level.

You don't need any special logic like that, that is my point, the
whole thing is very simple:

static int get_index(unsigned int level, u64 pos)
{
	return (pos / (level * ITEMS_PER_TABLE * ITEMS_PER_BITMAP)) %
	       ITEMS_PER_TABLE;
}

walk_table(u64 *table, unsigned int level, u64 start, u64 last)
{
	unsigned int index = get_index(level, start);
	unsigned int last_index = get_index(level, last);

	do {
		if (table[index]) {
			u64 *next_table = phys_to_virt(table[index]);

			if (level == 1)
				walk_bitmap(next_table);
			else
				walk_table(next_table, level - 1, start, last);
		}
		index++;
	} while (index <= last_index);
}

insert_table(u64 *table, unsigned int level, u64 pos)
{
	unsigned int index = get_index(level, pos);
	u64 *next_table;

	if (!table[index]) {
		// allocate table[index]
	} else
		next_table = phys_to_virt(table[index]);

	if (level == 1)
		insert_bitmap(next_table, pos);
	else
		insert_table(next_table, level - 1, pos);
}

That's it. No special cases required.

The above is very limited, it only works with certain formulations of
start/last:
  start has only one bit set
  start & last == true, last ^ start has bits 0 -> N set,
    N > log2(ITEMS_PER_BITMAP)

Which aligns with my suggestion for encoding.

Jason
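To make the start/last formulation concrete, a small sketch of how a
caller might derive the range for one order under the encoding
proposed earlier; walk_order and the top_level parameter are
illustrative names, not from the thread:

static void walk_order(u64 *root, unsigned int top_level, unsigned int order)
{
	/*
	 * Under the proposed encoding every position for this order has
	 * the single marker bit set, so the range runs from the marker
	 * bit with all lower bits clear (start) to the marker bit with
	 * all lower bits set (last), matching the constraints above.
	 */
	u64 start = 1ULL << (63 - PAGE_SHIFT - order);
	u64 last = (start << 1) - 1;

	walk_table(root, top_level, start, last);
}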