From: Hongru Zhang <zhanghongru@xiaomi.com>
Optimize /proc/pagetypeinfo access by using the per-migratetype free
block counts already maintained by the buddy allocator, instead of
iterating the free lists under the zone lock.
Accuracy: both implementations have accuracy limitations. The previous
implementation dropped and re-acquired the zone lock between counting each
order and migratetype, so the counts it reported were never mutually
consistent. Under high memory pressure, accuracy degraded further due to
zone lock contention and fragmentation. The new implementation collects
all the data within a short time window, which keeps the error relatively
small, and it is unaffected by memory pressure. Furthermore, user-space
memory management components inherently experience decision latency - by
the time they have processed the collected data and acted on it, the
memory state has already changed, so even perfectly accurate data at
collection time is stale by decision time. Given these factors, the
accuracy trade-off introduced by the new implementation should be
acceptable for practical use cases, offering a reasonable balance between
performance and accuracy.
Performance benefits:
System setup:
- 12th Gen Intel(R) Core(TM) i7-12700
- 1 NUMA node, 16G memory in total
- Turbo disabled
- cpufreq governor set to performance
1. Average latency over 10,000 /proc/pagetypeinfo accesses
+-----------------------+----------+------------+
| | no-patch | with-patch |
+-----------------------+----------+------------+
| Just after boot | 700.9 us | 268.6 us |
+-----------------------+----------+------------+
| After building kernel | 28.7 ms | 269.8 us |
+-----------------------+----------+------------+
2. Page alloc/free latency with concurrent /proc/pagetypeinfo access
Test setup:
- Using config-pagealloc-micro
- Monitor set to proc-pagetypeinfo, update frequency set to 10ms
- PAGEALLOC_ORDER_MIN=4, PAGEALLOC_ORDER_MAX=4
Without patch test results:
vanilla vanilla
no-monitor monitor
Min alloc-odr4-1 8539.00 ( 0.00%) 8762.00 ( -2.61%)
Min alloc-odr4-2 6501.00 ( 0.00%) 6683.00 ( -2.80%)
Min alloc-odr4-4 5537.00 ( 0.00%) 5873.00 ( -6.07%)
Min alloc-odr4-8 5030.00 ( 0.00%) 5361.00 ( -6.58%)
Min alloc-odr4-16 4782.00 ( 0.00%) 5162.00 ( -7.95%)
Min alloc-odr4-32 5838.00 ( 0.00%) 6499.00 ( -11.32%)
Min alloc-odr4-64 6565.00 ( 0.00%) 7413.00 ( -12.92%)
Min alloc-odr4-128 6896.00 ( 0.00%) 7898.00 ( -14.53%)
Min alloc-odr4-256 7303.00 ( 0.00%) 8163.00 ( -11.78%)
Min alloc-odr4-512 10179.00 ( 0.00%) 11985.00 ( -17.74%)
Min alloc-odr4-1024 11000.00 ( 0.00%) 12165.00 ( -10.59%)
Min free-odr4-1 820.00 ( 0.00%) 1230.00 ( -50.00%)
Min free-odr4-2 511.00 ( 0.00%) 952.00 ( -86.30%)
Min free-odr4-4 347.00 ( 0.00%) 434.00 ( -25.07%)
Min free-odr4-8 286.00 ( 0.00%) 399.00 ( -39.51%)
Min free-odr4-16 250.00 ( 0.00%) 405.00 ( -62.00%)
Min free-odr4-32 294.00 ( 0.00%) 405.00 ( -37.76%)
Min free-odr4-64 333.00 ( 0.00%) 363.00 ( -9.01%)
Min free-odr4-128 340.00 ( 0.00%) 412.00 ( -21.18%)
Min free-odr4-256 339.00 ( 0.00%) 329.00 ( 2.95%)
Min free-odr4-512 361.00 ( 0.00%) 409.00 ( -13.30%)
Min free-odr4-1024 300.00 ( 0.00%) 361.00 ( -20.33%)
Stddev alloc-odr4-1 7.29 ( 0.00%) 90.78 (-1146.00%)
Stddev alloc-odr4-2 3.87 ( 0.00%) 51.30 (-1225.75%)
Stddev alloc-odr4-4 3.20 ( 0.00%) 50.90 (-1491.24%)
Stddev alloc-odr4-8 4.67 ( 0.00%) 52.23 (-1019.35%)
Stddev alloc-odr4-16 5.72 ( 0.00%) 27.53 (-381.04%)
Stddev alloc-odr4-32 6.25 ( 0.00%) 641.23 (-10154.46%)
Stddev alloc-odr4-64 2.06 ( 0.00%) 386.99 (-18714.22%)
Stddev alloc-odr4-128 14.36 ( 0.00%) 52.39 (-264.77%)
Stddev alloc-odr4-256 32.42 ( 0.00%) 326.19 (-906.05%)
Stddev alloc-odr4-512 65.58 ( 0.00%) 184.49 (-181.31%)
Stddev alloc-odr4-1024 8.88 ( 0.00%) 153.01 (-1622.67%)
Stddev free-odr4-1 2.29 ( 0.00%) 152.27 (-6549.85%)
Stddev free-odr4-2 10.99 ( 0.00%) 73.10 (-564.89%)
Stddev free-odr4-4 1.99 ( 0.00%) 28.40 (-1324.45%)
Stddev free-odr4-8 2.51 ( 0.00%) 52.93 (-2007.64%)
Stddev free-odr4-16 2.85 ( 0.00%) 26.04 (-814.88%)
Stddev free-odr4-32 4.04 ( 0.00%) 27.05 (-569.79%)
Stddev free-odr4-64 2.10 ( 0.00%) 48.07 (-2185.66%)
Stddev free-odr4-128 2.63 ( 0.00%) 26.23 (-897.86%)
Stddev free-odr4-256 6.29 ( 0.00%) 37.04 (-488.71%)
Stddev free-odr4-512 2.56 ( 0.00%) 10.65 (-315.28%)
Stddev free-odr4-1024 0.95 ( 0.00%) 6.46 (-582.22%)
Max alloc-odr4-1 8564.00 ( 0.00%) 9099.00 ( -6.25%)
Max alloc-odr4-2 6511.00 ( 0.00%) 6844.00 ( -5.11%)
Max alloc-odr4-4 5549.00 ( 0.00%) 6038.00 ( -8.81%)
Max alloc-odr4-8 5045.00 ( 0.00%) 5551.00 ( -10.03%)
Max alloc-odr4-16 4800.00 ( 0.00%) 5257.00 ( -9.52%)
Max alloc-odr4-32 5861.00 ( 0.00%) 8115.00 ( -38.46%)
Max alloc-odr4-64 6571.00 ( 0.00%) 8292.00 ( -26.19%)
Max alloc-odr4-128 6930.00 ( 0.00%) 8081.00 ( -16.61%)
Max alloc-odr4-256 7372.00 ( 0.00%) 9150.00 ( -24.12%)
Max alloc-odr4-512 10333.00 ( 0.00%) 12636.00 ( -22.29%)
Max alloc-odr4-1024 11035.00 ( 0.00%) 12590.00 ( -14.09%)
Max free-odr4-1 828.00 ( 0.00%) 1724.00 (-108.21%)
Max free-odr4-2 543.00 ( 0.00%) 1192.00 (-119.52%)
Max free-odr4-4 354.00 ( 0.00%) 519.00 ( -46.61%)
Max free-odr4-8 293.00 ( 0.00%) 617.00 (-110.58%)
Max free-odr4-16 260.00 ( 0.00%) 483.00 ( -85.77%)
Max free-odr4-32 308.00 ( 0.00%) 488.00 ( -58.44%)
Max free-odr4-64 341.00 ( 0.00%) 505.00 ( -48.09%)
Max free-odr4-128 346.00 ( 0.00%) 497.00 ( -43.64%)
Max free-odr4-256 353.00 ( 0.00%) 463.00 ( -31.16%)
Max free-odr4-512 367.00 ( 0.00%) 442.00 ( -20.44%)
Max free-odr4-1024 303.00 ( 0.00%) 381.00 ( -25.74%)
With patch test results:
patched patched
no-monitor monitor
Min alloc-odr4-1 8488.00 ( 0.00%) 8514.00 ( -0.31%)
Min alloc-odr4-2 6551.00 ( 0.00%) 6527.00 ( 0.37%)
Min alloc-odr4-4 5536.00 ( 0.00%) 5591.00 ( -0.99%)
Min alloc-odr4-8 5008.00 ( 0.00%) 5098.00 ( -1.80%)
Min alloc-odr4-16 4760.00 ( 0.00%) 4857.00 ( -2.04%)
Min alloc-odr4-32 5827.00 ( 0.00%) 5919.00 ( -1.58%)
Min alloc-odr4-64 6561.00 ( 0.00%) 6680.00 ( -1.81%)
Min alloc-odr4-128 6898.00 ( 0.00%) 7014.00 ( -1.68%)
Min alloc-odr4-256 7311.00 ( 0.00%) 7464.00 ( -2.09%)
Min alloc-odr4-512 10181.00 ( 0.00%) 10286.00 ( -1.03%)
Min alloc-odr4-1024 11205.00 ( 0.00%) 11725.00 ( -4.64%)
Min free-odr4-1 789.00 ( 0.00%) 867.00 ( -9.89%)
Min free-odr4-2 490.00 ( 0.00%) 526.00 ( -7.35%)
Min free-odr4-4 350.00 ( 0.00%) 360.00 ( -2.86%)
Min free-odr4-8 272.00 ( 0.00%) 287.00 ( -5.51%)
Min free-odr4-16 247.00 ( 0.00%) 254.00 ( -2.83%)
Min free-odr4-32 298.00 ( 0.00%) 304.00 ( -2.01%)
Min free-odr4-64 334.00 ( 0.00%) 325.00 ( 2.69%)
Min free-odr4-128 334.00 ( 0.00%) 329.00 ( 1.50%)
Min free-odr4-256 336.00 ( 0.00%) 336.00 ( 0.00%)
Min free-odr4-512 360.00 ( 0.00%) 342.00 ( 5.00%)
Min free-odr4-1024 327.00 ( 0.00%) 355.00 ( -8.56%)
Stddev alloc-odr4-1 5.19 ( 0.00%) 45.38 (-775.09%)
Stddev alloc-odr4-2 6.99 ( 0.00%) 37.63 (-437.98%)
Stddev alloc-odr4-4 3.91 ( 0.00%) 17.85 (-356.28%)
Stddev alloc-odr4-8 5.15 ( 0.00%) 9.34 ( -81.47%)
Stddev alloc-odr4-16 3.83 ( 0.00%) 5.34 ( -39.34%)
Stddev alloc-odr4-32 1.96 ( 0.00%) 10.28 (-425.09%)
Stddev alloc-odr4-64 1.32 ( 0.00%) 333.30 (-25141.39%)
Stddev alloc-odr4-128 2.06 ( 0.00%) 7.37 (-258.28%)
Stddev alloc-odr4-256 15.56 ( 0.00%) 113.48 (-629.25%)
Stddev alloc-odr4-512 61.25 ( 0.00%) 165.09 (-169.53%)
Stddev alloc-odr4-1024 18.89 ( 0.00%) 2.93 ( 84.51%)
Stddev free-odr4-1 4.45 ( 0.00%) 40.12 (-800.98%)
Stddev free-odr4-2 1.50 ( 0.00%) 29.30 (-1850.31%)
Stddev free-odr4-4 1.27 ( 0.00%) 19.49 (-1439.40%)
Stddev free-odr4-8 0.97 ( 0.00%) 8.93 (-823.07%)
Stddev free-odr4-16 8.38 ( 0.00%) 4.51 ( 46.21%)
Stddev free-odr4-32 3.18 ( 0.00%) 6.59 (-107.42%)
Stddev free-odr4-64 2.40 ( 0.00%) 3.09 ( -28.50%)
Stddev free-odr4-128 1.55 ( 0.00%) 2.53 ( -62.92%)
Stddev free-odr4-256 0.41 ( 0.00%) 2.80 (-585.57%)
Stddev free-odr4-512 1.60 ( 0.00%) 4.84 (-202.08%)
Stddev free-odr4-1024 0.66 ( 0.00%) 1.19 ( -80.68%)
Max alloc-odr4-1 8505.00 ( 0.00%) 8676.00 ( -2.01%)
Max alloc-odr4-2 6572.00 ( 0.00%) 6651.00 ( -1.20%)
Max alloc-odr4-4 5552.00 ( 0.00%) 5646.00 ( -1.69%)
Max alloc-odr4-8 5024.00 ( 0.00%) 5131.00 ( -2.13%)
Max alloc-odr4-16 4774.00 ( 0.00%) 4875.00 ( -2.12%)
Max alloc-odr4-32 5834.00 ( 0.00%) 5950.00 ( -1.99%)
Max alloc-odr4-64 6565.00 ( 0.00%) 7434.00 ( -13.24%)
Max alloc-odr4-128 6907.00 ( 0.00%) 7034.00 ( -1.84%)
Max alloc-odr4-256 7347.00 ( 0.00%) 7843.00 ( -6.75%)
Max alloc-odr4-512 10315.00 ( 0.00%) 10866.00 ( -5.34%)
Max alloc-odr4-1024 11278.00 ( 0.00%) 11733.00 ( -4.03%)
Max free-odr4-1 803.00 ( 0.00%) 1009.00 ( -25.65%)
Max free-odr4-2 495.00 ( 0.00%) 607.00 ( -22.63%)
Max free-odr4-4 354.00 ( 0.00%) 417.00 ( -17.80%)
Max free-odr4-8 275.00 ( 0.00%) 313.00 ( -13.82%)
Max free-odr4-16 273.00 ( 0.00%) 272.00 ( 0.37%)
Max free-odr4-32 309.00 ( 0.00%) 324.00 ( -4.85%)
Max free-odr4-64 340.00 ( 0.00%) 335.00 ( 1.47%)
Max free-odr4-128 340.00 ( 0.00%) 338.00 ( 0.59%)
Max free-odr4-256 338.00 ( 0.00%) 346.00 ( -2.37%)
Max free-odr4-512 364.00 ( 0.00%) 359.00 ( 1.37%)
Max free-odr4-1024 329.00 ( 0.00%) 359.00 ( -9.12%)
Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
---
mm/page_alloc.c | 10 ++++++----
mm/vmstat.c | 30 +++++++-----------------------
2 files changed, 13 insertions(+), 27 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9431073e7255..a90f2bf735f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -818,7 +818,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
else
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
- area->mt_nr_free[migratetype]++;
+ WRITE_ONCE(area->mt_nr_free[migratetype],
+ area->mt_nr_free[migratetype] + 1);
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
@@ -841,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- area->mt_nr_free[old_mt]--;
- area->mt_nr_free[new_mt]++;
+ WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
+ WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
account_freepages(zone, -nr_pages, old_mt);
account_freepages(zone, nr_pages, new_mt);
@@ -873,7 +874,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
__ClearPageBuddy(page);
set_page_private(page, 0);
area->nr_free--;
- area->mt_nr_free[migratetype]--;
+ WRITE_ONCE(area->mt_nr_free[migratetype],
+ area->mt_nr_free[migratetype] - 1);
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..9334bbbe1e16 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
zone->name,
migratetype_names[mtype]);
for (order = 0; order < NR_PAGE_ORDERS; ++order) {
- unsigned long freecount = 0;
- struct free_area *area;
- struct list_head *curr;
+ unsigned long freecount;
bool overflow = false;
- area = &(zone->free_area[order]);
-
- list_for_each(curr, &area->free_list[mtype]) {
- /*
- * Cap the free_list iteration because it might
- * be really large and we are under a spinlock
- * so a long time spent here could trigger a
- * hard lockup detector. Anyway this is a
- * debugging tool so knowing there is a handful
- * of pages of this order should be more than
- * sufficient.
- */
- if (++freecount >= 100000) {
- overflow = true;
- break;
- }
+ /* Keep the same output format for user-space tools compatibility */
+ freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+ if (freecount >= 100000) {
+ overflow = true;
+ freecount = 100000;
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
- spin_unlock_irq(&zone->lock);
- cond_resched();
- spin_lock_irq(&zone->lock);
}
seq_putc(m, '\n');
}
@@ -1633,7 +1617,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
seq_printf(m, "%6d ", order);
seq_putc(m, '\n');
- walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print);
+ walk_zones_in_node(m, pgdat, true, true, pagetypeinfo_showfree_print);
}
static void pagetypeinfo_showblockcount_print(struct seq_file *m,
--
2.43.0
Hi, Hongru
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9431073e7255..a90f2bf735f6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -818,7 +818,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
> else
> list_add(&page->buddy_list, &area->free_list[migratetype]);
> area->nr_free++;
> - area->mt_nr_free[migratetype]++;
> + WRITE_ONCE(area->mt_nr_free[migratetype],
> + area->mt_nr_free[migratetype] + 1);
>
> if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
> @@ -841,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> get_pageblock_migratetype(page), old_mt, nr_pages);
>
> list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
> - area->mt_nr_free[old_mt]--;
> - area->mt_nr_free[new_mt]++;
> + WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
> + WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
>
> account_freepages(zone, -nr_pages, old_mt);
> account_freepages(zone, nr_pages, new_mt);
> @@ -873,7 +874,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
> __ClearPageBuddy(page);
> set_page_private(page, 0);
> area->nr_free--;
> - area->mt_nr_free[migratetype]--;
> + WRITE_ONCE(area->mt_nr_free[migratetype],
> + area->mt_nr_free[migratetype] - 1);
It doesn't seem like a good idea to use WRITE_ONCE on the hot path.
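To illustrate the concern (a hypothetical reduction, not code from the
patch; "mt" is just a stand-in index): WRITE_ONCE() expands to a volatile
store, so the compiler can no longer cache the counter in a register or
merge adjacent updates the way it can with plain accesses:

	/* Plain accesses: the compiler may keep the counter in a
	 * register and merge consecutive updates. */
	area->mt_nr_free[mt]++;
	area->mt_nr_free[mt]++;		/* may fold into one "+= 2" store */

	/* WRITE_ONCE(): each update is a volatile store and must be
	 * emitted separately, in order. */
	WRITE_ONCE(area->mt_nr_free[mt], area->mt_nr_free[mt] + 1);
	WRITE_ONCE(area->mt_nr_free[mt], area->mt_nr_free[mt] + 1);

On a path as hot as the buddy allocator, giving up those optimizations
has a cost even though each individual access is cheap.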
>
> if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index bb09c032eecf..9334bbbe1e16 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> zone->name,
> migratetype_names[mtype]);
> for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> - unsigned long freecount = 0;
> - struct free_area *area;
> - struct list_head *curr;
> + unsigned long freecount;
> bool overflow = false;
>
> - area = &(zone->free_area[order]);
> -
> - list_for_each(curr, &area->free_list[mtype]) {
> - /*
> - * Cap the free_list iteration because it might
> - * be really large and we are under a spinlock
> - * so a long time spent here could trigger a
> - * hard lockup detector. Anyway this is a
> - * debugging tool so knowing there is a handful
> - * of pages of this order should be more than
> - * sufficient.
> - */
> - if (++freecount >= 100000) {
> - overflow = true;
> - break;
> - }
> + /* Keep the same output format for user-space tools compatibility */
> + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
I think it might be better to use an array of size NR_PAGE_ORDERS to store
the free count for each order, like the code below.
	unsigned long freecount[NR_PAGE_ORDERS];

	spin_lock_irq(&zone->lock);
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		freecount[order] = zone->free_area[order].mt_nr_free[mtype];
	spin_unlock_irq(&zone->lock);

	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		seq_printf(m, "%6lu ", freecount[order]);
> + if (freecount >= 100000) {
> + overflow = true;
> + freecount = 100000;
> }
> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> - spin_unlock_irq(&zone->lock);
> - cond_resched();
> - spin_lock_irq(&zone->lock);
> }
> seq_putc(m, '\n');
> }
> @@ -1633,7 +1617,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
> seq_printf(m, "%6d ", order);
> seq_putc(m, '\n');
>
> - walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print);
> + walk_zones_in_node(m, pgdat, true, true, pagetypeinfo_showfree_print);
> }
> > if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> > __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index bb09c032eecf..9334bbbe1e16 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > zone->name,
> > migratetype_names[mtype]);
> > for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> > - unsigned long freecount = 0;
> > - struct free_area *area;
> > - struct list_head *curr;
> > + unsigned long freecount;
> > bool overflow = false;
> >
> > - area = &(zone->free_area[order]);
> > -
> > - list_for_each(curr, &area->free_list[mtype]) {
> > - /*
> > - * Cap the free_list iteration because it might
> > - * be really large and we are under a spinlock
> > - * so a long time spent here could trigger a
> > - * hard lockup detector. Anyway this is a
> > - * debugging tool so knowing there is a handful
> > - * of pages of this order should be more than
> > - * sufficient.
> > - */
> > - if (++freecount >= 100000) {
> > - overflow = true;
> > - break;
> > - }
> > + /* Keep the same output format for user-space tools compatibility */
> > + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
>
> I think it might be better to use an array of size NR_PAGE_ORDERS to store
> the free count for each order, like the code below.
Right. If we want the freecount to accurately reflect the current system
state, we still need to take the zone lock.
Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
correctness. They may ensure single-copy atomicity per access, but not for the
overall result.
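To make that concrete, consider a hypothetical interleaving (illustrative
only; n and m are the counters' prior values) in which every access is
tear-free, yet the aggregate the reader computes is inconsistent:

	CPU0: move_to_free_list()              CPU1: pagetypeinfo reader

	WRITE_ONCE(mt_nr_free[old_mt], n - 1);
	                                       READ_ONCE(mt_nr_free[old_mt]); /* n - 1 */
	                                       READ_ONCE(mt_nr_free[new_mt]); /* m, not m + 1 */
	WRITE_ONCE(mt_nr_free[new_mt], m + 1);

Each value the reader sees was valid at some instant, but their sum is
transiently one block short of area->nr_free.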
Thanks
Barry
> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.
Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitations:
"Accuracy: both implementations have accuracy limitations. The previous
implementation dropped and re-acquired the zone lock between counting each
order and migratetype, so the counts it reported were never mutually
consistent. Under high memory pressure, accuracy degraded further due to
zone lock contention and fragmentation. The new implementation collects
all the data within a short time window, which keeps the error relatively
small, and it is unaffected by memory pressure. Furthermore, user-space
memory management components inherently experience decision latency - by
the time they have processed the collected data and acted on it, the
memory state has already changed, so even perfectly accurate data at
collection time is stale by decision time. Given these factors, the
accuracy trade-off introduced by the new implementation should be
acceptable for practical use cases, offering a reasonable balance between
performance and accuracy."
Additional data:
1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
calls is 4.67 us
2. the average latency drops to 125 ns if seq_printf() is taken out of the loop
Example code:
+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ ktime_t start;
+ u64 lat;
+ unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+ start = ktime_get();
+ for (order = 0; order < NR_PAGE_ORDERS; ++order)
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+ lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+ total_count++;
+ total_lat += lat;
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
bool overflow = false;
/* Keep the same output format for user-space tools compatibility */
- freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+ freecount = freecounts[order][mtype];
if (freecount >= 100000) {
overflow = true;
freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
#endif /* CONFIG_PAGE_OWNER */
}
I think both stay within a small time window (and with IRQs disabled, the
latency is more deterministic).
> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.
I know this does not guarantee correctness of the overall result.
READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
store tearing and read tearing caused by compiler optimizations.
In fact, I have already noticed /proc/buddyinfo, which collects data under
zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
we could remove its zone lock as well, for the same reasons as
/proc/pagetypeinfo.
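For reference, the buddyinfo read side looks roughly like this (I am
paraphrasing frag_show_print() in mm/vmstat.c from memory, so treat the
details as approximate):

	static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
						struct zone *zone)
	{
		int order;

		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
		for (order = 0; order < NR_PAGE_ORDERS; ++order)
			/*
			 * nr_free is only used for printing here; data_race()
			 * annotates the read for KCSAN.
			 */
			seq_printf(m, "%6lu ", data_race(zone->free_area[order].nr_free));
		seq_putc(m, '\n');
	}

If the data_race() annotation is already considered acceptable there,
dropping the zone lock for this read-only path seems plausible too.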
On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has accuracy
> limitations:
>
> "Accuracy: both implementations have accuracy limitations. The previous
> implementation dropped and re-acquired the zone lock between counting each
> order and migratetype, so the counts it reported were never mutually
> consistent. Under high memory pressure, accuracy degraded further due to
> zone lock contention and fragmentation. The new implementation collects
> all the data within a short time window, which keeps the error relatively
> small, and it is unaffected by memory pressure. Furthermore, user-space
> memory management components inherently experience decision latency - by
> the time they have processed the collected data and acted on it, the
> memory state has already changed, so even perfectly accurate data at
> collection time is stale by decision time. Given these factors, the
> accuracy trade-off introduced by the new implementation should be
> acceptable for practical use cases, offering a reasonable balance between
> performance and accuracy."
>
> Additional data:
> 1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
> calls is 4.67 us
>
> 2. the average latency drops to 125 ns if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + ktime_t start;
> + u64 lat;
> + unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> + start = ktime_get();
> + for (order = 0; order < NR_PAGE_ORDERS; ++order)
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> + freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> + lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> + total_count++;
> + total_lat += lat;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> bool overflow = false;
>
> /* Keep the same output format for user-space tools compatibility */
> - freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> + freecount = freecounts[order][mtype];
> if (freecount >= 100000) {
> overflow = true;
> freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
> #endif /* CONFIG_PAGE_OWNER */
> }
>
> I think both stay within a small time window (and with IRQs disabled, the
> latency is more deterministic).
>
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.
Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.
>
> In fact, I have already noticed /proc/buddyinfo, which collects data under
> zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
> we could remove its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.
That might be correct. However, given that it doesn't significantly affect
performance and buddyinfo is read far less often than the buddy lists are
modified, we may just leave it as is.
[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/
Thanks
Barry
On Sat, Nov 29, 2025 at 8:00 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > > if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> > > __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index bb09c032eecf..9334bbbe1e16 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > zone->name,
> > > migratetype_names[mtype]);
> > > for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> > > - unsigned long freecount = 0;
> > > - struct free_area *area;
> > > - struct list_head *curr;
> > > + unsigned long freecount;
> > > bool overflow = false;
> > >
> > > - area = &(zone->free_area[order]);
> > > -
> > > - list_for_each(curr, &area->free_list[mtype]) {
> > > - /*
> > > - * Cap the free_list iteration because it might
> > > - * be really large and we are under a spinlock
> > > - * so a long time spent here could trigger a
> > > - * hard lockup detector. Anyway this is a
> > > - * debugging tool so knowing there is a handful
> > > - * of pages of this order should be more than
> > > - * sufficient.
> > > - */
> > > - if (++freecount >= 100000) {
> > > - overflow = true;
> > > - break;
> > > - }
> > > + /* Keep the same output format for user-space tools compatibility */
> > > + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> >
> > I think it might be better to use an array of size NR_PAGE_ORDERS to store
> > the free count for each order, like the code below.
>
> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.
>
> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.
On second thought, the original code releases and re-acquires the spinlock
for each order, so cross-variable consistency may not be a real issue.
Adding data_race() to silence KCSAN warnings should be sufficient?
I mean something like the following.
@@ -843,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 			get_pageblock_migratetype(page), old_mt, nr_pages);
 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
-	WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
-	WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
+	area->mt_nr_free[old_mt]--;
+	area->mt_nr_free[new_mt]++;
 	account_freepages(zone, -nr_pages, old_mt);
 	account_freepages(zone, nr_pages, new_mt);
@@ -875,8 +874,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	area->nr_free--;
-	WRITE_ONCE(area->mt_nr_free[migratetype],
-		   area->mt_nr_free[migratetype] - 1);
+	area->mt_nr_free[migratetype]--;
 	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7e1e931eb209..d74004eb8c4d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1599,7 +1599,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			bool overflow = false;
 			/* Keep the same output format for user-space tools compatibility */
-			freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+			freecount = data_race(zone->free_area[order].mt_nr_free[mtype]);
 			if (freecount >= 100000) {
 				overflow = true;
 				freecount = 100000;
Thanks
Barry